

Analysis of Semantic and Non-Semantic crawlers
Shridevi S, Shashwat Sanket, Jayraj Thakor, Dhivya M

Vellore Institute of Technology, Chennai, India

               Abstract
               A focused crawler goes through the world wide web and selects the pages that are relevant to a predefined topic while neglecting pages that are not of interest. It collects domain-specific documents and is considered one of the most important ways to gather information. However, centralized crawlers are not adequate to spider meaningful and relevant portions of the Web. A crawler that is scalable and good at load balancing can improve the overall performance. Therefore, with the number of web pages on the internet increasing day by day, distributed web crawling is of prime importance for downloading pages efficiently in terms of time and for increasing the coverage of crawlers. This paper describes different semantic and non-semantic web crawler architectures, broadly classified into non-semantic (serial, parallel and distributed) and semantic (distributed and focused) crawlers. All the aforementioned types are implemented using libraries provided by Python 3, and a comparative analysis is carried out among them. The purpose of this paper is to outline how different processes can be run in parallel and on a distributed system, and how they interact with each other using shared variables and message passing algorithms.

               Keywords
               Semantic Crawler, Serial, Parallel, Distributed, message passing, shared variables

ISIC'21: International Semantic Intelligence Conference, February 25-27, 2021, Delhi, India
EMAIL: shridevi.s@vit.ac.in (S. Shridevi); dhivya.m2019@vitstudent.ac.in (M. Dhivya)
ORCID: 0000-0002-6927-1998 (M. Dhivya)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

    A web crawler, also known as a spiderbot, is a program or automated script that downloads web pages on a large scale. Web crawlers are used in various applications and in diverse domains; in fact, web crawling is one of the factors driving the growth of the internet in domains such as marketing and e-commerce. In e-commerce, crawlers can be used for price comparison and to monitor recent market trends. Similarly, they can be used to predict stock market movements by analysing social media content, blogs and other data from different websites. Web crawlers are a primary component of web search engines, whose purpose is to collect web pages in bulk, index them and execute user-defined queries to find the relevant pages.

    A similar use is web archiving, where web pages are collected and preserved for future use. Web crawlers are also used to create replicas of visited pages, which are processed by the search engine for faster search optimization, and for statistical web data mining. In addition, web crawlers are used to collect specific information, such as harvesting spam email addresses, or to support application testing. Because the number of web pages is increasing rapidly and most of the data on the web is unstructured, semantic crawlers are used to retrieve context-relevant web pages. Semantic crawlers come in different architectures, such as distributed, parallel, focused and incremental crawlers.

    Today, web crawlers form an important part of various software services and have evolved into large-scale integrated distributed software, proving that they are not just programs maintaining a list of pages to be crawled. The web crawler is the principal and most time-demanding element of a web search engine. It consumes a huge amount of CPU time, memory and storage space to crawl through the ever-increasing and dynamic web. The time it takes to crawl the web should be as small as possible so that the search outputs remain up to date.
    Parallel and distributed processing is one way to increase the speed of the crawling process, enabled by technological advancement and improvements in hardware architectures. The present work consists of an implementation of and comparison between different web crawler architectures, namely serial, parallel and distributed. The purpose of this work is to outline how we can increase the processing capabilities of web crawlers and obtain the query output in a shorter amount of time. This paper covers in detail how different processes can be executed in parallel and on a distributed system, and how they interact with each other using shared variables and message passing algorithms.

2. Existing Work and Literature Survey

    In this section, recent works related to crawler processing are described. In "Speeding up the web crawling process on a multi-core processor using virtualization" [1], Hussein Al-Bahadili, Hamzah Qtishat, and Reyadh S. Naoum present and analyse a new approach to increasing crawler time efficiency through virtualization on a multi-core processor. In their work they divide the multi-core processor into many VMs (Virtual Machines), so that the task can be executed concurrently on different data. In addition, they describe their implementation and analysis of a VM-based distributed web crawler after rigorous testing.

    J. Cho, Hector G. and L. Page [2] describe in what sequence or order the URLs must be visited by the crawler to obtain the important pages first. This method of obtaining pages of prime importance rapidly helps to save time when a crawler is unable to go through the entire increasing and dynamically changing web. In this work they created a dataset by downloading an image of the Stanford web pages and performed experiments by modifying and using different large-scale and small-scale crawlers such as a PageRank crawler, breadth-first and depth-first search crawlers and backlink-based crawlers.

    "Google's Deep-Web Crawl" by J. Madhavan, D. Ko et al. [3] is another notable work, describing how the deep-web content used in the Google search engine is crawled. They describe a system to extract deep-web content which includes pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The entire system is based on achieving three main goals. The first goal is to develop an approach that is time saving, automatic and scalable for indexing hidden web content behind HTML forms that vary in domain and language from all over the world. The second goal is to develop two types of algorithm: one that can identify inputs that accept only specific value types, and another that accepts keywords to select input values for text search inputs. The third goal is to develop an algorithm that goes through the possible input combinations to identify and generate URLs suitable for a web search index.

    Anirban Kundu, Ruma Dutta, Rana Dattagupta, and Debajyoti Mukhopadhyay in their paper "Mining the web with hierarchical crawlers – a resource sharing based crawling approach" [4] propose an extended web crawling method to crawl the internet on behalf of a search engine. The approach is a combination of parallelism and focused crawling using multiple crawlers. The algorithm divides the entire structure of a website into many levels based on the hyperlink structure to download web pages from the website, and the number of crawlers is dynamic at each level. The number of crawlers required is determined by the demand at run time, by developing a thread-based program using the number of hyperlinks from the specific page.

    M. Sunil Kumar and P. Neelima in their work "Design and Implementation of Scalable, Fully Distributed Web Crawler for a Web Search Engine" [5] present Dcrawler, which is highly scalable and distributed. The core features of the presented crawler are decentralization of tasks, an assignment function that partitions the domain for the crawler to crawl effectively, a cooperative ability in order to work with other web servers, and platform independence. For the assignment function, Identifier-Seeded Consistent Hashing is used. On performing tests using distributed crawlers they conclude that the Dcrawler performs better than other traditional centralized crawlers and that performance can be improved further by adding more crawlers.

    T. Patidar and A. Ambasth in their paper "Improvised Architecture for Distributed Web Crawling" [6] propose reliable and efficient methods for a scalable web crawler. In addition, they discuss challenges and issues regarding web structure, job scheduling, spider traps and URL canonicalization.
The components of their proposed work include a Child Manager, Cluster Manager, Bot Manager and an incremental batch analyser for re-crawling. Their results show that they have successfully implemented a distributed crawler along with politeness techniques and selection policies, but they still face challenges such as resource utilization.

    The work "A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration" [7] by E. Dragut et al. describes an algorithm which extracts and maps query interfaces into a hierarchical representation. The algorithm is divided into four steps, namely Token Extraction, Tree of Fields, Tree of Text Tokens, and Integration, thereby turning an extraction algorithm into an integration algorithm. They carried out experiments on three different datasets (ICQ, Tel8 and WISE) and evaluated the algorithm based on performance metrics such as leaf labelling, schema tree structure and a gold standard.

    D. H. P. Chau, S. Pandit, S. Wang, and C. Faloutsos describe parallel crawling for online social networks by illustrating it on an online auction website in their work "Parallel Crawling for Online Social Networks" [8]. They use a dynamic assignment architecture which ensures that the failure of one crawler does not affect another crawler and that there is no redundant crawling. They visited about 11 million users, out of which approximately 66,000 were completely crawled. J. Cho and H. Garcia-Molina [9] proposed different architectures for parallel web crawlers, metrics to evaluate the performance of parallel web crawlers, and the issues related to parallel crawling. They described issues such as overlap, quality and communication bandwidth, and advantages of parallel crawling such as scalability, network-load dispersion and network-load reduction.

    C. C. Aggarwal, F. Al-Garawi, and P. S. Yu in their work "Intelligent crawling on the world wide web with arbitrary predicates" [10] describe intelligent crawling as a method that learns properties and features of the linkage structure of the WWW while crawling. The technique proposed by them is more generalized than focused crawling, which is based on a pre-defined structure of the web. The intelligent crawling they describe is applicable to web pages that support arbitrary user-defined topical and keyword queries, and it is capable of reusing the information gained from a previous crawl in order to crawl more efficiently the next time.

Figure 1 Flow of Serial Web Crawler

3. Architectures and Implementation

3.1 Serial Web Crawler

    The crawler maintains a list of unvisited URLs called the frontier, which acts as a queue. The list is initialized with seed URLs. Each crawling loop involves picking the next URL from the frontier, checking whether the URL has previously been visited and, if not, fetching the page corresponding to the URL through HTTP, parsing the retrieved page to extract URLs and application-specific information, and finally adding the unvisited URLs to the frontier. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead end for the crawler: it has no new page to fetch and hence it stops. Figure 1 shows the flow of a basic serial web crawler. The complete implementation of the above model can be found in Implementation 1.
    Algorithm: Serial crawler
    1. Initialise the crawler constructor with the variables pageTable and revPageTable assigned to a HashMap.
    2. Define a method get_seed() to get the seed_url.
    3. Parse the seed_url and save its hostname.
    4. Initialize the frontier.
    5. Define a method test_seed_url:
    6.     if the url has the same hostname as the seed url then return false, otherwise return true.
    7. Define a method to get all urls:
    8.     try fetching the url and throw an exception on failure
    9.     parse the page using an HTML parser
    10.    for each link 'a' found in the parsed page:
    11.        if the parsed link is not a safeURL
    12.            throw invalid URL
    13.        otherwise append the url and link
    14.        // end if
    15.    // end for each
    16. Define a method crawl:
    17.    for each current url in the frontier:
    18.        fetch all the urls and update the frontier length
    19.        for each url in urls:
    20.            if the current url is in the page table then push the url
    21.            if the url is not in the frontier and test_seed_url(url) then push the url to the frontier
    22.            // end if
    23.        // end if
    24.    // end for each
    25. // end for each
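    To make the pseudo-code above concrete, the following is a minimal sketch of a serial crawler in Python 3, using requests to fetch pages and Beautiful Soup (named later in the paper as the HTML parser) to extract links. The helper names, the same-host policy in test_seed_url and the page limit are illustrative assumptions rather than the authors' exact implementation.

    # Minimal serial crawler sketch (illustrative; not the authors' exact code).
    import requests
    from bs4 import BeautifulSoup
    from collections import deque
    from urllib.parse import urljoin, urlparse

    def test_seed_url(url, seed_host):
        """Return True if the URL stays on the same host as the seed (assumed policy)."""
        return urlparse(url).netloc == seed_host

    def get_all_urls(url):
        """Fetch a page over HTTP and return the absolute URLs of all <a href> links."""
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            return []                       # step 8: skip pages that fail to fetch
        soup = BeautifulSoup(html, "html.parser")
        return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    def crawl(seed_url, max_pages=50):
        seed_host = urlparse(seed_url).netloc
        frontier = deque([seed_url])        # queue of unvisited URLs
        page_table = {}                     # visited URL -> outgoing links
        while frontier and len(page_table) < max_pages:
            current = frontier.popleft()
            if current in page_table:       # skip already visited pages
                continue
            links = get_all_urls(current)
            page_table[current] = links
            for link in links:
                if link not in page_table and test_seed_url(link, seed_host):
                    frontier.append(link)
        return page_table

    if __name__ == "__main__":
        pages = crawl("https://vit.ac.in")  # one of the sites used in the paper's tests
        print(len(pages), "pages crawled")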
3.2 Parallel Web Crawler

    Parallel crawlers can be understood as several modified serial crawlers running as separate processes. These multiple processes run in parallel, hence the name parallel web crawler. Figure 2 shows the flow chart of the working of the parallel web crawler. Here we created a process pool that is managed by the process manager, which is also responsible for spawning and scheduling new processes, and a shared memory area that is used as the frontier. Note that our crawler is a simple parallel web crawler; although there are many different ways of partitioning URLs, as mentioned in [11], the main aim here is to create a baseline model.

Figure 2 Flow of parallel web crawler

    Algorithm: Crawler
    1. Initialise a class crawler with a constructor.
    2. Define a method testseedurl with the seed url as parameter.
    3. If the hostname is the same as that of the seed url then return false,
    4. otherwise return true.
    5. Define a method getallurl with a url as parameter.
    6. Fetch and parse the HTML page.
    7. For each link in parsedPage.findAll('a', href=True):
    8.     if (!safeURL(link)) then throw URL invalid
    9.     otherwise append the url: urls.append(url + link)
    10. Initialise the daemon server.

    Algorithm: Frontier
    1. Initialise a class frontier_manager with a constructor.
    2. Initialise the process pool and the seed_url.
    3. Initialise the frontier with the global list and assign a token to each process using the lock acquire and release methods.
    4. For each url in urls:
    5.     if the url is not in the frontier, push the url into the frontier and release the lock
    6. // end if
    7. Define a method to write the url to the index table.
    8. The index acquires the lock and checks: if the seed url is not equal to a key in the index table then add the url to the table,
    9. otherwise indextable[local_seed_url].extend(url).
    10. Define a method make_write and write the urls to the frontier and the index table.
    11. Define a static method crawl and initialise a crawler.
    12. Return the seed url.
    13. Define a method start and create a pool process.
    14. Close the pool process.

    The above baseline code uses multiprocessing for creating and managing multiple processes. A lock-based system is used to access the shared memory space. The crawler code is very similar to that of the serial web crawler; the modification is in the Frontier. The Frontier spawns two crawler workers to fetch the pages.
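    As a concrete illustration of this baseline, the following is a minimal sketch of a lock-protected shared frontier driving two crawler workers through a multiprocessing pool. The paper states that multiprocessing and lock-based access to shared memory are used; the specific class and function names, the round-based draining of the frontier and the fixed number of rounds are assumptions for illustration.

    # Minimal parallel-crawler sketch with a shared, lock-protected frontier.
    import multiprocessing as mp
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def get_all_urls(url):
        """Serial-crawler style fetch-and-parse step reused by every worker."""
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            return []
        soup = BeautifulSoup(html, "html.parser")
        return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

    def worker(url):
        """One crawler worker: fetch a URL and return the links it discovered."""
        return url, get_all_urls(url)

    def run(seed_url, rounds=3):
        manager = mp.Manager()
        frontier = manager.list([seed_url])   # shared frontier
        index_table = manager.dict()          # shared page/index table
        lock = manager.Lock()                 # guards both shared structures

        with mp.Pool(processes=2) as pool:    # the Frontier spawns two crawler workers
            for _ in range(rounds):
                with lock:                    # drain the shared frontier for this round
                    batch = []
                    while len(frontier) > 0:
                        batch.append(frontier.pop())
                if not batch:
                    break
                for url, links in pool.map(worker, batch):
                    with lock:                # lock-based access to the shared memory
                        index_table[url] = links
                        for link in links:
                            if link not in index_table and link not in frontier:
                                frontier.append(link)
        return dict(index_table)

    if __name__ == "__main__":
        pages = run("https://vit.ac.in")
        print(len(pages), "pages indexed")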
3.3 Distributed Web Crawler

    Distributed web crawlers [14, 16, 17] are a technique in which many computers participate in the crawling process by contributing their computing bandwidth. The proposed architecture acts as a baseline for this technique. In it, there is a central server acting as the nameserver and four other servers acting as worker crawlers. Here we used dynamic assignment as the policy: the nameserver dynamically assigns the URLs and balances the load. Note that the nameserver itself is not responsible for crawling, in order to reduce its workload.

Figure 3 Flow of distributed web crawler using client-server architecture

    Apart from its dynamic assignment job, the nameserver is also responsible for monitoring the heartbeat and other meta-information of the worker crawlers. The nameserver can be a single point of failure (SPOF) during the task. To avoid this, the nameserver saves all the meta-information in the form of checkpoints in the global store. On failure of the nameserver, one of the crawlers is elected as the new nameserver, which fetches the latest checkpoint and continues the task. Note that the crawlers also receive the heartbeat signal of the nameserver, in order to identify when the nameserver is down.

    Before going through the flow, note that we use two frontiers: a local frontier and a global frontier. The local frontier belongs to the worker instance, whereas the global frontier is part of the global store. The client triggers the nameserver by providing the seed URLs to crawl; the nameserver initializes the global frontier with the seed URLs and dynamically assigns the URLs to the respective crawlers' local frontiers. Each crawler individually acts as a serial web crawler with its own DNS resolver, frontier queue and pagetable. For filtering the URLs, the crawlers also communicate with the global store to check whether a URL has already been visited. Upon completing the crawling process, the crawler dumps its pagetable into the common storage and asks the nameserver to allocate new seed URLs. This process continues until termination is triggered by the nameserver.

    Algorithm: Crawler
    1. Initialise a class crawler with a constructor.
    2. Define a method testseedurl with the seed url as parameter.
    3. If the hostname is the same as that of the seed url then return false,
    4. otherwise return true.
    5. Define a method getallurl with a url as parameter.
    6. Fetch and parse the HTML page.
    7. For each link in parsedPage.findAll('a', href=True):
    8.     if (!safeURL(link)) then throw URL invalid
    9.     otherwise append the url: urls.append(url + link)
    10. Initialise the daemon server.

    Algorithm: Frontier_Manager
    1. Initialize a class Frontier_Manager() and the variable seed url.
    2. Create an index table using a HashMap.
    3. Define a method testseedurl with the seed url as parameter.
    4. If the hostname is the same as that of the seed url then return false,
    5. otherwise return true.
    6. Define a method write_to_frontier with seed_url and host urls as parameters.
    7. For each url in urls:
    8.     if (!(url in frontier and test_seed_url(url))) then self.frontier.append(url.strip())
    9. Define a method write_to_index_table.
    10. If the local seed url is not present in the index table then add the local seed url,
    11. otherwise index_table[local_seed_url].extend(urls).
    12. Write the local_seed_url and urls to the frontier and the index_table.
    13. Fetch the method crawl using the index and crawler variables.
    14. // end if
    15. // end for each
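    Since the paper states that the Pyro4 Python library is used to simulate these architectures, the following is a minimal sketch of how the nameserver-side global frontier and a worker crawler could communicate over Pyro4. The class names, the registered object name example.frontier, the batch size and the simplified assignment logic are assumptions for illustration; checkpointing and heartbeat monitoring are omitted for brevity. A Pyro4 name server must be running separately (pyro4-ns).

    # Minimal Pyro4 sketch of the nameserver-side global frontier and a worker crawler.
    import threading
    import Pyro4

    @Pyro4.expose
    class GlobalFrontier:
        """Global store kept by the nameserver: global frontier plus visited set."""

        def __init__(self, seed_urls):
            self.frontier = list(seed_urls)
            self.visited = set()
            self.lock = threading.Lock()

        def get_work(self, batch_size=5):
            """Dynamically assign a batch of URLs to a worker's local frontier."""
            with self.lock:
                batch, self.frontier = self.frontier[:batch_size], self.frontier[batch_size:]
                return batch

        def report(self, visited_urls, new_urls):
            """Worker dumps its results; unseen URLs re-enter the global frontier."""
            with self.lock:
                self.visited.update(visited_urls)
                for url in new_urls:
                    if url not in self.visited and url not in self.frontier:
                        self.frontier.append(url)

    def start_nameserver(seed_urls):
        daemon = Pyro4.Daemon()                      # serves remote calls from workers
        uri = daemon.register(GlobalFrontier(seed_urls))
        Pyro4.locateNS().register("example.frontier", uri)
        daemon.requestLoop()

    def worker_crawler(crawl_fn):
        """A worker: pull URLs from the global frontier, crawl locally, report back."""
        frontier = Pyro4.Proxy("PYRONAME:example.frontier")
        while True:
            batch = frontier.get_work()
            if not batch:
                break
            for url in batch:
                links = crawl_fn(url)                # local serial-crawler step
                frontier.report([url], links)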
3.4 Semantic Distributed Web Crawler

    Distributed semantic web crawlers are used for crawling both semantic web pages in RDF/OWL format and HTML pages. The distributed semantic crawler uses a component called the page analyser to understand the page context. The ontology analyser creates models for the fetched OWL/RDF pages, and these models are stored. Later, these models are matched against the stored ontology to make the crawling decision.

Figure 4: Architecture of Distributed Semantic web crawler

    Algorithm: Crawler controller
    1. Initialise a string with the seed URL.
    2. Check whether the string is present in the database.
    3. If the seed URL exists, print "already exists",
    4. otherwise insert the URL details.
    5. Assign a variable for the statement and add the seed details to the database.
    6. If (statement != empty) { execute statement }
    7. otherwise print "statement not executed".
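    A minimal sketch of the crawler controller's seed-URL bookkeeping is shown below, assuming a SQLite database; the paper does not name the database engine, and the table and column names are illustrative assumptions.

    # Crawler-controller sketch: register a seed URL only if it is not already stored.
    # SQLite is assumed for illustration; the paper does not specify the storage engine.
    import sqlite3

    def register_seed(db_path, seed_url):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS seeds (url TEXT PRIMARY KEY)")
        # Steps 2-3: check whether the seed URL is already present.
        exists = conn.execute("SELECT 1 FROM seeds WHERE url = ?", (seed_url,)).fetchone()
        if exists:
            print("already exists")
        else:
            # Steps 4-7: build and execute the insert statement.
            statement = "INSERT INTO seeds (url) VALUES (?)"
            if statement:
                conn.execute(statement, (seed_url,))
                conn.commit()
            else:
                print("statement not executed")
        conn.close()

    if __name__ == "__main__":
        register_seed("crawler.db", "https://vit.ac.in")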

    Algorithm: Model Extraction
    1. Initialise the variables id, url, html and langType.
    2. Create an object and read the url.
    3. Repeat while a statement is present:
    4.     define variables and get the subject, predicate, object and URI
    5.     if the object is a URIResource then get its URI and assign it to ob
    6.     // end if
    7.     if langType is HTML then
    8.         if the subject does not contain '#' and is not null then save the subject to the database,
    9.         if the predicate does not contain '#' and is not null then save the object to the database.
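    To illustrate the model-extraction step, the sketch below iterates over the subject/predicate/object statements of a fetched RDF/OWL document using the rdflib library. rdflib and the in-memory list standing in for the database are assumptions; the paper does not name the RDF library it uses, and the example file path is hypothetical.

    # Model-extraction sketch for RDF/OWL pages using rdflib (assumed library).
    from rdflib import Graph, URIRef

    def extract_model(source, fmt="xml"):
        """Walk the statements of an RDF/OWL document and collect terms to be stored."""
        graph = Graph()
        graph.parse(source, format=fmt)        # step 2: read the page into a model
        database = []                          # stand-in for the crawler's real store

        for subject, predicate, obj in graph:  # step 3: repeat for every statement
            # step 5: if the object is a URI resource, keep its URI, else its literal value
            ob = str(obj) if isinstance(obj, URIRef) else obj.toPython()

            # steps 8-9: skip fragment-bearing terms before saving
            if "#" not in str(subject):
                database.append(("subject", str(subject)))
            if "#" not in str(predicate):
                database.append(("object", ob))
        return database

    if __name__ == "__main__":
        # "ontology.rdf" is a hypothetical local RDF/XML file used only as an example.
        triples = extract_model("ontology.rdf")
        print(len(triples), "terms extracted")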
3.5 Focused Web Crawlers

    Focused web crawlers [18] are used to collect web pages on a specific topic. These crawlers search the entire web for a predefined topic, which keeps irrelevant information away from the user, so a focused crawler saves computational resources. The semantic focused crawler is multi-threaded, and each thread takes the web page with the highest dynamic semantic relevance from a priority queue. The main work of a thread is to parse the various hyperlinks and add them to the priority queue. Thus, the priority queue holds the details of the web pages that are to be parsed by the threads. The semantic focused crawler has another, temporary queue which maintains the visited web pages; each thread also checks this temporary queue for already visited pages.

    Algorithm: Semantic focused crawler
        Q: priority queue
        DSR: dynamic semantic relevance
        Links: queue of traversed URLs
        1. Initialize the priority queue Q with the seed URLs.
        2. Repeat while (!Q.empty() && fetch_cnt <= Limit) {
        3.     web_page.url = Q.top().getUrl();
                   // get the most relevant single URL from the priority queue
        4.     Fetch and parse web_page.url;
        5.     web_page.urls = extract URLs (hyperlinks) from web_page.url;
                   // list of URLs
        6.     For each web_page.urls {
        7.         already_exist = check web_page.urls[i] in Links;
                       // check for duplicates
        8.         If (!already_exist) {
        9.             Enqueue web_page.urls[i] in Links;
        10.            Fetch and parse web_page.urls[i];
        11.            Compute the DSR of web_page.urls[i];
        12.            Enqueue (web_page.urls[i], DSR) in Q;
        13.            Store (web_page.urls[i], DSR) in the local database;
        14.        } // end of If
        15.    } // end of For each
        16. }

    Here the Pyro4 Python library is used to simulate the described architecture, and Beautiful Soup is used to parse the HTML pages.
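    The following is a minimal single-threaded sketch of the priority-queue loop above. The dynamic semantic relevance (DSR) computation is replaced by a simple keyword-overlap score over the anchor text and link URL, and discovered links are scored without being fetched first; both simplifications are illustrative assumptions, since the DSR formula is not given here.

    # Focused-crawler sketch: priority queue ordered by a stand-in relevance score.
    import heapq
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def relevance(text, topic_keywords):
        """Stand-in for dynamic semantic relevance: fraction of topic keywords present."""
        text = text.lower()
        hits = sum(1 for kw in topic_keywords if kw in text)
        return hits / len(topic_keywords)

    def focused_crawl(seed_urls, topic_keywords, limit=20):
        queue = [(-1.0, url) for url in seed_urls]   # max-priority via negated DSR
        heapq.heapify(queue)
        links = set(seed_urls)                       # queue of traversed URLs (visited set)
        results = []                                 # (url, DSR) pairs, i.e. the local database

        while queue and len(results) < limit:
            neg_dsr, url = heapq.heappop(queue)      # most relevant URL first
            try:
                html = requests.get(url, timeout=5).text
            except requests.RequestException:
                continue
            soup = BeautifulSoup(html, "html.parser")
            results.append((url, -neg_dsr))
            for a in soup.find_all("a", href=True):
                link = urljoin(url, a["href"])
                if link not in links:                # duplicate check against Links
                    links.add(link)
                    dsr = relevance(a.get_text() + " " + link, topic_keywords)
                    heapq.heappush(queue, (-dsr, link))
        return results

    if __name__ == "__main__":
        for url, score in focused_crawl(["https://vit.ac.in"], ["admission", "research"]):
            print(round(score, 2), url)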
4. Results

    Table 1 and Figure 6 show the results of testing all the semantic and non-semantic crawlers on the given websites. The total number of test cases is 30. Among the non-semantic crawlers, the distributed crawler outperforms the others in 20 cases, the parallel crawler in 8 cases and the serial crawler in only 2 cases; among the semantic crawlers, the semantic distributed crawler outperforms in 24 cases and the focused crawler in 26 cases. Therefore, the distributed crawler achieves an accuracy of about 66.67%, the parallel crawler about 26.67%, the serial crawler about 6.67%, the semantic distributed crawler 80% and the focused crawler about 86.66%.

Table 1 Number of times a specified crawler outperforms other crawlers

Figure 6 Graphical representation of the number of times a specified crawler outperforms (test sites: vit.ac.in, chennai.vit.ac.in, en.wiki)

    Figure 6 shows the graphical representation of the number of times a specified crawler outperforms the others.

5. Conclusion

    It can be concluded that, for the majority of cases, the focused crawler and the semantic distributed crawler give the best results for crawling a specific website. From the results it is also clear that the focused crawler performs well as the number of crawls increases.

References

[1] Al-Bahadili, H., Qtishat, H., & Naoum, R. (2013). Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization. International Journal on Web Service Computing, 4, 19-37. doi:10.5121/ijwsc.2013.4102.
[2] Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. In Proceedings of the Seventh International Conference on World Wide Web (WWW7). Elsevier Science Publishers B. V., NLD, 161-172.
[3] Jayant Madhavan, David Ko, Łucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Halevy. 2008. Google's Deep Web crawl. Proc. VLDB Endow. 1, 2 (August 2008), 1241-1252.
[4] Kundu, A., Dutta, R., Dattagupta, R., & Mukhopadhyay, D. (2009). Mining the web with hierarchical crawlers - A resource sharing based crawling approach. IJIIDS, 3, 90-106. doi:10.1504/IJIIDS.2009.023040.
[5] Kumar, M. S., & Neelima, P. (2011). Design and Implementation of Scalable, Fully Distributed Web Crawler for a Web Search Engine. International Journal of Computer Applications, 15. doi:10.5120/1963-2629.
[6] Patidar, T., & Ambasth, A. (2016). Improvised Architecture for Distributed Web Crawling. International Journal of Computer Applications, 151, 14-20.
[7] Kabisch, T., Dragut, E., Yu, C., & Leser, U. (2009). A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration. PVLDB, 325-336. doi:10.14778/1687627.1687665.
[8] Duen Horng Chau, Shashank Pandit, Samuel Wang, and Christos Faloutsos. 2007. Parallel crawling for online social networks. In Proceedings of the 16th International Conference on World Wide Web (WWW '07). Association for Computing Machinery, New York, NY, USA, 1283-1284.
[9] Cho, Junghoo, and Hector Garcia-Molina. 2002. Parallel crawlers. In Proceedings of the 11th International Conference on World Wide Web.
[10] Aggarwal, C., Al-Garawi, F., & Yu, P. (2001). Intelligent Crawling on the World Wide Web with Arbitrary Predicates. 96-105. doi:10.1145/371920.371955.
[11] Junghoo Cho and Hector Garcia-Molina. Parallel Crawlers. University of California, Los Angeles.
[12] Naresh Kumar and Manjeet Singh (2015). Framework for Distributed Semantic Web Crawler. IEEE International Conference on Computational Intelligence and Communication Networks.
[13] K. Lokeshwaran and A. Rajesh (2018). A Study of Various Semantic Web Crawlers and Semantic Web Mining. International Journal of Pure and Applied Mathematics, 120(5), 1163-1173.
[14] F. Liu and W. Xin (2020). Implementation of Distributed Crawler System Based on Spark for Massive Data Mining. 2020 5th International Conference on Computer and Communication Systems (ICCCS), Shanghai, China, pp. 482-485. doi:10.1109/ICCCS49078.2020.9118442.
[15] Rajiv, S., & Navaneethan, C. (2020). Keyword Weight Optimization using Gradient Strategies in Event Focused Web Crawling. Pattern Recognition Letters.
[16] S. K. Bal and G. Geetha (2016). Smart distributed web crawler. 2016 International Conference on Information Communication and Embedded Systems (ICICES), Chennai, pp. 1-5. doi:10.1109/ICICES.2016.7518893.
[17] Wang, HongRu, et al. (2018). Anti-Crawler strategy and distributed crawler based on Hadoop. 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA). IEEE.
[18] Boukadi, K., Rekik, M., Rekik, M., & Ben-Abdallah, H. (2018). FC4CD: a new SOA-based Focused Crawler for Cloud service Discovery. Computing, 100(10), 1081-1107.