A review of web crawling approaches
Elda Xhumaria, Izaura Xhumarib
a
    University of Tirana, Department of Informatics, Boulevard “Zogu I”, Tirana, 1001, Albania
b
    University of Tirana, Department of Informatics, Boulevard “Zogu I”, Tirana, 1001, Albania

Abstract
Websites are becoming ever richer in information of different formats. The data such sites hold today runs into millions of terabytes, but not all of the information on the net is useful. To enable the most efficient internet browsing for the user, one methodology is to use a web crawler. This study presents the web crawler methodology: the first steps of its development, how it works, the different types of web crawlers, the benefits of using them, and a comparison of their operating methods, including the advantages and disadvantages of the algorithms they use.

Keywords
Web crawler, Algorithms, Types of web crawlers

1. Introduction

    The World Wide Web is a large collection of data, and that data continues to grow day by day. Nowadays it has become an important part of human life to use the internet to gain access to information on the World Wide Web. Due to bandwidth, storage capacity, limited computer resources and the rapid growth of the World Wide Web, unforeseen scaling challenges have arisen for search engines. The two most important features of the web, the large volume of data and the speed at which it changes, make crawling difficult, as a large number of pages are added, changed and deleted every day. Although search engine technology has scaled up dramatically to keep pace with the rise of the Web, general-purpose search engines and crawlers have encountered the following limitations:
- It is impossible for them to index and analyze all pages and keep these search indexes up to date.
- They may return hundreds or more links for a user's query; because the query is misinterpreted, the pages behind these links may not be closely related to it.
- They may not meet the requirements of queries with different backgrounds, purposes and periods.
- Dynamic content on the Web, such as news and financial data, grows and changes frequently, and many search engines can take up to a month to refresh their indexes, so query results may no longer be valid at the time the request is made.
    Therefore, a technology is needed that enables fast crawling to assemble web pages of the highest possible content and quality and to keep those pages up to date. This "problem" can be solved using indexes built by a web crawler. A web crawler, otherwise known as a "web spider", is a program that browses the World Wide Web by "clicking" on any link it finds and automatically collecting the information found. However, building a large index of web pages is not the only web crawler application [1].

Proceedings of RTA-CSIT 2021, May 2021, Tirana, Albania
EMAIL: elda.xhumari@fshn.edu.al (E. Xhumari); izaura.xhumari@fshn.edu.al (I. Xhumari)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. How it works

    The crawler maintains a list of unvisited URLs called the frontier. The list is initialized with URLs provided by a user or another program. Each crawl cycle involves selecting a URL from the list and
retrieving the corresponding page for that URL via HTTP, analyzing it to extract URLs and specific information, and finally adding the unvisited URLs to the frontier list. Before being added to the list, these URLs may be assigned a score reflecting the benefit expected if the page at the corresponding URL is visited. The crawl process may end once a certain number of pages have been crawled. If the crawler is ready to visit another page and the frontier list is empty, this signals a dead end: the crawler has no new pages to visit, so it stops.

Figure 1: The Data Flow of a Crawler
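    As a concrete illustration of this cycle, consider the following minimal Python sketch of a frontier-based crawler. The seed URL, the page limit, and the use of the requests and BeautifulSoup libraries are assumptions made for the example, not part of the description above.

# Minimal frontier-based crawler sketch (seed URL, page limit, and the
# requests/BeautifulSoup libraries are illustrative assumptions).
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_url, max_pages=50):
    frontier = [seed_url]          # list of unvisited URLs
    visited = set()                # URLs already fetched

    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)      # select the next URL from the frontier
        if url in visited:
            continue
        try:
            page = requests.get(url, timeout=5)   # retrieve the page via HTTP
        except requests.RequestException:
            continue
        visited.add(url)
        soup = BeautifulSoup(page.text, "html.parser")
        # extract URLs and add the unvisited ones back to the frontier
        for anchor in soup.find_all("a", href=True):
            link = urljoin(url, anchor["href"])
            if link not in visited:
                frontier.append(link)
    # an empty frontier signals a dead end: no new pages to visit
    return visited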
3. Types of web crawlers

    Different types of web crawlers are available, depending on how web pages are crawled and how future web pages are retrieved and accessed. Some of them are as follows.

    A. Incremental Crawler
    An incremental web crawler is one of the traditional crawlers; it constantly updates an existing set of downloaded pages instead of restarting the crawling process from scratch each time. This requires some way to determine whether a page has changed since it was last downloaded. Pages can appear multiple times in the crawl order, and crawling is an ongoing process that conceptually never ends. To keep the content of downloaded web pages up to date, an incremental web crawler interleaves the revisiting of previously downloaded pages with the first visit to new pages [2]. The goal is to achieve freshness and coverage at the same time. The advantage of an incremental web crawler is that only valuable data is provided to the user; network bandwidth is saved and data enrichment is achieved.
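    One possible realization of the change-detection step is sketched below: previously downloaded pages are revisited and a content hash is compared to decide whether a page has changed since the last download. The hash-based test and the helper names are assumptions made for this example, not a method prescribed by the paper.

# Incremental recrawling sketch via content hashing (the hash-based
# change test and the helper names are illustrative assumptions).
import hashlib

import requests

def fingerprint(html):
    # a compact digest of the page content
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def recrawl(known_pages):
    """known_pages maps URL -> fingerprint from the previous crawl."""
    changed = []
    for url, old_digest in known_pages.items():
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue
        new_digest = fingerprint(html)
        if new_digest != old_digest:        # page changed since last visit
            known_pages[url] = new_digest   # refresh the stored fingerprint
            changed.append(url)
    return changed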
    B. Form Focused Crawler
    The Form Focused Crawler deals with the sparse distribution of forms on the Web. It avoids crawling through unproductive links by restricting the search to a specific topic, learning the characteristics of links and pages that lead to pages containing searchable forms, and using appropriate stopping criteria. The crawler uses two rankings, site and link, to guide its search; later, a third classifier, the form classifier, is used to filter out useless forms.

    C. Focused Crawler
    A Focused Crawler collects documents which are specific and relevant to the given topic. Owing to how it works, this crawler is sometimes also known as a Topic Crawler. A focused crawler tends to download pages that are related to each other, determining whether a given page is similar to the specific topic. One of the advantages of the focused crawler is its economy in hardware and network resources: it reduces the amount of network traffic, logging and downloads [3]. A Focused Crawler searches, acquires, indexes, and maintains pages for specific groups of topics that represent a relatively narrow segment of the web. It is driven by a classifier that learns to recognize relevance from examples embedded in a topic taxonomy, and a distiller that identifies pages that are good access points to relevant content.
Figure 2: General architecture of Focused Web Crawler

    Focused crawling is a new approach to increasing the accuracy of expert internet search. An ideal focused crawler would download only those pages that are related to the topic, ignoring other pages, and would anticipate whether a link is likely to lead to topic-relevant pages before downloading it. A Focused Crawler has three main components: a classifier that makes relevance judgments on crawled pages to decide whether to expand their out-links, a distiller that measures the centrality of crawled pages to determine visit priorities, and a crawler with dynamically reconfigurable priority controls governed by the classifier and distiller.
    The focused crawler aims to provide a simpler alternative for overcoming the issue that immediately reachable pages may rank low in relevance to the topic in question. The idea is to recursively execute an exhaustive search up to a certain depth, starting from the relatives of a highly ranked page [4].
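    The core decision of a focused crawler, whether a fetched page is related enough to the topic to justify expanding its out-links, can be sketched as follows. The simple keyword-overlap score and the threshold are illustrative stand-ins for the classifier and distiller described above.

# Focused-crawling sketch: expand a page's out-links only if the page is
# relevant to the topic. The keyword-overlap score and the threshold are
# illustrative stand-ins for a real classifier.
def relevance(text, topic_keywords):
    # topic_keywords: set of lowercase topic terms
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in topic_keywords)
    return hits / len(words)

def should_expand(page_text, topic_keywords, threshold=0.01):
    # stand-in for the classifier's relevance judgment
    return relevance(page_text, topic_keywords) >= threshold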
    D. Parallel Crawler
    As the size of the internet grows, it becomes difficult to retrieve the entire web, or a major portion of it, using a single process. Therefore, search engines typically run multiple processes in parallel to perform this task, so that the download rate is maximized. This kind of crawler is known as a parallel crawler [5]. A parallel crawler consists of multiple crawling processes, referred to as C-procs, which can run on a network of workstations [6]. Parallel crawlers rely on page freshness and page selection, and they may run on a local network or be distributed across geographically different locations. Parallelizing the crawling system is essential for downloading documents in a reasonable amount of time.
    E. Distributed Crawler
    The Distributed Web Crawler applies distributed computing techniques: many crawlers operate in a distributed fashion within the web crawling process in order to cover as much of the web as possible [7]. A central server manages the communication and synchronization of the nodes, which are geographically distributed. It mainly uses the PageRank algorithm to increase its efficiency and search quality. The advantage of the distributed web crawler is that it is resilient to system crashes and other events, and it can be adapted to many crawling applications. To design an efficient distributed web crawler, the crawling task must be divided among multiple machines in a synchronized process. Large websites should be assigned individually across the network, with fair and rational synchronous access. Meanwhile, synchronized distribution saves network bandwidth resources [4].

4. Web crawling algorithms

  i.    Breadth First Search

    Breadth First Search starts with a small set of pages and then explores other pages by following their links, breadth-first. In practice, websites are not traversed strictly breadth-first; a variety of policies can be used, for example crawling the most important pages first. This method is used by many search engines, and it balances the load between servers. The breadth-first algorithm works level by level: it starts at the root URL and examines all the neighboring URLs at the same level. If the desired URL is found, the search terminates; if not, the search proceeds down to the next level, and the process repeats until the goal is reached. When all the URLs have been scanned but the objective has not been found, a failure is reported. The Breadth First Search algorithm is generally used when the objective lies in the shallower parts of a deep tree [8].
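    This level-by-level behavior can be written down directly. In the sketch below, the frontier is a FIFO queue and the search stops when the target URL is found or every level is exhausted; get_links is an assumed helper returning the out-links of a URL.

# Breadth-first crawl sketch: visit the frontier level by level until the
# target URL is found (get_links is an assumed helper returning out-links).
from collections import deque

def bfs_crawl(root_url, target_url, get_links):
    queue = deque([root_url])     # FIFO queue: current level first
    seen = {root_url}
    while queue:
        url = queue.popleft()
        if url == target_url:     # desired URL found: search terminates
            return url
        for link in get_links(url):
            if link not in seen:  # queued for the next level
                seen.add(link)
                queue.append(link)
    return None                   # all URLs scanned, objective not found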
 ii.    Depth First Search

    Depth First Search is an algorithm for traversing or searching tree or graph data structures. It systematically examines the search space starting from the root node and penetrating deeper through the child nodes. If there is more than one child, priority is given to the left-most child, and the search descends until no more children are available; it then backtracks to the nearest unexplored node and proceeds in a similar manner. This algorithm ensures that every edge is visited exactly once. It is suitable for deep search problems, but when the branches are large this algorithm can end up in an endless loop.
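    A stack-based version of this traversal might look as follows; the depth limit is an illustrative safeguard against the endless-loop problem mentioned above, not part of the classic algorithm, and get_links is again an assumed helper.

# Depth-first crawl sketch: descend along the left-most child first; a
# depth limit guards against endless descent (the limit is an
# illustrative safeguard). get_links is an assumed helper.
def dfs_crawl(root_url, get_links, max_depth=5):
    stack = [(root_url, 0)]       # LIFO stack: deepest node first
    visited = set()
    while stack:
        url, depth = stack.pop()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        # children are pushed in reverse so the left-most is explored first
        for link in reversed(get_links(url)):
            stack.append((link, depth + 1))
    return visited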
iii.    Best First Search

    Best-first algorithms are often used to find good search paths. Best First Search is a search algorithm that explores a graph starting from the most promising node, selected according to a specified rule. The basic idea is that, given a URL frontier, the best URL is chosen according to some evaluation criterion such as precision, recall, or F-score. In this algorithm, the URL selection process is driven by the lexical similarity between the topic keywords and the URL's source page: the similarity between a page and the topic keywords is used to evaluate the fit of all the outbound links of that page.

Figure 3: The best-first algorithm pseudo-code
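    In code, the "most promising node first" policy amounts to a priority queue keyed by the similarity score, as in this sketch. The naive lexical-overlap similarity stands in for whatever evaluation criteria a real system would use, and fetch_text and get_links are assumed helpers.

# Best-first crawl sketch: always expand the most promising URL, where
# "promising" is the lexical similarity between topic keywords and the
# source page (a naive stand-in for real evaluation criteria).
import heapq

def similarity(text, topic_keywords):
    # topic_keywords: set of lowercase topic terms
    words = set(text.lower().split())
    return len(words & topic_keywords) / (len(topic_keywords) or 1)

def best_first_crawl(seed_url, topic_keywords, fetch_text, get_links,
                     max_pages=50):
    # heapq is a min-heap, so scores are negated to pop the best URL first
    heap = [(0.0, seed_url)]
    visited = set()
    while heap and len(visited) < max_pages:
        _score, url = heapq.heappop(heap)
        if url in visited:
            continue
        visited.add(url)
        score = similarity(fetch_text(url), topic_keywords)
        # out-links inherit the fit of their source page
        for link in get_links(url):
            if link not in visited:
                heapq.heappush(heap, (-score, link))
    return visited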
 iv.    Fish Search Algorithm

    The main principle of the algorithm is as follows: it takes as input an initial URL and a search query, and dynamically builds a priority list (initialized with the initial URL) of the next URLs, referred to as nodes, to be explored. At each step the first node is removed from the list and processed. As the text of each document becomes available, a scoring component analyzes it and assesses whether it is relevant or irrelevant to the search query (value 1 or 0); based on that result, a heuristic decides whether to pursue the search in that direction. Whenever a document source is retrieved, it is scanned for links, and the nodes reached by these links are assigned a depth value. If the parent is relevant, the depth of its children is set to a predetermined value; otherwise, the depth of the children is set to one less than the depth of the parent. When the depth reaches zero, that direction is cut off and none of the node's children are included in the list [9].

  v.    Shark-Search Algorithm

    The main flaw of the Fish-Search algorithm is that its relevance computation is too coarse: it only distinguishes 0 and 1, irrelevant and relevant. Secondly, every node's potential score has low precision, with only three possible values (0, 0.5, and 1). Aimed at these disadvantages, Michael Hersovici et al. [10] put forward an improved Shark-Search algorithm which mainly refines the page relevance computation and the method for computing a child node's potential score. In detail, the algorithm:
- imports a vector space model to compute the relevance between a page and the user query;
- considers the information given by the anchor text near a hyperlink and computes the relevance between it and the user's query;
- combines both of the above factors in the formula for computing a child node's potential score.
    Through these improvements, the Shark-Search algorithm's efficiency is much better than Fish-Search's [2].
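    The fish-search depth rule, children of a relevant page receive a preset depth while children of an irrelevant one inherit the parent's depth minus one, can be sketched as below. The names, the preset depth, and the binary relevance() helper are illustrative; shark search would replace the binary score with the continuous similarity described above.

# Fish-search depth-rule sketch: children of a relevant parent get a
# preset depth; otherwise they inherit the parent's depth minus one, and
# directions at depth zero are cut off. relevance() is an assumed binary
# (0/1) scorer; shark search would make it a continuous similarity.
PRESET_DEPTH = 3

def child_depth(parent_relevant, parent_depth):
    return PRESET_DEPTH if parent_relevant else parent_depth - 1

def expand(node_url, node_depth, relevance, get_links, priority_list):
    rel = relevance(node_url)          # 1 = relevant, 0 = irrelevant
    for link in get_links(node_url):
        depth = child_depth(rel == 1, node_depth)
        if depth > 0:                  # depth zero: direction interrupted
            # (ordering of the priority list by score is omitted here)
            priority_list.append((link, depth))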

 vi.    Page Rank Algorithm

    In the PageRank algorithm, the web crawler decides on the importance of the web pages on each website through the total number of links or citations per page. PageRank is calculated from the relatedness between web pages. The rank of a page is computed as:

    PR(A) = (1 - d) + d * (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n)),    (1)

where PR(A) is the PageRank of page A, d is the damping factor, and T_1, ..., T_n are the pages that link to A.
    To find the PageRank of a page A, PR(A), one must first find all the pages that link to A. For each page T_i that links to A, C(T_i) gives the number of outbound links on page T_i. The same procedure is applied to T_2, T_3 and all other pages linking to the main page A, and the sum of their contributions provides the PageRank of the page [11].
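    Equation (1) can be evaluated iteratively over a link graph, as in the sketch below. The toy graph, the damping factor of 0.85, and the iteration count are illustrative choices.

# Iterative evaluation of equation (1) on a toy link graph (the graph,
# d = 0.85, and the iteration count are illustrative choices).
def pagerank(links, d=0.85, iterations=20):
    """links maps each page to the list of pages it links to."""
    pr = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_pr = {}
        for page in links:
            # sum PR(Ti)/C(Ti) over all pages Ti that link to `page`
            rank_in = sum(pr[t] / len(links[t])
                          for t in links if page in links[t])
            new_pr[page] = (1 - d) + d * rank_in
        pr = new_pr
    return pr

toy_graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
print(pagerank(toy_graph))   # page A accumulates the highest rank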
Table 1
Advantages and limitations of web crawling algorithms

 Algorithm       Advantages                              Limitations
 Breadth First   Suitable for situations where the       If a solution is far away, it consumes
 Search          solution is located near the top of     time. Consumes a large amount of
                 a deep tree.                            memory.
 Depth First     Suitable for in-depth search            If the branching is large, the algorithm
 Search          problems. Consumes very little          can end up in an endless cycle.
                 memory.
 Fish Search     The algorithm is helpful in forming     Network resource usage is high. Fish
                 the priority table.                     search crawlers significantly load not
                                                         only the network but also web servers.
 Shark Search    Refines the page relevance              Network resource usage is high.
                 computation and the method for
                 computing potential scores.
 Page Rank       The most important pages are            Favors older pages, because a new
                 returned quickly, as rank is            page, even a very good one, will not
                 calculated on the basis of a page's     have many links unless it is part of an
                 popularity.                             existing website.
5. Approach

    A traditional crawler worked simply by extracting static data from HTML code, and until recently most websites could be handled by the same crawling process. Crawling is no longer as simple as it was a few years ago, due to the increasing use of JavaScript frameworks such as Angular, React and Meteor. Many websites are JavaScript-heavy and generate content through asynchronous JavaScript calls after the page has loaded. The use of these frameworks makes developers' lives simpler and provides many benefits for creating dynamic sites, but to crawl this type of website, web crawlers use Selenium.
    Selenium is a web browser automation tool originally designed to automate web applications for testing purposes. It is now used for many other applications, such as automating web-based admin tasks and interacting with platforms which do not provide an API, as well as for web crawling.
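    The sketch below shows this pattern: Selenium drives a real browser, allows the page to render its JavaScript content, and then harvests links from the live DOM. The headless Chrome setup and the fixed wait are illustrative choices.

# Selenium-based crawling sketch for JavaScript-heavy pages: a headless
# browser renders the page before links are harvested from the live DOM
# (headless Chrome and the fixed wait are illustrative choices).
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

def crawl_dynamic(url):
    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.implicitly_wait(5)   # allow asynchronous JavaScript to run
        anchors = driver.find_elements(By.TAG_NAME, "a")
        return [a.get_attribute("href") for a in anchors
                if a.get_attribute("href")]
    finally:
        driver.quit()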
    Building a focused web crawler with the Selenium tool is a good way to collect useful information. A focused crawler is an approach to increase the accuracy of expert internet search: an ideal focused crawler would download only the related pages, ignoring other pages, and would anticipate whether a link leads to a site related to the specific topic before downloading it.
    One use case of a focused web crawler is extracting financial data. The financial market is a place of risk and instability; it is hard to predict how the curve will move, and sometimes, for investors, one decision can be a make-or-break move. That is why experienced practitioners never lose track of financial data. Financial data, when extracted and analyzed in real time, can provide a wealth of information for investments and trading, and people in different positions scrape financial data for varied purposes.

6. Conclusions

    The web crawler is an essential tool of information retrieval: it roams the Web and downloads web documents that suit the user's needs. Web crawlers are used by search engines and other users to regularly ensure that their databases are up to date. This article has presented a review of different types of crawling technologies and algorithms, and of why "focused crawling" technology is being used. The crawling algorithm is the most important part of any search engine; focused crawlers use more complex systems and techniques to identify information of high relevance and quality. The search algorithm is the heart of the search engine system, and the choice of algorithm has a significant impact on the operation and effectiveness of the focused crawler and the search engine. In conclusion, the focused crawler, compared to other crawlers, is intended for advanced web users: it focuses on a specific topic and does not waste resources on irrelevant material.

7. References

[1] H. Mahmud, M. Soulemane, M. Rafiuzzaman, A framework for dynamic indexing from hidden web, International Journal of Computer Science Issues 8 (2011).
[2] Su Guiyang, Li Jianhua, Ma Yinghua, Li Shenghong, Song Juping, New Focused Crawling Algorithm, Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200030, P. R. China, 2004.
[3] C. Olston, M. Najork, Web Crawling.
[4] G. Pant, P. Srinivasan, F. Menczer, Crawling the Web, in: Web Dynamics, Department of Management Sciences, School of Library and Information Science, The University of Iowa, 2004, pp. 4-6.
[5] D. Khurana, S. Kumar, Web Crawler: A Review, IJCSMS International Journal of Computer Science & Management Studies, Vol. 12, Issue 01, January 2012.
[6] T. V. Udapure, R. D. Kale, R. C. Dharmik, Study of Web Crawler and its Different Types, IOSR Journal of Computer Engineering (IOSR-JCE), Volume 16, Issue 1, Ver. VI, Feb. 2014, pp. 01-05.
[7] Y. Patil, S. Patil, Review of Web Crawlers with Specification and Working, Vol. 5, Issue 1, January 2016.
[8] J. Cho, H. Garcia-Molina, Effective Page Refresh Policies for Web Crawlers, ACM Transactions on Database Systems, 2003.
[9] A. Amrin, C. X., Focused Web Crawling Algorithms, Shanghai, China, 2015.
[10] M. Hersovici, M. Jacovi, Y. S. Maarek, et al., The Shark-Search Algorithm. An Application: Tailored Web Site Mapping, Computer Networks and ISDN Systems 30 (1998) 317-326.
[11] Tian Chong, A Kind of Algorithm For Page Ranking Based on Classified Tree In Search Engine, in: Proc. International Conference on Computer Application and System Modeling (ICCASM 2010), 2010.