A review of web crawling approaches

Elda Xhumari a, Izaura Xhumari b

a University of Tirana, Department of Informatics, Boulevard "Zogu I", Tirana, 1001, Albania
b University of Tirana, Department of Informatics, Boulevard "Zogu I", Tirana, 1001, Albania

Abstract
Websites are getting richer and richer with information in different formats. The data that such sites possess today runs through millions of terabytes, but not all information on the net is useful. To enable the most efficient internet browsing for the user, one methodology is to use a web crawler. This study presents web crawler methodology, the first steps of its development, how it works, the different types of web crawlers and the benefits of using them, and compares their operating methods together with the advantages and disadvantages of each algorithm they use.

Keywords
Web crawler, Algorithms, Types of web crawlers

Proceedings of RTA-CSIT 2021, May 2021, Tirana, Albania
EMAIL: elda.xhumari@fshn.edu.al (E. Xhumari); izaura.xhumari@fshn.edu.al (I. Xhumari)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The World Wide Web is a large collection of data, data that continues to grow day by day. Nowadays it has become an important part of human life to use the internet to gain access to information on the World Wide Web. Due to bandwidth, storage capacity, limited computing resources and the rapid growth of the World Wide Web, unforeseen scaling challenges have arisen for search engines. The two most important features of the web, the large volume of data and the speed of its change, pose a difficulty for web crawling, as a large number of pages are added, changed and deleted every day. Although search engine technology has dramatically scaled up to keep pace with the rise of the Web, general-purpose search engines and crawlers have encountered some limitations, as follows:
- It is impossible for them to index and analyze all pages and keep their search indexes up to date.
- They may return hundreds or more links for a user's query; due to misunderstanding of the query, the pages behind these links may not be closely related to it.
- They may not meet the requirements of queries with different backgrounds, purposes and periods.
- Dynamic content on the Web, such as news and financial data, is growing and changing frequently. Many search engines can take up to a month to refresh their indexes across the Web, so query results are probably no longer valid at the time the request is made.

Therefore, a technology is needed that enables fast crawling in order to assemble web pages of the highest possible content and quality and to keep these pages up to date. This "problem" can be solved using indexes built by a web crawler. A web crawler, otherwise known as a "web spider", is a program that browses the World Wide Web by following any link it finds and automatically collecting the information it encounters. However, building a large index of web pages is not the only web crawler application [1].
2. How it works?

The crawler maintains a list of unvisited URLs called the frontier. The list is first initialized with URLs provided by a user or another program. Each crawl cycle involves selecting a URL from the list, retrieving the corresponding page via HTTP, analyzing it to extract URLs and specific information, and finally adding the newly found unvisited URLs to the frontier. Before being added to the list, these URLs may be assigned a score depending on the benefit expected if the page at the corresponding URL is visited. The crawl process may end when a certain number of crawled pages has been reached. If the crawler is ready to visit another page and the frontier is empty, the situation signals a dead end: since the crawler has no new pages to visit, it stops.
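The crawl cycle described above can be sketched in Python. This is an illustrative sketch, not the paper's implementation: the HTTP fetch is simulated with an in-memory dictionary, and link extraction uses a minimal regular expression instead of a full HTML parser.

```python
import re
from collections import deque

def crawl(seed_urls, fetch, max_pages=100):
    """Run crawl cycles: pop a URL from the frontier, fetch its page,
    extract links, and enqueue unseen URLs (as in Section 2)."""
    frontier = deque(seed_urls)          # list of unvisited URLs
    visited = []
    seen = set(seed_urls)
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()         # select a URL from the frontier
        page = fetch(url)                # stands in for an HTTP request
        if page is None:
            continue
        visited.append(url)
        for link in re.findall(r'href="([^"]+)"', page):
            if link not in seen:         # only unvisited URLs enter the frontier
                seen.add(link)
                frontier.append(link)
    return visited                       # crawl ends: frontier empty or page limit hit

# A tiny in-memory "web" used in place of real HTTP fetching.
toy_web = {
    "a": '<a href="b">B</a> <a href="c">C</a>',
    "b": '<a href="c">C</a>',
    "c": '',
}
print(crawl(["a"], toy_web.get))  # visits a, then b, then c
```

The dead-end condition from the text corresponds to the `while` guard: when the frontier empties, the crawler simply stops.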
Figure 1: The Data Flow of a Crawler

3. Types of web crawler

Different types of web crawlers are available depending on how the web pages are crawled and how future web pages are retrieved and accessed. Some of them are as follows.

A. Incremental Crawler
An incremental web crawler is one of the traditional crawlers: it constantly updates an existing set of downloaded pages instead of restarting the crawling process from scratch each time. This requires some way of determining whether a page has changed since it was last downloaded. Pages can appear multiple times in the crawl order, and crawling is an ongoing process that conceptually never ends. To keep the content of downloaded web pages up to date, an incremental web crawler interleaves the revisiting of previously downloaded pages with the first visits to new pages [2]. The goal is to achieve freshness and coverage at the same time. The advantage of an incremental web crawler is that only valuable data is provided to the user; network bandwidth is saved and data enrichment is achieved.

B. Form Focused Crawler
The Form Focused Crawler deals with the sparse distribution of forms on the Web. It avoids crawling through unproductive links by restricting the search to a specific topic, learning the characteristics of links and pages that lead to pages containing searchable forms, and using appropriate stopping criteria. The crawler uses two classifiers, for pages and for links, to guide its search; later, a third classifier, the form classifier, is used to filter out useless forms.

C. Focused Crawler
A Focused Crawler collects documents that are specific and related to a given topic; because of the way it works, it is sometimes also known as a Topic Crawler. A focused crawler tends to download pages that are related to each other, and it determines whether a given page is similar to the specific topic. One of the advantages of the focused crawler is its economy in hardware and network resources: it reduces the amount of network traffic, logging and downloads [3]. A Focused Crawler searches, acquires, indexes, and maintains pages for specific groups of topics that represent a relatively narrow segment of the network. The crawler is run by a classifier that learns to recognize relevance from examples embedded in a topic taxonomy, and a distiller that identifies current online priority points.

Figure 2: General architecture of Focused Web Crawler

The Focused Crawler is a new approach to increasing the accuracy and expertise of internet search. An ideal focused crawler would download only those pages that are related to the topic, ignoring the others, and would anticipate the probability that a link leads to topic-related pages before downloading it. A Focused Crawler has three main components: a classifier that makes relevance judgments on crawled pages to decide on the expansion of outgoing links, a distiller that measures the centrality of crawled pages to determine visit priorities, and a crawler with dynamically reconfigurable priority controls governed by the classifier and distiller. The focused crawler also aims to provide a simple alternative for overcoming the issue of pages that are momentarily ranked low with respect to the topic in question: the idea is to recursively execute an exhaustive search up to a certain depth, starting from the relatives of a highly ranked page [4].
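The relevance judgment of a focused crawler can be approximated by lexical similarity between the topic keywords and the page text. The cosine-style score below is a simplified stand-in for the classifier described above, not the component used in the cited systems:

```python
import math
from collections import Counter

def relevance(topic_keywords, page_text):
    """Score a page against the topic: cosine similarity between
    term-frequency vectors of the topic keywords and the page text."""
    topic = Counter(w.lower() for w in topic_keywords)
    page = Counter(page_text.lower().split())
    dot = sum(topic[w] * page[w] for w in topic)
    norm = math.sqrt(sum(v * v for v in topic.values())) * \
           math.sqrt(sum(v * v for v in page.values()))
    return dot / norm if norm else 0.0

topic = ["web", "crawler", "search"]
on_topic = "a web crawler feeds the search engine index"
off_topic = "recipes for summer salads and cold soups"
print(relevance(topic, on_topic) > relevance(topic, off_topic))  # True
```

A real focused crawler would apply such a score to the source page of each outbound link, following only links whose score clears a threshold.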
D. Parallel Crawler
This crawler balances the load between servers. As the size of the internet grows, it becomes difficult to retrieve the whole or a major portion of the web using a single process. Therefore, many search engines run multiple crawling processes in parallel to perform this task, so that the download rate is maximized. This kind of crawler is known as a parallel crawler [5]. A parallel crawler consists of multiple crawling processes, referred to as C-procs, which can run on a network of workstations [6]. Parallel crawlers rely on page freshness and page selection. A parallel crawler may run on a local network or be distributed at geographically different locations. Parallelization of the crawling system is essential for downloading documents in a reasonable amount of time.

E. Distributed Crawler
Distributed web crawling is a distributed computing technique. Many crawlers operate in a distributed fashion within the web crawling process in order to achieve as much web coverage as possible [7]. A central server manages the communication and synchronization of the nodes, which are geographically distributed. It mainly uses the PageRank algorithm to increase its efficiency and search quality. The advantage of a distributed web crawler is that it is resilient to system crashes and other events, and can be adapted to many crawling applications. To design an efficient web crawler, the crawling task must be distributed between multiple machines in a synchronized process. Large websites should be distributed individually on the network, and they should provide the right chance and rationality for synchronous access. Meanwhile, synchronized distribution saves network bandwidth resources [4].
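One common way to split the crawl task between multiple C-procs is hash-based partitioning of URLs by host, so that each process owns a disjoint share of the web and all pages of one site stay on one machine. This is a generic sketch of the idea, not the scheme of any particular cited system:

```python
import hashlib

def assign_to_cproc(url, n_procs):
    """Assign a URL to one of n crawling processes (C-procs) by hashing
    its host name; same host always maps to the same process."""
    host = url.split("/")[2] if "://" in url else url
    digest = hashlib.md5(host.encode()).hexdigest()
    return int(digest, 16) % n_procs

urls = [
    "https://example.com/a",
    "https://example.com/b",   # same host -> same C-proc as /a
    "https://example.org/x",
]
print([assign_to_cproc(u, 4) for u in urls])
```

Because the assignment is a pure function of the host, each C-proc can decide locally whether a discovered link belongs to it or must be forwarded to another node, which keeps synchronization traffic low.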
4. Web crawling algorithms

i. Breadth First Search
It starts with a small set of pages and then explores other pages by following their links in breadth-first order. In practice, websites are not necessarily traversed strictly in this order; a variety of policies can be used, for example crawling the most important pages first. This method is used by many search engines. The breadth-first algorithm works level by level: it starts at the root URL and searches all the neighbouring URLs at the same level. If the desired URL is found, the search terminates; if not, the search proceeds down to the next level, and the process repeats until the goal is reached. When all the URLs have been scanned and the objective is still not found, a failure is reported. The breadth-first search algorithm is generally used when the objective lies in the shallower parts of a deep tree [8].

ii. Depth First Search
This is an algorithm for traversing or searching tree or graph data structures. It systematically examines the search space starting from the root node and penetrating deeper through the child nodes. If there is more than one child, priority is given to the leftmost child, and the search penetrates deeply until no more children are available; it then backtracks to the next unexplored node and proceeds in a similar manner. This algorithm ensures that all edges are visited once. It is suitable for search problems, but when the branches are large the algorithm can end up in an endless loop.
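The two traversals above differ only in how the frontier is consumed: a FIFO queue yields breadth-first (level by level) order, a LIFO stack yields depth-first order. A minimal sketch on a toy link graph (the graph itself is invented for illustration):

```python
from collections import deque

# A toy link graph: each URL maps to the URLs it links to.
links = {
    "root": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}

def traverse(start, breadth_first=True):
    """Visit every reachable URL; popleft() makes the frontier a FIFO
    queue (BFS), pop() makes it a LIFO stack (DFS)."""
    frontier = deque([start])
    visited = []
    seen = {start}
    while frontier:
        url = frontier.popleft() if breadth_first else frontier.pop()
        visited.append(url)
        for child in links[url]:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return visited

print(traverse("root", breadth_first=True))   # level by level: root, a, b, c, d
print(traverse("root", breadth_first=False))  # deep first: root, b, d, a, c
```

The `seen` set is what prevents the endless loop mentioned above when the link graph contains cycles.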
iii. Best First Search
Best-first algorithms are often used to find search paths. Best First Search is a search algorithm that explores a graph starting from the most promising node, selected according to a specified rule. The basic idea is that, given a URL frontier, the best URL is chosen according to some evaluation criteria such as precision, recall and F-score. In this algorithm, the URL selection process is driven by the lexical similarity between the topic keywords and the source page of the URL; the similarity between the page and the topic keywords is thus used to evaluate the fit of all the outbound links of the page.

Figure 3: The best-first algorithm pseudo-code

iv. Fish Search Algorithm
The main principle of the algorithm is the following: it takes as input an initial URL and a search query, and dynamically builds a priority list (initialized with the initial URL) of the next URLs, referred to as nodes, to be explored. In each step the first node is removed from the list and processed. As the text of each document becomes available, it is analyzed by a scoring component that assesses whether it is relevant or irrelevant to the search query (value 1 or 0) and, based on that result, a heuristic decides whether or not to pursue the search in that direction: whenever a document source is retrieved, it is scanned for links. The nodes behind these links are assigned a depth value. If the parent is relevant, the depth of its children is set to a predetermined value; otherwise, the depth of the children is set to one less than the depth of the parent. When the depth reaches zero, that direction is abandoned and none of the node's children are added to the list [9].

v. Shark-Search Algorithm
The main flaw of the Fish-Search algorithm is that its relevance computation is too simple: it only yields 0 or 1, irrelevant or relevant respectively. Secondly, each node's potential score has low precision, taking only three values (0, 0.5 and 1). Aiming at these disadvantages, Michael Hersovici [10] put forward an improved Shark-Search algorithm, which mainly refines the page relevance computation and the method for computing the potential score. In detail, the process is:
- Use a vector space model to compute the relevance between the page and the user query.
- Consider the information given by the anchor text near each hyperlink and compute its relevance to the user's query.
- Combine both of the above factors in the formula that computes a child node's potential score.
Through these improvements, the efficiency of the Shark-Search algorithm is much better than that of Fish-Search [2].
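The depth bookkeeping of Fish-Search can be sketched as follows. The binary `is_relevant` predicate stands in for the scoring component, the link fetcher is simulated with a dictionary, and the depth constant is an assumed parameter:

```python
def fish_search(start, fetch_links, is_relevant, initial_depth=3, max_nodes=50):
    """Fish-Search depth rule: a relevant parent resets its children's
    depth to the predetermined value; an irrelevant parent passes down
    depth - 1; a node at depth 0 contributes no children at all."""
    frontier = [(start, initial_depth)]   # priority list of (node, depth)
    visited = []
    seen = {start}
    while frontier and len(visited) < max_nodes:
        node, depth = frontier.pop(0)     # process the first node in the list
        visited.append(node)
        if depth == 0:
            continue                      # direction abandoned: children not added
        child_depth = initial_depth if is_relevant(node) else depth - 1
        for child in fetch_links(node):
            if child not in seen:
                seen.add(child)
                frontier.append((child, child_depth))
    return visited

# Toy example: only URLs containing "fish" count as relevant.
links = {"seed": ["fish1", "x"], "fish1": ["fish2"], "x": ["y"],
         "fish2": [], "y": []}
print(fish_search("seed", lambda n: links.get(n, []),
                  lambda n: "fish" in n, initial_depth=1))
```

With `initial_depth=1` the irrelevant seed hands depth 0 to its children, so `fish1` and `x` are still visited but their own links (`fish2`, `y`) are pruned, exactly the cut-off behaviour described above. Shark-Search replaces the binary `is_relevant` with a continuous vector-space score.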
vi. Page Rank Algorithm
In the PageRank algorithm, the web crawler decides on the importance of the pages of each website through the total number of links or citations to a page: PageRank is calculated from the relatedness between web pages. The rank of a page is computed as:

PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)),   (1)

where PR(A) is the PageRank of a given page A, d is the damping factor, and T1, ..., Tn are the pages that link to A. To find the PageRank of a page A, PR(A), you must first find all the pages that link to A. For each such page Ti, C(Ti) gives the number of outbound links on page Ti. Summing the contributions PR(Ti)/C(Ti) over all pages T1, T2, T3, ... that link to A, and applying the damping factor, yields the PageRank of the page [11].
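Formula (1) can be applied iteratively until the ranks stabilize. The sketch below uses the non-normalized variant given above on an invented three-page toy web; the damping factor d = 0.85 is a conventional choice, not one stated in the text:

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively apply formula (1): PR(A) = (1 - d) + d * sum(PR(T)/C(T))
    over all pages T that link to A, where C(T) is T's outbound link count."""
    pages = list(links)
    out_degree = {p: len(targets) for p, targets in links.items()}   # C(T)
    # Invert the graph: for each page, the pages that link to it.
    in_links = {p: [q for q, targets in links.items() if p in targets]
                for p in pages}
    pr = {p: 1.0 for p in pages}          # initial ranks
    for _ in range(iterations):
        pr = {p: (1 - d) + d * sum(pr[t] / out_degree[t] for t in in_links[p])
              for p in pages}
    return pr

# Toy web: A and B link to each other, C links only to A.
ranks = pagerank({"A": ["B"], "B": ["A"], "C": ["A"]})
# A has the most in-links, so it ends up with the highest rank.
print(max(ranks, key=ranks.get))  # -> A
```

As Table 1 notes, this popularity-based scheme favors older pages: a new page starts with few in-links and therefore a low rank, however good its content.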
Table 1
Advantages and limitations of web crawling algorithms

Algorithm | Advantages | Limitations
Breadth First Search | Suitable for situations where the solution is located near the beginning of a deep tree. | If the solution is far away, it consumes time; it also consumes a large amount of memory.
Depth First Search | Suitable for in-depth search problems; consumes very little memory. | If the edges are large, the algorithm can end in an endless cycle.
Fish Search | Helpful in forming the priority table. | Network resource usage is high; Fish-Search crawlers place a significant load not only on the network but also on web servers.
Shark Search | Improves the page relevance computation, query computation and potential-score computing method. | Network resource usage is high.
Page Rank | Returns the most important pages in a short time, since rank is calculated on the basis of the popularity of a page. | Favors older pages, because a new page, even a very good one, will not have many links unless it is part of an existing website.

5. Approach

A traditional crawler worked simply by extracting static data from HTML code, and until recently most websites would undergo the same crawling process. The crawling process is no longer as simple as it was a few years ago, due to the increasing use of JavaScript frameworks such as Angular, React and Meteor. Many websites are JavaScript-heavy and generate content by making asynchronous JavaScript calls after the page is loaded. The use of these frameworks makes the developer's life simpler and provides many benefits for creating dynamic sites. To crawl this type of website, web crawlers use Selenium.

Selenium is a web browser automation tool originally designed to automate web applications for testing purposes. It is now used for many other applications, such as automating web-based admin tasks and interacting with platforms that do not provide an API, as well as for web crawling.

Building a focused web crawler using the Selenium tool is a good way to collect useful information. The focused crawler is an approach for increasing the accuracy and expertise of internet search. An ideal focused crawler would download only the related pages, ignoring the others, and would anticipate the probability that a link leads to a topic-related site before downloading it.

One use case of a focused web crawler is extracting financial data. The financial market is a place of risks and instability. It is hard to predict how the curve will go, and sometimes, for investors, one decision can be a make-or-break move. That is why experienced practitioners never lose track of financial data. Financial data, when extracted and analyzed in real time, can provide valuable information for investments and trading, and people in different positions scrape financial data for varied purposes.
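Whether a page can be crawled statically or needs a Selenium-driven browser can often be estimated from its raw HTML before rendering. The heuristic below, including its text-length threshold, is an illustrative assumption of the authors' point about JavaScript-heavy pages, not a standard rule:

```python
from html.parser import HTMLParser

class ContentCounter(HTMLParser):
    """Count visible text length and <script> tags in raw (pre-rendered) HTML."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text_chars = 0
        self.script_tags = 0
    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True
            self.script_tags += 1
    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False
    def handle_data(self, data):
        if not self.in_script:
            self.text_chars += len(data.strip())

def needs_browser_rendering(html, min_text=40):
    """Heuristic: scripts present but almost no static text suggests the
    content is generated client-side, so a real browser (e.g. via
    Selenium) is needed to see it."""
    c = ContentCounter()
    c.feed(html)
    return c.script_tags > 0 and c.text_chars < min_text

static_page = "<html><body><p>" + "Quarterly report text. " * 5 + "</p></body></html>"
spa_page = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
print(needs_browser_rendering(static_page))  # False: content is in the HTML
print(needs_browser_rendering(spa_page))     # True: content arrives via JavaScript
```

A crawler could use such a check to fall back on a plain HTTP fetch for static pages and reserve the much more expensive browser automation for pages that actually require it.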
6. Conclusions

The web crawler is an essential tool of information retrieval, which roams the Web and downloads the web documents that suit the user's needs. Web crawlers are used by search engines and other users to regularly ensure that their databases are up to date. This article has presented a review of the different crawling technologies and algorithms, and of why "focused crawling" technology is being used. The crawling algorithm is the most important part of any search engine. Focused crawlers use more complex systems and techniques to identify information of high relevance and quality. The search algorithm is the heart of the search engine system, and the choice of algorithm has a significant impact on the operation and effectiveness of the focused crawler and the search engine. In conclusion, the focused crawler, compared to other crawlers, is intended for advanced web users: it focuses on a specific topic and does not waste resources on irrelevant material.

7. References

[1] Mahmud, Hasan, Soulemane, Moumie, Rafiuzzaman, Mohammad. A framework for dynamic indexing from hidden web. International Journal of Computer Science Issues, 8, 2011.
[2] Su Guiyang, Li Jianhua, Ma Yinghua, Li Shenghong, Song Juping. New Focused Crawling Algorithm. Department of Electronic Engineering, Shanghai Jiaotong University, Shanghai 200030, P. R. China, 2004.
[3] Christopher Olston, Marc Najork. Web Crawling.
[4] Gautam Pant, Padmini Srinivasan, Filippo Menczer. Crawling the Web (4-6). Web Dynamics. Department of Management Sciences, School of Library and Information Science, The University of Iowa, 2004.
[5] Dhiraj Khurana, Satish Kumar. Web Crawler: A Review. IJCSMS International Journal of Computer Science & Management Studies, Vol. 12, Issue 01, January 2012.
[6] Trupti V. Udapure, Ravindra D. Kale, Rajesh C. Dharmik. Study of Web Crawler and its Different Types. IOSR Journal of Computer Engineering (IOSR-JCE), Volume 16, Issue 1, Ver. VI, Feb. 2014, pp. 01-05.
[7] Yugandhara Patil, Sonal Patil. Review of Web Crawlers with Specification and Working. Vol. 5, Issue 1, January 2016.
[8] Junghoo Cho, Hector Garcia-Molina. Effective Page Refresh Policies for Web Crawlers. ACM Transactions on Database Systems, 2003.
[9] Andas Amrin, C. X. Focused Web Crawling Algorithms. Shanghai, China, 2015.
[10] Michael Hersovici, Michal Jacovi, Yoelle S. Maarek, et al. The Shark-Search algorithm. An application: Tailored Web Site Mapping. Computer Networks and ISDN Systems 30, 1998, pp. 317-326.
[11] Tian Chong. A Kind of Algorithm For Page Ranking Based on Classified Tree In Search Engine. Proc. International Conference on Computer Application and System Modeling (ICCASM 2010).