=Paper=
{{Paper
|id=Vol-2786/Paper44
|storemode=property
|title=Analysis of Semantic and Non-Semantic crawlers
|pdfUrl=https://ceur-ws.org/Vol-2786/Paper44.pdf
|volume=Vol-2786
|authors=Shridevi s,Shashwat Sanket,Jayraj Thakor,Dhivya M
|dblpUrl=https://dblp.org/rec/conf/isic2/sSTM21
}}
==Analysis of Semantic and Non-Semantic crawlers==
Shridevi S, Shashwat Sanket, Jayraj Thakor, Dhivya M
Vellore Institute of Technology, Chennai, India

ISIC'21: International Semantic Intelligence Conference, February 25-27, 2021, Delhi, India.

Abstract

A focused crawler traverses the World Wide Web, selects the pages that are relevant to a predefined topic and neglects those that are not of interest. It collects domain-specific documents and is considered one of the most important ways to gather information. However, centralized crawlers are not adequate to spider meaningful and relevant portions of the Web. A crawler that is scalable and good at load balancing can improve the overall performance. With the number of web pages on the internet growing day by day, distributed web crawling is therefore of prime importance for downloading pages efficiently in terms of time and for increasing the coverage of crawlers. This paper describes different semantic and non-semantic web crawler architectures, broadly classifying them into non-semantic (serial, parallel and distributed) and semantic (distributed and focused) crawlers. All the aforementioned types are implemented using libraries provided by Python 3, and a comparative analysis is carried out among them. The purpose of this paper is to outline how different processes can be run in parallel and on a distributed system, and how they interact with each other using shared variables and message-passing algorithms.

Keywords: Semantic crawler, serial, parallel, distributed, message passing, shared variables

1. Introduction

A web crawler, also known as a spiderbot, is a system made up of a program or an automated script that downloads web pages on a large scale. Web crawlers are used in various applications and in diverse domains. In fact, web crawling is one of the impact factors for the growth of the internet in domains like marketing and e-commerce. In e-commerce, crawlers can be used for price comparison and to monitor recent market trends. Similarly, they can be used to predict stock market movements by analysing social media content, blogs and other data from different websites. Web crawlers are the primary component of web search engines, whose purpose is to collect web pages in bulk, index them and execute user-defined queries to find relevant pages. A similar use is web archiving, where web pages are collected and preserved or stored for future use. Web crawlers are also used to create replicas of visited pages, which search engines process for faster search optimization, and for web data mining, where pages are analysed statistically. In addition, crawlers are used to collect specific information, for example harvesting spam email addresses or supporting application testing. Because of the rapid increase in web pages, and because most data on the web is unstructured, semantic crawlers are used for the retrieval of context-relevant web pages. Semantic crawlers have different architectures, such as distributed, parallel, focused and incremental crawlers.

Today, web crawlers form an important part of various software services and have evolved into large-scale, integrated, distributed software, proving that they are not just programs maintaining a list of pages to be crawled. The web crawler is the principal and most time-demanding element of a web search engine.
It consumes a huge amount of CPU time, memory and storage space to crawl through the ever-increasing and dynamic web. The time it takes to crawl the web should be as small as possible so that the search results stay up to date.

Parallel and distributed processing is one way to increase the speed of the crawling process, thanks to technological advancement and improvements in hardware architectures. This work consists of the implementation of and comparison between different web crawler architectures, namely serial, parallel and distributed. The purpose of this work is to outline how we can increase the processing capabilities of web crawlers and obtain query output in a shorter amount of time. The paper covers in detail how different processes can be executed in parallel and on a distributed system, and how they interact with each other using shared variables and message-passing algorithms.

2. Existing Work and Literature Survey

This section describes recent work related to crawler processing. In "Speeding up the web crawling process on a multi-core processor using virtualization" [1], Hussein Al-Bahadili, Hamzah Qtishat and Reyadh S. Naoum present and analyse a new approach that increases crawler efficiency in terms of time through virtualization on a multi-core processor. In their work they divide the multi-core processor into many virtual machines (VMs), so that tasks can be executed concurrently on different data. They also describe the implementation and analysis of a VM-based distributed web crawler after rigorous testing.

J. Cho, H. Garcia-Molina and L. Page [2] describe in what sequence or order URLs should be visited by a crawler so as to obtain the important pages first. This method of rapidly obtaining pages of prime importance helps to save time when a crawler cannot cover the whole of the increasing and dynamically changing web. They created a dataset by downloading an image of the Stanford web pages and performed experiments by modifying and using different large-scale and small-scale crawlers, such as a PageRank crawler, breadth-first and depth-first search crawlers and backlink-based crawlers.

"Google's Deep-Web Crawl" by J. Madhavan, D. Ko et al. [3] is another notable work, describing how the contents of the deep web are crawled for the Google search engine. They describe a system that extracts deep-web content by pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The system is built around three main goals. The first is to develop an approach that is time-saving, automatic and scalable for indexing hidden web content behind HTML forms that vary in domain and language across the world. The second is to develop two types of algorithm: one that identifies inputs which accept only specific value types, and another that accepts keywords to select input values for text search inputs. The third is to develop an algorithm that iterates over the possible input combinations to identify and generate URLs suitable for a web search index.

Anirban Kundu, Ruma Dutta, Rana Dattagupta and Debajyoti Mukhopadhyay, in their paper "Mining the web with hierarchical crawlers – a resource sharing based crawling approach" [4], propose an extended web crawling method that crawls the internet on behalf of a search engine. The approach combines parallelism and focused crawling using multiple crawlers. The algorithm divides the entire structure of a website into levels based on the hyperlink structure and downloads pages level by level, with a dynamic number of crawlers at each level. The number of crawlers required is determined at run time from demand, using a thread-based program driven by the number of hyperlinks on each page.

M. Sunil Kumar and P. Neelima, in "Design and Implementation of Scalable, Fully Distributed Web Crawler for a Web Search Engine" [5], present Dcrawler, which is highly scalable and distributed. Its core features are decentralization of tasks, an assignment function that partitions the domain for the crawlers to crawl effectively, the ability to cooperate with other web servers, and platform independence. Identifier-Seeded Consistent Hashing is used as the assignment function. Tests with distributed crawlers show that Dcrawler performs better than traditional centralized crawlers and that performance improves further as more crawlers are added.

T. Patidar and A. Ambasth, in "Improvised Architecture for Distributed Web Crawling" [6], propose reliable and efficient methods for a scalable web crawler. They also discuss challenges and issues regarding web structure, job scheduling, spider traps and URL canonicalization. The components of their proposed system include a Child Manager, Cluster Manager, Bot Manager and an incremental batch analyser for re-crawling. Their results show a successfully implemented distributed crawler with politeness techniques and selection policies, although challenges such as resource utilization remain.

"A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration" [7] by E. Dragut et al. describes an algorithm that extracts query interfaces and maps them into a hierarchical representation. The algorithm is divided into four steps, namely token extraction, tree of fields, tree of text tokens and integration, thereby turning an extraction algorithm into an integration algorithm. Experiments on three datasets (ICQ, Tel8 and WISE) evaluate the algorithm with performance metrics such as leaf labelling, schema tree structure and a gold standard.

D. H. P. Chau, S. Pandit, S. Wang and C. Faloutsos describe parallel crawling by illustrating it on an online auction website in "Parallel Crawling for Online Social Networks" [8]. Their dynamic-assignment architecture ensures that the failure of one crawler does not affect the others and that there is no redundant crawling. They visited about 11 million users, of which approximately 66,000 were crawled completely.

J. Cho and H. Garcia-Molina [9] propose different architectures for parallel web crawlers and metrics to evaluate their performance, and discuss the issues related to parallel crawling, such as overlap, quality and communication bandwidth, as well as its advantages, such as scalability, network-load dispersion and network-load reduction.

C. C. Aggarwal, F. Al-Garawi and P. S. Yu, in "Intelligent crawling on the world wide web with arbitrary predicates" [10], describe intelligent crawling as a method that learns properties and features of the linkage structure of the WWW while crawling. Their technique is more general than focused crawling, which relies on a predefined structure of the web, and is applicable to web pages that support arbitrary user-defined topical and keyword queries. It is also capable of reusing information gained from previous crawls in order to crawl more efficiently the next time.
3. Architectures and Implementation

3.1 Serial Web Crawler

The crawler maintains a list of unvisited URLs called the frontier, which acts as a queue. The list is initialized with seed URLs. Each crawling loop involves picking the next URL from the frontier, checking whether the URL has already been visited and, if not, fetching the corresponding page through HTTP, parsing the retrieved page to extract URLs and application-specific information, and finally adding the unvisited URLs to the frontier. The crawling process may be terminated when a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, this signals a dead end: the crawler has no new page to fetch and hence it stops. Figure 1 shows the flow of a basic serial web crawler. The complete implementation of the above model can be found in Implementation 1.

Figure 1: Flow of the serial web crawler.

Algorithm: Serial crawler
1. Initialise a crawler constructor with the variables pageTable and revPageTable assigned to HashMaps.
2. Define a method get_seed() to get the seed_url.
3. Parse the seed_url and save its hostname.
4. Initialize the frontier.
5. Define a method test_seed_url().
6. If the url has the same hostname as the seed url, return false; otherwise return true.
7. Define a method to get all urls.
8. Try fetching the url and throw an exception on failure.
9. Parse the page using an HTML parser.
10. For each link in the parsed page, find all 'a' tags.
11. If the parsed link is not a safe URL,
12.   throw "invalid URL";
13. otherwise append the url and link.
14. // end if
15. // end for each
16. Define a method crawl().
17. For each current url in the frontier:
18.   fetch all the urls and update the frontier length.
19.   For each url in urls:
20.     if the current url is in the page table, push the url to the index table;
21.     if the url is not in the frontier and test_seed_url(url) holds, push the url to the frontier.
22.     // end if
23.     // end if
24.   // end for each
25. // end for each
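As an illustration of the serial model, the following is a minimal Python 3 sketch of the crawl loop described above, using Beautiful Soup (the parser named in the paper) together with the requests library. The class and variable names (SerialCrawler, page_table, max_pages) and the same-host filtering rule are illustrative assumptions, not the authors' actual code.

<pre>
# Minimal serial crawler sketch: frontier as a queue, visited pages in a page table.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


class SerialCrawler:
    def __init__(self, seed_url, max_pages=50):
        self.seed_host = urlparse(seed_url).hostname
        self.frontier = deque([seed_url])     # unvisited URLs
        self.page_table = {}                  # url -> extracted links
        self.max_pages = max_pages

    def same_host(self, url):
        return urlparse(url).hostname == self.seed_host

    def get_all_urls(self, url):
        page = requests.get(url, timeout=5)
        parsed = BeautifulSoup(page.text, "html.parser")
        return [urljoin(url, a["href"]) for a in parsed.find_all("a", href=True)]

    def crawl(self):
        while self.frontier and len(self.page_table) < self.max_pages:
            current = self.frontier.popleft()
            if current in self.page_table:
                continue                      # already visited
            try:
                links = self.get_all_urls(current)
            except requests.RequestException:
                continue                      # skip unreachable pages
            self.page_table[current] = links
            for link in links:
                if link not in self.page_table and self.same_host(link):
                    self.frontier.append(link)
        return self.page_table


if __name__ == "__main__":
    print(len(SerialCrawler("https://vit.ac.in").crawl()), "pages crawled")
</pre>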
3.2 Parallel Web Crawler

Parallel crawlers can be understood as several modified serial crawlers running as separate processes. These multiple processes run in parallel, hence the name parallel web crawler. Figure 2 shows the flow chart of the working of the parallel web crawler. Here a process pool is created that is managed by a process manager, which is also responsible for spawning and scheduling new processes, together with a shared memory area that is used as the frontier. Note that this is a simple parallel web crawler: although there are many different ways of partitioning URLs, as mentioned in [11], the main aim here is to create a baseline model.

Figure 2: Flow of the parallel web crawler.

Algorithm: Crawler
1. Initialise a crawler class with a constructor.
2. Define a method test_seed_url with the seed url as parameter.
3. If the hostname is the same as that of the seed url, return false;
4. otherwise return true.
5. Define a method get_all_urls with a url as parameter.
6. Fetch and parse the HTML page.
7. For each link in parsedPage.findAll('a', href=TRUE):
8.   if not safeURL(link), throw "URL invalid";
9.   otherwise append the url: urls.append(url + link).
10. Initialise the daemon server.

Algorithm: Frontier
1. Initialise a frontier_manager class with a constructor.
2. Initialise the process pool and the seed_url.
3. Initialise the frontier with the global list and give each process a token via the lock acquire and release methods.
4. For each url in urls:
5.   if the url is not in the frontier, push it into the frontier and release the lock.
6. // end if
7. Define a method to write to the index table.
8. The writer acquires the lock and, if the seed url is not already a key in the index table, adds the url to the table;
9. otherwise indextable[local_seed_url].extend(url).
10. Define a method make_write and write the urls to the frontier and the index table.
11. Define a static method crawl and initialise a crawler.
12. Return the seed url.
13. Define a method start and create a pool of processes.
14. Close the process pool.

The above baseline code uses multiprocessing for creating and managing multiple processes. A lock-based system is used to access the shared memory space. The crawler code is very similar to that of the serial web crawler; the modification lies in the frontier. The frontier spawns two crawler workers to fetch the pages.
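The sketch below illustrates the lock-guarded shared frontier and the two-worker pool described above, using Python's multiprocessing module, which is the mechanism the baseline relies on. The Manager-based shared structures and the helper names (fetch_links, worker) are illustrative choices under stated assumptions, not the authors' exact code.

<pre>
# Parallel crawler sketch: two worker processes share a frontier list guarded by a lock.
from multiprocessing import Manager, Pool
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def fetch_links(url):
    """Fetch one page and return the absolute URLs of its anchor links."""
    page = requests.get(url, timeout=5)
    soup = BeautifulSoup(page.text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]


def worker(args):
    frontier, page_table, lock, seed_host, limit = args
    while True:
        with lock:                              # pick the next unvisited URL
            if not frontier or len(page_table) >= limit:
                return                          # (exits when the shared frontier is momentarily empty)
            url = frontier.pop(0)
            if url in page_table:
                continue
            page_table[url] = []                # mark as visited
        try:
            links = fetch_links(url)
        except requests.RequestException:
            continue
        with lock:                              # write back under the lock
            page_table[url] = links
            for link in links:
                if link not in page_table and urlparse(link).hostname == seed_host:
                    frontier.append(link)


if __name__ == "__main__":
    seed = "https://vit.ac.in"
    with Manager() as manager:
        frontier = manager.list([seed])
        page_table = manager.dict()
        lock = manager.Lock()
        args = (frontier, page_table, lock, urlparse(seed).hostname, 50)
        with Pool(2) as pool:                   # the baseline spawns two crawler workers
            pool.map(worker, [args, args])
        print(len(page_table), "pages crawled")
</pre>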
3.3 Distributed Web Crawler

Distributed web crawlers [14,16,17] are a technique in which many computers participate in the crawling process by contributing their computing bandwidth. The proposed architecture acts as a baseline for this technique. There is one central server acting as the nameserver and four other servers acting as worker crawlers. Dynamic assignment is used as the policy: the nameserver dynamically assigns the URLs and balances the load. Note that the nameserver itself is not responsible for crawling, in order to reduce its workload.

Apart from its dynamic assignment job, the nameserver is also responsible for monitoring the heartbeat and other meta-information of the worker crawlers. Since the nameserver can be a single point of failure (SPOF) during the task, it saves all the meta-information in the form of checkpoints in the global store. On failure of the nameserver, one of the crawlers is elected as the new nameserver, which fetches the latest checkpoint and continues the task. Note that the crawlers also receive the heartbeat signal of the nameserver, in order to identify when the nameserver is down.

Before going into the flow: two frontiers are used, a local frontier and a global frontier. The local frontier belongs to a worker instance, whereas the global frontier is part of the global store. The client triggers the nameserver by providing the seed URLs to crawl; the nameserver initializes the global frontier with the seed URLs and dynamically assigns URLs to the respective crawlers' local frontiers. Each crawler individually acts as a serial web crawler with its own DNS resolver, frontier queue and pagetable. For filtering URLs, the crawlers also communicate with the global store to check whether a URL has already been visited. Upon completing its crawling process, a crawler dumps its pagetable to the common storage and asks the nameserver to allocate new seed URLs. This process continues until termination is triggered by the nameserver.

Figure 3: Flow of the distributed web crawler using a client-server architecture.

Algorithm: Crawler
1. Initialise a crawler class with a constructor.
2. Define a method test_seed_url with the seed url as parameter.
3. If the hostname is the same as that of the seed url, return false;
4. otherwise return true.
5. Define a method get_all_urls with a url as parameter.
6. Fetch and parse the HTML page.
7. For each link in parsedPage.findAll('a', href=TRUE):
8.   if not safeURL(link), throw "URL invalid";
9.   otherwise append the url: urls.append(url + link).
10. Initialise the daemon server.

Algorithm: Frontier manager
1. Initialize the Frontier_Manager class and the seed url variable.
2. Create an index table using a HashMap.
3. Define a method test_seed_url with the seed url as parameter.
4. If the hostname is the same as that of the seed url, return false;
5. otherwise return true.
6. Define a method write_to_frontier with seed_url and the host urls as parameters.
7. For each url in urls:
8.   if not (url in frontier and test_seed_url(url)), then self.frontier.append(url.strip()).
9. Define a method write_to_index_table.
10. If the local seed url is not present in the index table, add the local seed url;
11. otherwise index_table[local_seed_url].extend(urls).
12. Write the local_seed_url and urls to the frontier and the index table.
13. Invoke the crawl method using the index and crawler variables.
14. // end if
15. // end for each
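Since the paper states that the Pyro4 library is used to simulate the described architectures, a minimal sketch of the nameserver-side global frontier as a Pyro4 remote object is given below. The object name crawler.frontier and the methods get_url/put_urls are assumptions for illustration; checkpointing, heartbeat monitoring and nameserver re-election are omitted, and a running Pyro4 name server (pyro4-ns) is assumed.

<pre>
# Distributed crawler sketch with Pyro4: the frontier manager is a remote object that
# worker crawlers on other machines call to obtain work and to report extracted links.
import Pyro4


@Pyro4.expose
class FrontierManager:
    """Global frontier kept in the nameserver-side process."""

    def __init__(self, seed_urls):
        self.frontier = list(seed_urls)   # global frontier
        self.visited = set()              # global "already crawled" filter

    def get_url(self):
        # Dynamically assign the next unvisited URL to whichever worker asks.
        while self.frontier:
            url = self.frontier.pop(0)
            if url not in self.visited:
                self.visited.add(url)
                return url
        return None                       # signals the worker that no work is left

    def put_urls(self, urls):
        # Workers report newly discovered links; only unseen ones enter the frontier.
        for url in urls:
            if url not in self.visited:
                self.frontier.append(url)


if __name__ == "__main__":
    daemon = Pyro4.Daemon()                                   # hosts the remote object
    uri = daemon.register(FrontierManager(["https://vit.ac.in"]))
    Pyro4.locateNS().register("crawler.frontier", uri)        # requires a running pyro4-ns
    print("Frontier manager ready:", uri)
    daemon.requestLoop()
</pre>

A worker on another machine would obtain a proxy with Pyro4.Proxy("PYRONAME:crawler.frontier"), repeatedly call get_url(), crawl the returned page locally as in the serial sketch, and report the extracted links back through put_urls().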
3.4 Semantic Distributed Web Crawler

Distributed semantic web crawlers are used for crawling both semantic web pages in RDF/OWL format and HTML pages. The distributed semantic crawler uses a component called the page analyser for understanding the page context. The ontology analyser creates models for the fetched OWL/RDF pages and stores them; these models are later matched against the stored ontology to make the crawling decision.

Figure 4: Architecture of the distributed semantic web crawler.

Algorithm: Crawler controller
1. Initialise a string with the seed URL.
2. Check whether the string is present in the database.
3. If the seed URL exists, print "already exists";
4. otherwise insert the URL details.
5. Assign a variable for the statement and add the seed details to the database.
6. If the statement is not empty, execute the statement;
7. otherwise print "statement not executed".

Algorithm: Model extraction
1. Initialise the variables id, url, html and langType.
2. Create an object and read the url.
3. Repeat while a statement is present:
4.   define variables and get the subject, predicate, object and URI;
5.   if the object is a URIResource, get its URI and assign it to ob;
6.   // end if
7.   if langType is HTML then
8.     if the subject does not contain '#' and is not null, save the subject to the database;
9.     if the predicate does not contain '#' and is not null, save the object to the database.

3.5 Focused Web Crawlers

Focused web crawlers [18] are used to collect web pages on a specific topic. These crawlers search the web for a predefined topic, which keeps irrelevant information away from the user and saves computational resources. The semantic focused crawler is multi-threaded, and each thread takes the web page with the highest dynamic semantic relevance (DSR) from a priority queue. The main work of a thread is to parse the various hyperlinks of that page and add them to the priority queue. Thus, the priority queue holds the details of the web pages that still have to be parsed by the threads. The semantic focused crawler also has a temporary queue that maintains the visited web pages; a thread checks this temporary queue to avoid revisiting pages.

Algorithm: Semantic focused crawler
Q: priority queue; DSR: dynamic semantic relevance; Links: queue of traversed URLs
1. Initialize the priority queue Q with the seed URLs.
2. Repeat while (!Q.empty() and fetch_cnt <= Limit) {
3.   web_page.url = Q.top.getUrl();            // get the most relevant URL from the priority queue
4.   fetch and parse web_page.url;
5.   web_page.urls = extract URLs (hyperlinks) from web_page.url;   // list of URLs
6.   for each web_page.urls[i] {
7.     already_exist = check web_page.urls[i] in Links;             // check for duplicates
8.     if (!already_exist) {
9.       enqueue web_page.urls[i] in Links;
10.      fetch and parse web_page.urls[i];
11.      compute the DSR of web_page.urls[i];
12.      enqueue (web_page.urls[i], DSR) in Q;
13.      store (web_page.urls[i], DSR) in the local database;
14.     } // end of if
15.   } // end of for each
16. } // end of repeat

The Pyro4 Python library is used to simulate the described architectures, and Beautiful Soup is used to parse the HTML pages.
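To make the priority-queue loop above concrete, the sketch below implements a focused crawl in Python using heapq. The dynamic semantic relevance (DSR) computation is not detailed in the paper, so a simple keyword-overlap score (relevance, TOPIC_TERMS) stands in for it here; those names, the scoring and the seed URL are assumptions, not the authors' method.

<pre>
# Focused crawler sketch: a priority queue ordered by a relevance score stands in for
# the dynamic semantic relevance (DSR) ranking described above.
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TOPIC_TERMS = {"semantic", "crawler", "ontology", "rdf"}   # illustrative topic keywords


def relevance(text):
    """Placeholder for DSR: fraction of topic terms that occur in the given text."""
    words = set(text.lower().split())
    return len(TOPIC_TERMS & words) / len(TOPIC_TERMS)


def focused_crawl(seed_urls, limit=20):
    queue = [(-1.0, url) for url in seed_urls]   # max-priority via negated scores
    heapq.heapify(queue)
    visited, results = set(), {}
    while queue and len(results) < limit:
        neg_score, url = heapq.heappop(queue)    # most relevant URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        results[url] = -neg_score
        page_score = relevance(soup.get_text())
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link not in visited:
                # Rank the link by the anchor text and the relevance of the page it came from.
                score = max(page_score, relevance(a.get_text()))
                heapq.heappush(queue, (-score, link))
    return results


if __name__ == "__main__":
    for url, score in focused_crawl(["https://en.wikipedia.org/wiki/Web_crawler"]).items():
        print(f"{score:.2f}  {url}")
</pre>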
4. Results

Table 1 and Figure 6 summarize the testing of all the semantic and non-semantic crawlers on the given websites. The total number of test cases is 30. Among the non-semantic crawlers, the distributed crawler outperforms the others in 20 cases, the parallel crawler in 8 and the serial crawler in only 2 cases; among the semantic crawlers, the semantic distributed crawler outperforms in 24 cases and the focused crawler in 26 cases. Therefore, the distributed crawler achieves an accuracy of about 66.67%, the parallel crawler about 26.67%, the serial crawler about 6.67%, the semantic distributed crawler 80% and the focused crawler about 86.66%.

Table 1: Number of times a specified crawler outperforms the other crawlers (30 test cases)
  Serial crawler: 2
  Parallel crawler: 8
  Distributed crawler: 20
  Semantic distributed crawler: 24
  Focused crawler: 26

Figure 6: Graphical representation of the number of times a specified crawler outperforms (test sites: vit.ac.in, chennai.vit.ac.in, en.wiki).

5. Conclusion

It can be concluded that, for the majority of the time, the focused crawler and the semantic distributed crawler give the best results for crawling a specific website. The results also show that the focused crawler works well as the number of crawled pages increases.
References

[1] Al-Bahadili, H., Qtishat, H., & Naoum, R. (2013). Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization. International Journal on Web Service Computing, 4, 19-37. doi:10.5121/ijwsc.2013.4102.
[2] Cho, J., Garcia-Molina, H., & Page, L. (1998). Efficient crawling through URL ordering. In Proceedings of the Seventh International Conference on World Wide Web (WWW7), Elsevier Science Publishers B.V., 161-172.
[3] Madhavan, J., Ko, D., Kot, Ł., Ganapathy, V., Rasmussen, A., & Halevy, A. (2008). Google's Deep Web crawl. Proceedings of the VLDB Endowment, 1(2), 1241-1252.
[4] Kundu, A., Dutta, R., Dattagupta, R., & Mukhopadhyay, D. (2009). Mining the web with hierarchical crawlers - a resource sharing based crawling approach. IJIIDS, 3, 90-106. doi:10.1504/IJIIDS.2009.023040.
[5] Kumar, M. S., & Neelima, P. (2011). Design and Implementation of Scalable, Fully Distributed Web Crawler for a Web Search Engine. International Journal of Computer Applications, 15. doi:10.5120/1963-2629.
[6] Patidar, T., & Ambasth, A. (2016). Improvised Architecture for Distributed Web Crawling. International Journal of Computer Applications, 151, 14-20.
[7] Kabisch, T., Dragut, E., Yu, C., & Leser, U. (2009). A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration. PVLDB, 325-336. doi:10.14778/1687627.1687665.
[8] Chau, D. H., Pandit, S., Wang, S., & Faloutsos, C. (2007). Parallel crawling for online social networks. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), ACM, New York, NY, USA, 1283-1284.
[9] Cho, J., & Garcia-Molina, H. (2002). Parallel crawlers. In Proceedings of the 11th International Conference on World Wide Web.
[10] Aggarwal, C., Al-Garawi, F., & Yu, P. (2001). Intelligent Crawling on the World Wide Web with Arbitrary Predicates. 96-105. doi:10.1145/371920.371955.
[11] Cho, J., & Garcia-Molina, H. Parallel Crawlers. University of California, Los Angeles.
[12] Kumar, N., & Singh, M. (2015). Framework for Distributed Semantic Web Crawler. In IEEE International Conference on Computational Intelligence and Communication Networks.
[13] Lokeshwaran, K., & Rajesh, A. (2018). A Study of Various Semantic Web Crawlers and Semantic Web Mining. International Journal of Pure and Applied Mathematics, 120(5), 1163-1173.
[14] Liu, F., & Xin, W. (2020). Implementation of Distributed Crawler System Based on Spark for Massive Data Mining. In 2020 5th International Conference on Computer and Communication Systems (ICCCS), Shanghai, China, 482-485. doi:10.1109/ICCCS49078.2020.9118442.
[15] Rajiv, S., & Navaneethan, C. (2020). Keyword Weight Optimization using Gradient Strategies in Event Focused Web Crawling. Pattern Recognition Letters.
[16] Bal, S. K., & Geetha, G. (2016). Smart distributed web crawler. In 2016 International Conference on Information Communication and Embedded Systems (ICICES), Chennai, 1-5. doi:10.1109/ICICES.2016.7518893.
[17] Wang, H., et al. (2018). Anti-Crawler strategy and distributed crawler based on Hadoop. In 2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA). IEEE.
[18] Boukadi, K., Rekik, M., Rekik, M., & Ben-Abdallah, H. (2018). FC4CD: a new SOA-based Focused Crawler for Cloud service Discovery. Computing, 100(10), 1081-1107.