Crawley: A Tool for Web Platform Discovery

Daniil Dobriy¹,*, Axel Polleres¹
¹ Vienna University of Economics and Business, Welthandelsplatz 1, 1020 Vienna, Austria

Abstract
Crawley, a Python-based command-line tool, provides an automated mechanism for web platform discovery. Incorporating capabilities such as Search Engine crawling, web platform validation and recursive hyperlink traversal, it facilitates the systematic identification and validation of a variety of web platforms. The tool's effectiveness and versatility are demonstrated via two successful use cases: the identification of Semantic MediaWiki instances, as well as the discovery of Open Data Portals, including OpenDataSoft, Socrata and CKAN. These empirical results underscore Crawley's capacity to support web-based research. We further outline potential enhancements of the tool, thereby positioning Crawley as a valuable tool in the field of web platform discovery.

Keywords
Web Crawling, Search Engine Automation, Web Platform Discovery, Open Data Portals, MediaWiki

ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece
* Corresponding author.
daniil.dobriy@wu.ac.at (D. Dobriy); axel.polleres@wu.ac.at (A. Polleres)
ORCID: 0000-0001-5242-302X (D. Dobriy), 0000-0001-5670-1146 (A. Polleres)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

The field of web platform discovery, which involves the systematic identification of websites, is a research priority for discovering Linked Open Data (LOD) [1] and assessing the factual extent of the Semantic Web. The subject intersects with web crawling, an automated process concerned with the traversal and extraction of web content, and with Search Engine scraping.

Investigations in the field [2] have presented scalable algorithms for pattern mining, significantly enhancing the efficiency of media-type focused crawling. Additionally, efforts like MultiCrawler have proposed pipeline architectures for more effective crawling and indexing of Semantic Web data [3]. Other notable tools, such as Apache Any23 (https://any23.apache.org/), offer extraction libraries and web services that transform structured data from HTML and other web documents into more useful formats. The relevance of such tools is illustrated by services like Portalwatch [4] and WikiApiary (https://wikiapiary.com), which monitor the deployment and usage of specific Open Data and Wiki platforms on the web. Finally, owing to the inherent cost of platform and dataset discovery, services like LOD Laundromat [5] and LOD Cloud (http://lod-cloud.net) exist to provide an entry point to, and a catalogue of, linked datasets.

In the case of WikiApiary, the service provides a comprehensive repository which tracks and catalogues Wikis and their respective metadata on the web. Most notably, WikiApiary also collects Semantic Wikis, i.e. Semantic MediaWiki (SMW), Wikibase and Cargo instances, presenting an ample and under-researched facet of LOD. Despite its extensive coverage and its reliance on bots ("bees") to keep the metadata up to date, the catalogue is manually curated through community submissions, which can introduce gaps in data collection.
Another specific case, Portalwatch, an open-source project that aims to collect and monitor Open Data Portals, including portal metadata, shares the same limitation. This constraint underscores the need for automated discovery tools to ensure a more exhaustive enumeration and characterisation of web platforms such as Semantic Wikis or Open Data Portals. The proposed tool aims to enhance and ease web platform discovery in this area.

The remainder of the paper is structured as follows: Section 2 introduces the architecture and features of the tool, Section 3 provides an overview of two successful use cases of Crawley, and Section 4 draws conclusions and discusses potential directions for future work.

2. Architecture and Features

Figure 1: Crawley Architecture Diagram

Crawley is an open-source, Python-based command-line tool designed to streamline the discovery and validation of specific technological platforms. It is currently available together with documentation on GitHub (http://purl.org/crawley) under a CC-BY 4.0 license (http://creativecommons.org/licenses/by/4.0/). Figure 1 illustrates the high-level architecture of the tool.

The tool builds on various Search Engine APIs (SERP API, Bing API) as a reliable solution for Search Engine querying. While the use of these APIs subjects the tool to rate limits, it supports a multi-user approach to mitigate them (cf. the tool's documentation at http://purl.org/crawley/readme). Searches can thus be performed with Google, Bing, Yandex, Yahoo, DuckDuckGo, Baidu and Naver. The user initiates a search event, which is defined by one of these Search Engines and the query itself. The tool then queries the Search Engine, paginating through the results until they are exhausted, prints the number of unique sites found, giving the user a heuristic estimate of how productive a given query-Search Engine combination is, and aggregates the search results in the ./results folder.

Although queries can be formulated freely, we recommend using a subset of the markers defined in the paragraph below that are likely to be indexed by Search Engines (i.e., text snippets and image annotations, but not code excerpts). We observe a trade-off pattern: more general queries lead to more results but fewer validation hits in the end, whereas more specific queries lead to fewer results but a larger proportion of hits, which gives merit to formulating both general and specific queries.

The result/platform validation process with Crawley begins with the user identifying text or code snippets commonly found on sites using a particular technology of interest, such as "Powered by Semantic MediaWiki", "CKAN API" or "Socrata API", as well as URL components commonly used by a specific platform (e.g., .../dataset). We designate these as markers. Having identified possible markers and defined them in the configuration, the user can initiate a validation phase, in which the tool requests the HTML contents of the collected search results and matches them against the markers, returning the total number of validation hits for each platform type and producing a validation report.

Finally, as a full-fledged crawler, the tool is able to recursively extract further links from validated sites, as illustrated by the sketch below. This is a useful feature which relies on the fact that similar platforms often contain hyperlinks to each other.
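To make the validation and recursive crawling phases more concrete, the following is a minimal Python sketch of marker-based validation and link extraction. It is illustrative only and does not reproduce Crawley's actual implementation: the marker strings are taken from the examples above, while the function names, the example candidate URL, the timeout parameter and the choice of the requests and BeautifulSoup libraries are assumptions made for the sketch.

```python
# Illustrative sketch only: marker-based validation and recursive link
# extraction as described above. Not Crawley's actual implementation;
# requests/BeautifulSoup and all function names are assumptions.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Example markers taken from the paper; a real configuration may differ.
MARKERS = {
    "Semantic MediaWiki": ["Powered by Semantic MediaWiki"],
    "CKAN": ["CKAN API"],
    "Socrata": ["Socrata API"],
}

def fetch_html(url: str, timeout: int = 10) -> str:
    """Request the HTML contents of a candidate site (empty string on failure)."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.RequestException:
        return ""

def validate(html: str, markers=MARKERS) -> list[str]:
    """Return the platform types whose markers appear in the page."""
    return [platform for platform, snippets in markers.items()
            if any(snippet in html for snippet in snippets)]

def extract_links(base_url: str, html: str) -> set[str]:
    """Collect absolute hyperlinks from a validated site for further crawling."""
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}

if __name__ == "__main__":
    # Hypothetical candidate collected from a Search Engine query.
    candidates = ["https://www.semantic-mediawiki.org/wiki/Semantic_MediaWiki"]
    for url in candidates:
        html = fetch_html(url)
        hits = validate(html)
        if hits:
            print(f"{url} -> {hits}")
            # Links from validated sites feed back into the pipeline.
            print(f"{len(extract_links(url, html))} further links collected")
```

In Crawley itself these steps are driven by the command-line interface and the user-defined marker configuration rather than hard-coded values.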
The extracted links are then treated as search results in the pipeline and can be validated further, whereby previous HTML collection and validation events, as well as their results, are cached for efficiency.

3. Use Cases

This section presents two successful use cases of Crawley, motivated by the need to discover and catalogue a broader range of Semantic Web data: Semantic Wikis and Open Data Portals.

3.1. Semantic Wiki Discovery

The first use case revolves around the discovery of Semantic Wikis, specifically Semantic MediaWiki instances not captured by WikiApiary. To this end, a search (without recursive link collection) and validation have been performed with Crawley using the Bing Search Engine. A set of custom markers has been identified in association with Semantic Wikis: