=Paper=
{{Paper
|id=Vol-1222/paper9
|storemode=property
|title=Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence
|pdfUrl=https://ceur-ws.org/Vol-1222/paper9.pdf
|volume=Vol-1222
|dblpUrl=https://dblp.org/rec/conf/mir/TsikrikaMVK14
}}
==Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence==
Focussed Crawling of Environmental Web Resources: A Pilot Study on the Combination of Multimedia Evidence

Theodora Tsikrika, Anastasia Moumtzidou, Stefanos Vrochidis, Ioannis Kompatsiaris
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
{theodora.tsikrika, moumtzid, stefanos, ikom}@iti.gr

ABSTRACT

This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded pages. Given that air quality measurements, and particularly air quality forecasts, are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, we propose the combination of textual and visual evidence for predicting the benefit of fetching an unvisited Web resource. First, text classification is applied to select the relevant hyperlinks based on their anchor text, a surrounding text window, and URL terms. Further hyperlinks are selected by combining their text classification score with an image classification score that indicates the presence of heatmaps in their source page. A pilot evaluation indicates that the combination of textual and visual evidence results in improvements in the crawling precision over the use of textual features alone.

Categories and Subject Descriptors: H.3 [Information Systems]: Information Storage and Retrieval

General Terms: Algorithms, Performance, Design, Experimentation

Keywords: focussed crawling, environmental data, link context, image classification, heatmaps

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: S. Vrochidis, K. Karatzas, A. Karpinnen, A. Joly (eds.): Proceedings of the International Workshop on Environmental Multimedia Retrieval (EMR 2014), Glasgow, UK, April 1, 2014, published at http://ceur-ws.org

1. INTRODUCTION

Environmental conditions, such as the weather, air quality, and pollen concentration, are considered among the factors with a strong impact on the quality of life, since they directly affect human health (e.g., allergies and asthma), a variety of human outdoor activities (ranging from agriculture to sports and travel planning), as well as major environmental issues (such as the greenhouse effect). In order to support both scientists in forecasting environmental phenomena and people in everyday action planning, there is a need for services that provide access to information related to environmental conditions that is gathered from several sources, with a view to obtaining reliable data. Monitoring stations established by environmental organisations and agencies typically perform such measurements and make them available, most commonly, through Web resources, such as pages, sites, and portals. Assembling and integrating information from several such providers is a major challenge, which requires, as a first step, the automatic discovery of Web resources that contain environmental measurement data; this can be cast as a domain-specific search problem.

Domain-specific search is mainly addressed by techniques that fall into two categories: (i) the submission of domain-specific queries to a general-purpose search engine followed by post-retrieval filtering, and (ii) focussed crawling. Past research in the environmental domain (e.g., [12]) has mainly applied techniques from the first category, while the effectiveness of focussed crawlers for environmental Web resources has not been previously investigated.

Focussed (or topical) crawlers exploit the graph structure of the Web for the discovery of resources about a given topic.
Starting from one or more seed URLs on the topic, they download the Web pages addressed by them and mine their content so as to extract the hyperlinks contained therein and select the ones that would lead them to pages relevant to the topic. This process is iteratively repeated until a sufficient number of pages is fetched (i.e., downloaded). Predicting the benefit of fetching an unvisited Web resource is a major challenge, since crawlers need to estimate its relevance to the topic at hand based solely on evidence obtained from the already downloaded pages. To this end, state-of-the-art approaches (see [13] for a review) adopt classifier-guided crawling strategies based on supervised machine learning; the hyperlinks are classified based on their local context, such as their anchor text and the textual content surrounding them in the parent page from which they were extracted, as well as on global evidence associated with the entire parent page, such as its textual content or its hyperlink structure.

Figure 1: Examples of environmental Web resources providing air quality measurements and forecasts (left-to-right): http://gems.ecmwf.int/, http://www.colorado.gov/airquality/air quality.aspx, http://www.sparetheair.org/Stay-Informed/Todays-Air-Quality/Five-Day-Forecast.aspx, http://airnow.gov.

This work investigates focussed crawling for the automatic discovery of environmental Web resources, in particular those providing air quality measurements and forecasts; see Figure 1 for some characteristic examples. Such resources report the concentration values of several air pollutants, such as sulphur dioxide (SO2), nitrogen oxides and dioxide (NO+NO2), thoracic particles (PM10), fine particles (PM2.5), and ozone (O3), measured or forecast for specific regions [9]. Empirical studies [8, 17, 11] have revealed that such measurements, and particularly air quality forecasts, are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, i.e., graphical representations of matrix data with colors representing pollutant concentrations over geographically bounded regions; see Figure 2 for an example.

Figure 2: Heatmap example extracted from http://silam.fmi.fi/.

This motivates us to form the hypothesis that the presence of a heatmap in a page already estimated to be an air quality resource indicates that it is indeed highly relevant to the topic. Therefore, if such a page has already been downloaded by a crawler focussed on air quality, it would be a useful source of global evidence for the selections to be subsequently performed by such a focussed crawler. To this end, this work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of (i) textual evidence from its local context and (ii) global visual evidence indicating the presence of a heatmap in its parent page. This is achieved by the late fusion of text and image classification confidence scores obtained by supervised machine learning methods based on Support Vector Machines (SVMs).

The main contribution of this work is a novel focussed crawling approach that takes into account multimedia (textual + visual) evidence for predicting the benefit of fetching an unvisited Web resource based on the combination of text and image classifiers. State-of-the-art classifier-guided focussed crawlers rely mainly on textual evidence [13] and, to the best of our knowledge, visual evidence has not been previously considered in this context.
The proposed classifier-guided focussed crawler is evaluated in the domain of air quality environmental Web resources, and the experimental results of our pilot study indicate improvements in the crawling precision when incorporating visual evidence, over the use of textual features alone.

The remainder of this paper is structured as follows. Section 2 discusses related work. Section 3 presents the proposed focussed crawling approach, Section 4 describes the evaluation setup, and Section 5 reports and analyses the experimental results. Section 6 concludes this work and outlines future research directions.

2. RELATED WORK

Focussed crawling techniques have been researched since the early days of the Web [7]. Based on the 'topical locality' observation that most Web pages link to other pages that are similar in content [6], focussed crawlers attempt to estimate the benefit of following a hyperlink extracted from an already downloaded page by mainly exploiting the (i) local context of the hyperlink and (ii) global evidence associated with its parent page.

Previous research has defined local context in textual terms as the lexical content that appears around a given hyperlink in its parent page. It may correspond to the anchor text of the hyperlink, a text window surrounding it, the words appearing in its URL, and combinations thereof. Virtually all focussed crawlers [7, 1, 20, 19, 15, 16, 13] use such textual evidence in one form or another. Global evidence, on the other hand, corresponds either to textual evidence, typically the lexical content of the parent page [16], or to hyperlink evidence, such as the centrality of the parent page within its neighbouring subgraph [1]. A systematic study of the effectiveness of various definitions of link context has found that crawling techniques that exploit terms both in the immediate vicinity of a hyperlink, as well as in its entire parent page, perform significantly better than those depending on just one of those cues [16].

Earlier focussed crawlers (e.g., [5]) estimated the relevance of the hyperlinks pointing to unvisited pages by computing the textual similarity of the hyperlinks' local context to a query corresponding to a textual representation of the topic at hand; this relevance score could also be smoothed by the textual similarity of the parent page to the same query. State-of-the-art focussed crawlers, though, use supervised machine learning methods to decide whether a hyperlink is likely to lead to a Web page on the topic or not [13].
Classifier-guided focussed crawlers, introduced by Chakrabarti et al. [1], rely on models typically trained using the content of Web pages relevant to the topic; positive samples are usually obtained from existing topic directories such as the Open Directory Project (ODP, http://www.dmoz.org/). A systematic evaluation of the relative merits of various classification schemes has shown that SVMs and Neural Network-based classifiers perform equally well in a focussed crawling application, with the former being more efficient, while Naive Bayes is a weak choice in this context [15]. This makes SVMs the classification scheme of choice in guiding focussed crawlers.

Focussed crawling has not really been previously explored in the environmental domain. The discovery of environmental Web resources has previously been addressed mainly through the submission of domain-specific queries to general-purpose search engines, followed by the application of a post-retrieval classification step for improving precision [12, 10]. The queries were generated using empirical information, including the incorporation of geographical terms [10], and were expanded using 'keyword spices' [14], i.e., a Boolean expression of domain-specific terms corresponding to the output of a decision tree trained on an appropriate corpus [12]. Post-retrieval classification was performed using SVMs trained on textual features extracted from a training corpus [12]. Such approaches are complementary to the discovery of Web resources using focussed crawlers, and hybrid approaches that combine the two techniques in a common framework are a promising research direction [11].

3. MULTIMEDIA FOCUSSED CRAWLING

Figure 3: Multimedia focussed crawling.

This work proposes a classifier-guided focussed crawling approach for the discovery of environmental Web resources providing air quality measurements and forecasts. To this end, it estimates the relevance of a hyperlink to an unvisited resource based on the combination of its local context with global evidence associated with its parent page. Local context refers to the textual content appearing in the vicinity of the hyperlink in the parent page. Motivated by the frequent occurrence of heatmaps in such Web resources, we consider the presence of a heatmap in a parent page as global evidence for its high relevance to the topic.

An overview of the proposed focussed crawling approach is depicted in Figure 3. First, the seed pages are added to the list of URLs to fetch. In each iteration, a URL is picked from the list and the page corresponding to this URL is fetched (i.e., downloaded) and parsed to extract its hyperlinks. In the simple case that the focussed crawler estimates the relevance of a hyperlink pointing to an unvisited page p based only on its local context, the decision to fetch p depends solely on the output of an appropriately trained text classifier. Therefore, a page is fetched if the confidence score s of the text-based classifier is above an experimentally set threshold t1.

However, there are cases in which the local context is not sufficient to effectively represent relevant hyperlinks, leading them to obtain low confidence scores below the set threshold t1, and thus to not being fetched by the focussed crawler. In this case, global evidence can be used for adjusting the estimate of the hyperlink's relevance. This is motivated by the 'topical locality' phenomenon of Web pages linking to other pages that are similar in content; therefore, if there is strong evidence of the parent page's relevance, then the relevance estimates of its children pages should be adjusted accordingly.

As mentioned before, the presence of heatmaps in a Web resource already assumed to be an air quality resource is a strong indication that it is indeed highly relevant to the topic. Therefore, we propose the consideration of heatmap presence in the parent page as global evidence to be used for adjusting the relevance estimate of hyperlinks with text-based confidence scores below the required threshold t1 (in practice, a lower bound threshold t2 is also set; this threshold is also experimentally tuned). In particular, the relevance estimate of each hyperlink is adjusted to correspond to the late fusion of a text and a heatmap classifier, score = f(text_classifier, heatmap_classifier), and the page is fetched if its score ≥ t1. In our case, a binary heatmap classifier is considered and the fusion function f is set to correspond to max. This results in a page being fetched if either its text-based confidence score is above t1, or if its text-based confidence score is above t2 (t2 < t1) and its parent page contains at least one heatmap; a sketch of this decision rule is given below.
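To make the crawl loop of Figure 3 and the max-based fusion rule concrete, the following is a minimal Python sketch. It is an illustration only, not the actual Apache Nutch implementation described in Section 4.4; the injected callables fetch_page, extract_links, text_score, and has_heatmap are hypothetical stand-ins for the components described in this section.

```python
from collections import deque

def focussed_crawl(seeds, fetch_page, extract_links, text_score,
                   has_heatmap, t1, t2, max_pages=1000):
    """Crawl loop sketch: follow a hyperlink if its text-based confidence
    score s satisfies s >= t1, or s >= t2 (with t2 < t1) and its parent
    page contains at least one heatmap; this is the late fusion with
    f = max over a binary heatmap classifier."""
    frontier = deque(seeds)            # list of URLs to fetch, seeded first
    fetched = set()
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        if url in fetched:
            continue
        page = fetch_page(url)         # fetch (i.e., download) and parse
        fetched.add(url)
        heatmap = has_heatmap(page)    # global visual evidence of the parent
        for link in extract_links(page):
            s = text_score(link)       # local textual evidence of the link
            if s >= t1 or (s >= t2 and heatmap):
                frontier.append(link)
    return fetched
```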
Next, the text and heatmap classifiers employed in this work are described.

3.1 Text-Based Link Classification

Text-based link classification is performed using a supervised machine learning approach based on SVMs and a variety of textual features extracted from the hyperlink's local context. SVMs are applied due to their demonstrated effectiveness in similar applications [15].

Each hyperlink is represented using textual features extracted from the following fields:

- a: the anchor text of the hyperlink;
- h: the terms extracted from the URL of the hyperlink; string sequences are split on punctuation marks, and common URL extensions (e.g., com) and prefixes (e.g., www) are removed;
- s: the terms extracted from a text window of 50 characters surrounding the hyperlink; this text window does not contain the anchor text of adjacent links (i.e., the window stops as soon as it encounters another link);
- so: the terms extracted from a text window of 50 characters surrounding the hyperlink when overlap with the anchor text of adjacent links is allowed.

Combinations of the above lead to the following five representations corresponding to concatenations of the respective fields: a+s, a+so, a+h, a+h+s, and a+h+so.
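These field definitions translate directly into simple term-extraction routines. The following Python sketch shows one plausible reading of them; the stop-part list and the span-based handling of adjacent links are our assumptions, not the paper's exact implementation.

```python
import re

# Assumed list of common URL prefixes/extensions to drop for field h.
STOP_PARTS = {"www", "com", "org", "net", "html", "htm", "php", "aspx",
              "http", "https"}

def url_terms(url):
    """Field h: split the URL on punctuation and drop common
    extensions and prefixes (e.g., 'com', 'www')."""
    parts = re.split(r"[^A-Za-z0-9]+", url.lower())
    return [p for p in parts if p and p not in STOP_PARTS]

def window_terms(page_text, anchor_span, size=50, overlap=False, link_spans=()):
    """Fields s / so: terms from a 50-character window around the hyperlink;
    with overlap=False the window is truncated at adjacent links."""
    start = max(0, anchor_span[0] - size)
    end = min(len(page_text), anchor_span[1] + size)
    if not overlap:
        for a, b in link_spans:          # stop at neighbouring anchors
            if b <= anchor_span[0]:
                start = max(start, b)
            if a >= anchor_span[1]:
                end = min(end, a)
    return re.findall(r"[a-z0-9]+", page_text[start:end].lower())

def link_representation(anchor_text, url, page_text, anchor_span,
                        link_spans=(), scheme="a+so"):
    """Concatenate the fields requested by the scheme (a, a+s, a+so,
    a+h, a+h+s, a+h+so) into a single bag of terms."""
    terms = re.findall(r"[a-z0-9]+", anchor_text.lower())   # field a
    if "h" in scheme:
        terms += url_terms(url)
    if "so" in scheme:
        terms += window_terms(page_text, anchor_span, overlap=True)
    elif "s" in scheme:
        terms += window_terms(page_text, anchor_span, link_spans=link_spans)
    return terms
```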
In the training phase, a list of positive and negative samples is collected first, so as to build a vocabulary for representing the samples in the textual feature space and also for training the model. Each sample corresponds to a hyperlink pointing to a Web page on air quality measurements and forecasts and its associated a+so representation. The vocabulary is built by accumulating all the terms from the a+so representations of the samples and eliminating all stopwords. This representation was selected so as to lead to a richer feature space, compared to the sparser a, s, and a+s representations, while also remaining relatively noise free compared to the a+h+s and a+h+so representations, which are likely to contain more noise given the difficulties in successfully parsing URLs.

Each sample is represented in the textual feature space spanned by the created vocabulary using a tf.idf weighting scheme, tf.idf(t, d) = tf(t, d) × log(n / df(t)), where tf(t, d) is the frequency of term t in sample d, and log(n / df(t)) is the inverse document frequency idf(t) of term t in the collection of n samples, with df(t) being the number of samples containing that term. Furthermore, a feature representing the number of geographical terms in the sample's a+so representation is added, given the importance of such terms in the environmental domain [12]. To avoid overestimating their effect, such geographical terms were previously removed from the vocabulary that was built. The SVM classifier is built using an RBF kernel, and 5-fold cross-validation is performed on the training set to select the class weight parameters.

In the testing phase, each sample is represented as a feature vector based on the tf.idf of the terms extracted from one of the proposed representation schemes (a, a+s, a+so, a+h, a+h+s, or a+h+so) and the number of geographical terms within the same representation. The text-based classification score of each hyperlink is then obtained by employing the classifier on the feature vector and corresponds to a confidence value that reflects the distance of the testing sample to the hyperplane.

Our model was trained using 711 samples (100 positive, 611 negative). Each sample corresponds to a hyperlink pointing to a page providing air quality measurements and forecasts; these hyperlinks were extracted from 26 pages about air quality obtained from ODP and previous empirical studies conducted by domain experts in the context of the project PESCaDO (Personalised Environmental Service Configuration and Delivery Orchestration, http://www.pescado-project.eu/). It should be noted that both the hyperlinks and their parent pages are different from the seed set used in the evaluation of the focussed crawler (see Section 4). The generated lexicon consists of 207 terms, with the following being the 10 most frequent in the training corpus: days, ozone, air, data, quality, today, forecast, yesterday, raw, and current. The geographical lexicon consists of 3,625 terms obtained from a geographical database.
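A compact sketch of this feature construction follows, assuming term lists produced as in Section 3.1; the helper names are ours, and the resulting vectors would feed an RBF-kernel SVM (the paper's classifiers are built with Weka and LIBSVM, see Section 4.4).

```python
import math
from collections import Counter

def document_frequencies(samples):
    """df(t): number of training samples containing term t."""
    df = Counter()
    for terms in samples:
        df.update(set(terms))
    return df

def build_vocabulary(samples, stopwords, geo_terms):
    """Vocabulary over the a+so representations, with stopwords removed;
    geographical terms are excluded and counted as a separate feature."""
    vocab = set()
    for terms in samples:
        vocab.update(t for t in terms
                     if t not in stopwords and t not in geo_terms)
    return sorted(vocab)

def feature_vector(terms, vocab, df, n, geo_terms):
    """tf.idf(t, d) = tf(t, d) * log(n / df(t)) over the vocabulary, plus
    one extra feature counting the geographical terms in the sample."""
    tf = Counter(terms)
    vec = [tf[t] * math.log(n / df[t]) if df.get(t) else 0.0 for t in vocab]
    vec.append(sum(1 for t in terms if t in geo_terms))   # geo-term count
    return vec
```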
3.2 Heatmap Recognition

Heatmap recognition is performed by applying an approach recently developed by our research group [10]. That investigation on heatmap binary classification using SVMs and a variety of visual features indicated that, overall, the MPEG-7 [3] descriptors demonstrated a slightly better performance than the other tested visual features (SIFT [4] and the Adaptive Hierarchical Density Histogram (AHDH) [18]).

In particular, the following three extracted MPEG-7 features that capture color and texture aspects of human perception were the most effective:

- Scalable Color Descriptor (SC): a Haar-transform based encoding scheme that measures the color distribution over an entire image, quantized uniformly to 256 bins;
- Edge Histogram Descriptor (EH): a scale invariant visual texture descriptor that captures the spatial distribution of edges; it involves the division of the image into 16 non-overlapping blocks, with edge information calculated for each block in five edge categories;
- Homogeneous Texture Descriptor (HT): a descriptor of the directionality, coarseness, and regularity of patterns in images, based on a filter bank approach that employs scale and orientation sensitive filters.

Their early fusion (SC-EH-HT), as well as the feature EH on its own, produced the best results when employing an SVM classifier with an RBF kernel. The evaluation was performed by training the classifier on a dataset of 2,200 images (600 relevant, i.e., heatmaps) and testing it on a dataset of 2,860 images (1,170 heatmaps); both datasets are available at http://mklab.iti.gr/project/heatmaps.

In this work, both the EH and the SC-EH-HT models trained on the first dataset are employed. An image is classified as a heatmap if at least one of these classifiers considers it to be a heatmap, i.e., a late fusion approach based on a logical OR is applied.
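The OR-based late fusion of the two heatmap models can be expressed in a few lines. This is a sketch assuming scikit-learn-style binary classifiers and a hypothetical descriptors() helper returning the three MPEG-7 vectors; the paper's actual visual classifier is based on LIBSVM [2].

```python
def early_fusion(sc, eh, ht):
    """Early fusion SC-EH-HT: concatenate the three MPEG-7 descriptor
    vectors into a single feature vector."""
    return list(sc) + list(eh) + list(ht)

def is_heatmap(image, eh_clf, fused_clf, descriptors):
    """Late fusion by logical OR: the image is a heatmap if either the
    EH model or the SC-EH-HT model classifies it as one."""
    sc, eh, ht = descriptors(image)   # MPEG-7 SC, EH, HT feature vectors
    return bool(eh_clf.predict([eh])[0]) or \
           bool(fused_clf.predict([early_fusion(sc, eh, ht)])[0])
```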
Table 1: List of seed URLs ('X' marks the seeds containing at least one heatmap).

     URL                                                                heatmap present
  1. http://aircarecolorado.com/
  2. http://airnow.gov/                                                 X
  3. http://db.eurad.uni-koeln.de/en/                                   X
  4. http://gems.ecmwf.int/                                             X
  5. http://maps.co.mecklenburg.nc.us/website/airquality/default.php    X
  6. http://uk-air.defra.gov.uk/
  7. http://www.baaqmd.gov/The-Air-District.aspx
  8. http://www.eea.europa.eu/
  9. http://www.gmes-atmosphere.eu/
 10. http://www.londonair.org.uk/LondonAir/Default.aspx                 X

4. EVALUATION

A pilot study is performed for evaluating the performance of the proposed focussed crawling approach. A set of 10 seeds (listed in Table 1), different from the URLs used when training the classifiers, was selected similarly to before, i.e., using ODP and the outcomes of empirical studies conducted by domain experts in the context of the project PESCaDO. Half of them contain at least one heatmap. Starting from these 10 seeds, a crawl at depth 1 is performed. A total of 807 hyperlinks are extracted from these 10 seeds, and several focussed crawling approaches are applied for deciding which ones to fetch. These are evaluated in the following two sets of experiments.

4.1 Experiments

Experiment 1: This experiment examines the relative merits of the different text-based representations of hyperlinks (i.e., a, a+s, a+so, a+h, a+h+s, and a+h+so). In this case, a text-based classifier-guided focussed crawl is applied for each representation, and a page is fetched if its text-based confidence score is above a threshold t1. Experiments are performed for t1 values ranging from 0.0 to 0.9 at step 0.1. When t1 = 0.0, the crawl corresponds to a breadth-first search where all hyperlinks are fetched and no focussed crawling is performed.

Experiment 2: This experiment investigates the effectiveness of incorporating multimedia evidence in the form of heatmaps in the crawling process. In this case, a page pointed to by a hyperlink is fetched if the hyperlink's text-based confidence score is above t1, or if its text-based confidence score is above t2 (t2 < t1) and its parent page contains at least one heatmap. The text-based confidence scores are obtained from the best performing classifier in Experiment 1. Experiments are performed for t1 and t2 values ranging from 0.0 to 0.9 at step 0.1, while maintaining t2 < t1. These experimental results are compared against two baselines: (i) the results of the corresponding text-based focussed crawler for threshold t1, and (ii) the results of the corresponding text-based focussed crawler for threshold t2.

To determine the presence of a heatmap in the parent page of a hyperlink, the page is parsed (since it is already downloaded) and the hyperlinks pointing to images are compiled into a list. The crawler iteratively downloads each image in the list, extracts its visual features, and applies the heatmap classification until a heatmap is recognised or a maximum number of images has been downloaded from the page (set to 20 in our experiments); a sketch of this procedure is given below.

In both experiments, when a hyperlink appears more than once within a seed page, only the one with the highest score is taken into consideration for evaluation purposes.
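The per-page heatmap check described above might look as follows; a minimal sketch, assuming a download_image() fetcher and the is_heatmap() fusion from Section 3.2 are available.

```python
def parent_has_heatmap(image_urls, download_image, is_heatmap, max_images=20):
    """Scan the image links of an already downloaded parent page,
    classifying each image in turn until a heatmap is recognised or
    max_images (20 in our experiments) have been downloaded."""
    for url in image_urls[:max_images]:
        image = download_image(url)          # may return None on failure
        if image is not None and is_heatmap(image):
            return True
    return False
```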
4.2 Performance Metrics

The standard retrieval evaluation metrics of precision and recall are typically applied for assessing the effectiveness of a focussed crawler. Precision corresponds to the proportion of fetched pages that are relevant, and recall to the proportion of all relevant pages that are fetched. The latter requires knowledge of all relevant pages on a given topic, an impossible task in the context of the Web. To address this limitation, two recall-oriented evaluation techniques have been proposed [13]: (i) manually designate a few representative pages on the topic and measure what fraction of them are discovered by the crawler, and (ii) measure the overlap among independent crawls initiated from different seeds to see whether they converge on the same set of pages. Given the small scope of our study (i.e., a crawl at depth 1), these approaches are not applicable, and therefore recall is not considered in our evaluation. In addition to precision, the accuracy of the classification of the crawled outlinks is also reported.

4.3 Relevance Assessments

All 807 extracted hyperlinks were manually assessed. After applying some light URL normalisation (e.g., deleting trailing slashes) and removing duplicates, 689 unique URLs remain. These correspond both to internal (within-site) and to external links that were assessed using the following three-point relevance scale:

- (highly) relevant: Web resources that provide air quality measurements and forecasts. These data should either be visible on the page or should appear after selecting a particular value from the options (e.g., region, pollutant, time of day, etc.) provided in drop-down menus.

- partially relevant: Web resources that are about air quality measurements and forecasts, but do not provide actual data. Examples include Web resources that list monitoring sites and the pollutants being measured, explain what such measurements mean, describe methods, approaches, and research for measuring, validating, and forecasting air quality data, or provide links to components, systems, and applications that measure air quality.

- non-relevant: Web resources that are not relevant to air quality measurements and forecasts, including resources that are about air quality and pollution in general, discussing, for instance, its causes and effects.

Overall, our crawled dataset contains 232 (33.7%) highly relevant pages, 51 (7.4%) partially relevant, and 406 (58.9%) non-relevant ones.

A closer inspection revealed that 162 (69.8%) of the highly relevant pages were all crawled from seed no. 2 in Table 1 (http://airnow.gov/). These correspond to internal links pointing to pages with air quality measurements/forecasts, each regarding a different U.S. region. This, in conjunction with the fact that all these links obtained really high scores (over 0.9) from our text classifier, led us to remove them from further consideration, as they would significantly skew the evaluation results. Therefore, the evaluation was performed only for the pages crawled from the nine remaining seeds, and these are the results reported in Section 5. (It should be noted that http://airnow.gov/ appears in the list of our crawled pages even when removed from the seed list, since it is linked from other seed pages. However, since crawling is performed at depth 1, its own outlinks are not considered any further.) Starting from the 9 seeds, our crawled dataset contains 526 URLs: 70 (13.3%) highly relevant pages, 50 (9.5%) partially relevant, and 406 (77.2%) non-relevant ones.

To apply the performance metrics presented above, these multiple grade relevance assessments are mapped into binary relevance judgements in two different ways, depending on whether we are strictly interested in discovering resources containing air quality data, or whether we would also be interested in information about air quality measurements and forecasts. In particular, two mappings are considered:

- strict: only highly relevant Web resources are considered relevant, and the rest (partially relevant and non-relevant) as non-relevant;

- lenient: both highly relevant and partially relevant Web resources are considered relevant.

The distributions of relevance assessments in these two cases are listed in Table 2.

Table 2: Relevance assessment distributions when the 3-point scale judgements are mapped to binary.

                  Strict          Lenient
  Relevant        70 (13.3%)      120 (22.8%)
  Non-Relevant    456 (86.7%)     406 (77.2%)
  All             526 (100.0%)    526 (100.0%)

4.4 Implementation

Our implementation is based on Apache Nutch (http://nutch.apache.org/), a highly extensible and scalable open source Web crawler software project. To convert it to a focussed crawler, its parser was modified so as to filter the links being fetched based on our proposed approach. The text-based classifier was implemented using the libraries of the Weka machine learning software (http://www.cs.waikato.ac.nz/ml/weka/), while the implementation of the visual classifier was based on the LIBSVM [2] library.
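For reference, the metrics of Section 4.2 combined with the strict/lenient mappings of Section 4.3 amount to the following small computation; a sketch in which the 3-point grades are encoded as 2/1/0 by assumption.

```python
def binarise(grade, mapping="strict"):
    """Map the 3-point scale (2 = highly relevant, 1 = partially relevant,
    0 = non-relevant) to a binary judgement."""
    return grade == 2 if mapping == "strict" else grade >= 1

def precision(fetched_grades, mapping="strict"):
    """Proportion of fetched pages that are relevant."""
    rel = [binarise(g, mapping) for g in fetched_grades]
    return sum(rel) / len(rel) if rel else 0.0

def accuracy(fetch_decisions, grades, mapping="strict"):
    """Proportion of crawled outlinks whose fetch/skip decision agrees
    with their binarised relevance assessment."""
    correct = sum(d == binarise(g, mapping)
                  for d, g in zip(fetch_decisions, grades))
    return correct / len(grades) if grades else 0.0
```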
5. RESULTS

Experiment 1: The results of the first experiment, which evaluates the effectiveness of the different textual representations employed by the text-based focussed crawler, are depicted in Figures 4 and 5, when applying strict and lenient relevance assessments, respectively.

Figure 4: Precision and accuracy of the focussed crawl for each text-based link classification method (a+h, a+s, a+so, a+h+s, a+h+so) for threshold t1 ∈ {0, 0.1, ..., 0.9} when strict relevance assessments are employed.

Figure 5: Precision and accuracy of the focussed crawl for each text-based link classification method (a+h, a+s, a+so, a+h+s, a+h+so) for threshold t1 ∈ {0, 0.1, ..., 0.9} when lenient relevance assessments are employed.

The a+s classifier-guided focussed crawler achieves the highest overall precision, both for the strict and the lenient cases, and for t1 = 0.4, indicating the benefits of combining the anchor text with the terms obtained from a non-overlapping text window. It also achieves the highest accuracy, which is equal to that of the a+h and a+h+s classifiers; these two classifiers, though, have slightly lower precision compared to that of a+s. This indicates that the URL is potentially a useful source of evidence and that the application of more advanced techniques for extracting terms from a URL is probably required for reaching its full potential. The a+so and a+h+so classifiers are the least effective for lower t1 values, indicating that the additional terms present in the overlapping text window introduce noise that leads to the misclassification of non-relevant hyperlinks.
Furthermore, all focussed crawlers improve upon the precision obtained for t1 = 0.0, which corresponds to general-purpose crawling. As expected, the absolute values of precision are much higher in the lenient case compared to the strict one.

Experiment 2: The second experiment aims to provide insights into the feasibility and potential benefits of incorporating multimedia evidence in the form of heatmaps in the crawling process. To this end, it combines a+s, the best performing text-based classifier from the first experiment, with the results of the heatmap classifier. First, the results of the heatmap classification are presented.

Each of the nine seeds contains 15 images on average, as identified by our parser. On average, 8 images are downloaded from each seed before a heatmap is found or the image list ends. Out of the 75 downloaded images, 74 were correctly classified, with 3 being heatmaps. This means that 8 of the 9 seeds were classified accurately for the presence of heatmaps in them (all apart from seed no. 10 in Table 1). This is probably due to the difficulty in parsing this specific Web resource and also in recognising its images as heatmaps, as they correspond to non-typical heatmaps, different to the ones in our training set. On average, 10 seconds were required per Web resource for the downloading, feature extraction, and classification of its images; however, this overhead could be reduced by applying parallelisation.

Tables 3 and 4 present the results of the second experiment, when applying strict and lenient relevance assessments, respectively, for t1 and t2 values ranging from 0.0 to 0.9 at step 0.1, while maintaining t2 < t1. The results are compared against the two baselines listed in each table's last column and last row, respectively; values exceeding both baselines correspond to improvements.

Table 3: Precision of the focussed crawler that combines the a+s text-based link classifier with the heatmap classifier for thresholds t1 ∈ {0.1, ..., 0.9} and t2 ∈ {0.0, ..., 0.8}, when strict relevance assessments are employed. The last column is the text-based baseline a+s (fetch if s ≥ t1); the last row is the text-based baseline a+s (fetch if s ≥ t2).

  t1 \ t2:  0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8   | baseline (s ≥ t1)
  0.1       0.215                                                         | 0.294
  0.2       0.213  0.314                                                  | 0.296
  0.3       0.206  0.299  0.284                                           | 0.292
  0.4       0.214  0.346  0.340  0.354                                    | 0.327
  0.5       0.215  0.353  0.347  0.362  0.333                             | 0.333
  0.6       0.214  0.362  0.356  0.372  0.341  0.333                      | 0.324
  0.7       0.222  0.405  0.400  0.421  0.385  0.382  0.364               | 0.310
  0.8       0.222  0.405  0.400  0.421  0.385  0.382  0.364  0.310        | 0.308
  0.9       0.221  0.421  0.417  0.441  0.400  0.400  0.379  0.320  0.318 | 0.300
  baseline
  (s ≥ t2)  0.137  0.294  0.296  0.292  0.327  0.333  0.324  0.310  0.308

Table 4: Precision of the focussed crawler that combines the a+s text-based link classifier with the heatmap classifier for thresholds t1 ∈ {0.1, ..., 0.9} and t2 ∈ {0.0, ..., 0.8}, when lenient relevance assessments are employed. The last column is the text-based baseline a+s (fetch if s ≥ t1); the last row is the text-based baseline a+s (fetch if s ≥ t2).

  t1 \ t2:  0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8   | baseline (s ≥ t1)
  0.1       0.360                                                         | 0.518
  0.2       0.360  0.571                                                  | 0.549
  0.3       0.350  0.552  0.537                                           | 0.554
  0.4       0.352  0.615  0.620  0.646                                    | 0.612
  0.5       0.347  0.608  0.612  0.638  0.604                             | 0.548
  0.6       0.343  0.617  0.622  0.651  0.614  0.538                      | 0.541
  0.7       0.333  0.619  0.625  0.658  0.615  0.529  0.515               | 0.483
  0.8       0.333  0.619  0.625  0.658  0.615  0.529  0.515  0.483        | 0.500
  0.9       0.328  0.632  0.639  0.676  0.629  0.533  0.517  0.480  0.500 | 0.500
  baseline
  (s ≥ t2)  0.217  0.518  0.549  0.554  0.612  0.548  0.541  0.483  0.500

The observed substantial improvements for multiple threshold values provide an indication of the benefits of incorporating visual evidence as global evidence in a focussed crawler. Consider the best performing configuration when strict relevance assessments are employed: it achieves a precision of 0.44 for t1 = 0.9 and t2 = 0.3, while the text-based focussed crawler for the same t1 = 0.9 achieves a precision of 0.30. An examination of the results shows that the improvements are due to the fact that 65% of the newly added hyperlinks, i.e., those with a text-based classification score between 0.3 and 0.9, are relevant.

6. CONCLUSIONS

This work proposed a novel classifier-guided focussed crawling approach for the discovery of environmental Web resources providing air quality measurements and forecasts that combines multimedia (textual + visual) evidence for predicting the benefit of fetching an unvisited Web resource. The results of our pilot study provide a first indication of the effectiveness of incorporating visual evidence in the focussed crawling process over the use of textual features alone.

Large-scale experiments are currently planned for fully assessing the potential benefits of the proposed multimedia focussed crawling approach, including experiments for improving the effectiveness of the textual classification by also taking into account the textual content of the entire parent page, similar to previous research [16]. Further future work includes the consideration of other types of images common in environmental Web resources, such as diagrams, simple filtering mechanisms for removing, prior to classification, small-size images that are unlikely to contain useful information (e.g., logos and layout elements), and the incorporation of additional local evidence, such as the distance of the hyperlink to the heatmap image.
Discovery of in other domains where information is commonly encoded environmental nodes in the web. In Multidisciplinary in multimedia form, such as food recipes. Information Retrieval, Proceedings of the 5th International Retrieval Facility Conference (IRFC 7. ACKNOWLEDGMENTS 2012), volume 7356 of LNCS, pages 58–72, 2012. This work was supported by MULTISENSOR (contract [13] C. Olston and M. Najork. Web crawling. Foundations no. FP7–610411) and HOMER (contract no. FP7–312388) and Trends in Information Retrieval, 4(3):175–246, projects, partially funded by the European Commission. 2010. [14] S. Oyama, T. Kokubo, and T. Ishida. Domain-specific web search with keyword spices. IEEE Transactions 8. REFERENCES on Knowledge and Data Engineering, 16(1):17–27, [1] S. Chakrabarti, M. van den Berg, and B. Dom. Jan. 2004. Focused crawling: A new approach to topic-specific [15] G. Pant and P. Srinivasan. Learning to crawl: web resource discovery. In Proceedings of the 8th Comparing classification schemes. ACM Transactions International Conference on World Wide Web, on Information Systems, 23(4):430–462, 2005. (WWW 1999), pages 1623–1640, 1999. [16] G. Pant and P. Srinivasan. Link contexts in [2] C. C. Chang and C. J. Lin. LIBSVM: a library for classifier-guided topical crawlers. IEEE Transactions support vector machines. ACM Transactions on on Knowledge and Data Engineering, 18(1):107–122, Intelligent Systems and Technology (TIST), 2(3):27, 2006. 2011. [17] R. San José, A. Baklanov, R. Sokhi, K. Karatzas, and [3] S. F. Chang, T. Sikora, and A. Puri. Overview of the J. Pérez. Computational air quality modelling. MPEG-7 standard. IEEE Transactions on Circuits Developments in Integrated Environmental and Systems for Video Technology, 11(6):688–695, Assessment, 3:247–267, 2008. 2001. [18] P. Sidiropoulos, S. Vrochidis, and I. Kompatsiaris. [4] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and Content-based binary image retrieval using the A. Zisserman. The devil is in the details: an adaptive hierarchical density histogram. Pattern evaluation of recent feature encoding methods. In Recognition, 44(4):739 – 750, 2011. Proceedings of the British Machine Vision Conference [19] T. T. Tang, D. Hawking, N. Craswell, and (BMVC 2011), pages 1–12, 2011. K. Griffiths. Focused crawling for both topical [5] J. Cho, H. Garcia-Molina, and L. Page. Efficient relevance and quality of medical information. In crawling through URL ordering. Computer Networks, Proceedings of the 14th ACM International Conference 30(1-7):161–172, 1998. on Information and Knowledge Management, (CIKM [6] B. D. Davison. Topical locality in the web. In 2005), pages 147–154, 2005. Proceedings of the 23rd Annual International ACM [20] T. T. Tang, D. Hawking, N. Craswell, and R. S. SIGIR Conference on Research and Development in Sankaranarayana. Focused crawling in depression Information Retrieval, (SIGIR 2000), pages 272–279, portal search: A feasibility study. In Proceedings of the 2000. 9th Australasian Document Computing Symposium [7] P. De Bra and R. D. J. Post. Information retrieval in (ADCS 2004), pages 1–9, 2004. the world-wide web: Making client-based searching feasible. Computer Networks and ISDN Systems, 27(2):183–192, 1994. [8] V. Epitropou, K. Karatzas, and A. Bassoukos. A method for the inverse reconstruction of environmental data applicable at the chemical weather portal. In Proceedings of the GI-Forum Symposium and Exhibit on Applied Geoinformatics, pages 58–68, 2010. 68