=Paper=
{{Paper
|id=Vol-1222/paper9
|storemode=property
|title=Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence
|pdfUrl=https://ceur-ws.org/Vol-1222/paper9.pdf
|volume=Vol-1222
|dblpUrl=https://dblp.org/rec/conf/mir/TsikrikaMVK14
}}
==Focussed crawling of environmental web resources: A pilot study on the combination of multimedia evidence==
Focussed Crawling of Environmental Web Resources: A Pilot Study on the Combination of Multimedia Evidence

Theodora Tsikrika, Anastasia Moumtzidou, Stefanos Vrochidis, Ioannis Kompatsiaris
Information Technologies Institute, Centre for Research and Technology Hellas, Thessaloniki, Greece
{theodora.tsikrika, moumtzid, stefanos, ikom}@iti.gr

ABSTRACT

This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded pages. Given that air quality measurements, and particularly air quality forecasts, are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, we propose the combination of textual and visual evidence for predicting the benefit of fetching an unvisited Web resource. First, text classification is applied to select the relevant hyperlinks based on their anchor text, a surrounding text window, and URL terms. Further hyperlinks are selected by combining their text classification score with an image classification score that indicates the presence of heatmaps in their source page. A pilot evaluation indicates that the combination of textual and visual evidence results in improvements in the crawling precision over the use of textual features alone.

Categories and Subject Descriptors: H.3 [Information Systems]: Information Storage and Retrieval

General Terms: Algorithms, Performance, Design, Experimentation

Keywords: focussed crawling, environmental data, link context, image classification, heatmaps

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: S. Vrochidis, K. Karatzas, A. Karpinnen, A. Joly (eds.): Proceedings of the International Workshop on Environmental Multimedia Retrieval (EMR 2014), Glasgow, UK, April 1, 2014, published at http://ceur-ws.org

1. INTRODUCTION

Environmental conditions, such as the weather, air quality, and pollen concentration, are considered among the factors with a strong impact on the quality of life, since they directly affect human health (e.g., allergies and asthma), a variety of human outdoor activities (ranging from agriculture to sports and travel planning), as well as major environmental issues (such as the greenhouse effect). In order to support both scientists in forecasting environmental phenomena and people in everyday action planning, there is a need for services that provide access to information related to environmental conditions that is gathered from several sources, with a view to obtaining reliable data. Monitoring stations established by environmental organisations and agencies typically perform such measurements and make them available, most commonly, through Web resources, such as pages, sites, and portals. Assembling and integrating information from several such providers is a major challenge, which requires, as a first step, the automatic discovery of Web resources that contain environmental measurement data; this can be cast as a domain-specific search problem.

Domain-specific search is mainly addressed by techniques that fall into two categories: (i) the submission of domain-specific queries to a general-purpose search engine followed by post-retrieval filtering, and (ii) focussed crawling. Past research in the environmental domain (e.g., [12]) has mainly applied techniques from the first category, while the effectiveness of focussed crawlers for environmental Web resources has not been previously investigated.

Focussed (or topical) crawlers exploit the graph structure of the Web for the discovery of resources about a given topic.
Starting from one or more seed URLs on the topic, they download the Web pages addressed by them and mine their content so as to extract the hyperlinks contained therein and select the ones that would lead them to pages relevant to the topic. This process is iteratively repeated until a sufficient number of pages is fetched (i.e., downloaded). Predicting the benefit of fetching an unvisited Web resource is a major challenge, since crawlers need to estimate its relevance to the topic at hand based solely on evidence obtained from the already downloaded pages. To this end, state-of-the-art approaches (see [13] for a review) adopt classifier-guided crawling strategies based on supervised machine learning; the hyperlinks are classified based on their local context, such as their anchor text and the textual content surrounding them in the parent page from which they were extracted, as well as on global evidence associated with the entire parent page, such as its textual content or its hyperlink structure.

Figure 1: Examples of environmental Web resources providing air quality measurements and forecasts (left-to-right): http://gems.ecmwf.int/, http://www.colorado.gov/airquality/air quality.aspx, http://www.sparetheair.org/Stay-Informed/Todays-Air-Quality/Five-Day-Forecast.aspx, http://airnow.gov.

This work investigates focussed crawling for the automatic discovery of environmental Web resources, in particular those providing air quality measurements and forecasts; see Figure 1 for some characteristic examples. Such resources report the concentration values of several air pollutants, such as sulphur dioxide (SO2), nitrogen oxides and dioxide (NO+NO2), thoracic particles (PM10), fine particles (PM2.5), and ozone (O3), measured or forecast for specific regions [9]. Empirical studies [8, 17, 11] have revealed that such measurements, and particularly air quality forecasts, are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, i.e., graphical representations of matrix data with colors representing pollutant concentrations over geographically bounded regions; see Figure 2 for an example.

Figure 2: Heatmap example extracted from http://silam.fmi.fi/.

This motivates us to form the hypothesis that the presence of a heatmap in a page already estimated to be an air quality resource indicates that it is indeed highly relevant to the topic. Therefore, if such a page has already been downloaded by a crawler focussed on air quality, it would be a useful source of global evidence for the selections to be subsequently performed by such a focussed crawler. To this end, this work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of (i) textual evidence from its local context and (ii) global visual evidence indicating the presence of a heatmap in its parent page. This is achieved by the late fusion of text and image classification confidence scores obtained by supervised machine learning methods based on Support Vector Machines (SVMs).

The main contribution of this work is a novel focussed crawling approach that takes into account multimedia (textual + visual) evidence for predicting the benefit of fetching an unvisited Web resource based on the combination of text and image classifiers. State-of-the-art classifier-guided focussed crawlers rely mainly on textual evidence [13] and, to the best of our knowledge, visual evidence has not been previously considered in this context.
The proposed classifier-guided focussed crawler is evaluated in the domain of air quality environmental Web resources, and the experimental results of our pilot study indicate improvements in the crawling precision when incorporating visual evidence, over the use of textual features alone.

The remainder of this paper is structured as follows. Section 2 discusses related work. Section 3 presents the proposed focussed crawling approach, Section 4 describes the evaluation setup, and Section 5 reports and analyses the experimental results. Section 6 concludes this work and outlines future research directions.

2. RELATED WORK

Focussed crawling techniques have been researched since the early days of the Web [7]. Based on the 'topical locality' observation that most Web pages link to other pages that are similar in content [6], focussed crawlers attempt to estimate the benefit of following a hyperlink extracted from an already downloaded page by mainly exploiting the (i) local context of the hyperlink and (ii) global evidence associated with its parent page.

Previous research has defined local context in textual terms as the lexical content that appears around a given hyperlink in its parent page. It may correspond to the anchor text of the hyperlink, a text window surrounding it, the words appearing in its URL, and combinations thereof. Virtually all focussed crawlers [7, 1, 20, 19, 15, 16, 13] use such textual evidence in one form or another. Global evidence, on the other hand, corresponds either to textual evidence, typically the lexical content of the parent page [16], or to hyperlink evidence, such as the centrality of the parent page within its neighbouring subgraph [1]. A systematic study of the effectiveness of various definitions of link context has found that crawling techniques that exploit terms both in the immediate vicinity of a hyperlink, as well as in its entire parent page, perform significantly better than those depending on just one of those cues [16].

Earlier focussed crawlers (e.g., [5]) estimated the relevance of the hyperlinks pointing to unvisited pages by computing the textual similarity of the hyperlinks' local context to a query corresponding to a textual representation of the topic at hand; this relevance score could also be smoothed by the textual similarity of the parent page to the same query. State-of-the-art focussed crawlers, though, use supervised machine learning methods to decide whether a hyperlink is likely to lead to a Web page on the topic or not [13].
Classifier-guided focussed crawlers, introduced by Chakrabarti et al. [1], rely on models typically trained using the content of Web pages relevant to the topic; positive samples are usually obtained from existing topic directories such as the Open Directory Project (ODP, http://www.dmoz.org/). A systematic evaluation of the relative merits of various classification schemes has shown that SVMs and Neural Network-based classifiers perform equally well in a focussed crawling application, with the former being more efficient, while Naive Bayes is a weak choice in this context [15]. This makes SVMs the classification scheme of choice in guiding focussed crawlers.

Focussed crawling has not really been previously explored in the environmental domain. The discovery of environmental Web resources has previously been addressed mainly through the submission of domain-specific queries to general-purpose search engines, followed by the application of a post-retrieval classification step for improving precision [12, 10]. The queries were generated using empirical information, including the incorporation of geographical terms [10], and were expanded using 'keyword spices' [14], i.e., a Boolean expression of domain-specific terms corresponding to the output of a decision tree trained on an appropriate corpus [12]. Post-retrieval classification was performed using SVMs trained on textual features extracted from a training corpus [12]. Such approaches are complementary to the discovery of Web resources using focussed crawlers, and hybrid approaches that combine the two techniques in a common framework are a promising research direction [11].

3. MULTIMEDIA FOCUSSED CRAWLING

Figure 3: Multimedia focussed crawling.

This work proposes a classifier-guided focussed crawling approach for the discovery of environmental Web resources providing air quality measurements and forecasts. To this end, it estimates the relevance of a hyperlink to an unvisited resource based on the combination of its local context with global evidence associated with its parent page. Local context refers to the textual content appearing in the vicinity of the hyperlink in the parent page. Motivated by the frequent occurrence of heatmaps in such Web resources, we consider the presence of a heatmap in a parent page as global evidence for its high relevance to the topic.

An overview of the proposed focussed crawling approach is depicted in Figure 3. First, the seed pages are added to the list of URLs to fetch. In each iteration, a URL is picked from the list and the page corresponding to this URL is fetched (i.e., downloaded) and parsed to extract its hyperlinks. In the simple case that the focussed crawler estimates the relevance of a hyperlink pointing to an unvisited page p based only on its local context, the decision to fetch p depends solely on the output of an appropriately trained text classifier. Therefore, a page is fetched if the confidence score s of the text-based classifier is above an experimentally set threshold t1.

However, there are cases in which the local context is not sufficient to effectively represent relevant hyperlinks, leading them to obtain low confidence scores below the set threshold t1, and thus to not being fetched by the focussed crawler. In this case, global evidence can be used for adjusting the estimate of the hyperlink's relevance. This is motivated by the 'topical locality' phenomenon of Web pages linking to other pages that are similar in content; therefore, if there is strong evidence of the parent page's relevance, then the relevance estimates of its children pages should be adjusted accordingly.

As mentioned before, the presence of heatmaps in a Web resource already assumed to be an air quality resource is a strong indication that it is indeed highly relevant to the topic. Therefore, we propose the consideration of heatmap presence in the parent page as global evidence to be used for adjusting the relevance estimate of hyperlinks with text-based confidence scores below the required threshold t1 (in practice, a lower bound threshold t2 is also set; this threshold is also experimentally tuned). In particular, the relevance estimate of each hyperlink is adjusted to correspond to the late fusion of a text and a heatmap classifier, score = f(text_classifier, heatmap_classifier), and the page is fetched if its score ≥ t1. In our case, a binary heatmap classifier is considered and the fusion function f is set to correspond to max. This results in a page being fetched if either its text-based confidence score is above t1, or if its text-based confidence score is above t2 (t2 < t1) and its parent page contains at least one heatmap; a sketch of this decision rule is given below.
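To make the crawl loop of Figure 3 and the max-based fusion rule concrete, the following is a minimal Python sketch. It is an illustration only, not the actual Apache Nutch implementation described in Section 4.4; the injected callables fetch_page, extract_links, text_score, and has_heatmap are hypothetical stand-ins for the components described in this section.

```python
from collections import deque

def focussed_crawl(seeds, fetch_page, extract_links, text_score,
                   has_heatmap, t1, t2, max_pages=1000):
    """Crawl loop sketch: follow a hyperlink if its text-based confidence
    score s satisfies s >= t1, or s >= t2 (with t2 < t1) and its parent
    page contains at least one heatmap; this is the late fusion with
    f = max over a binary heatmap classifier."""
    frontier = deque(seeds)            # list of URLs to fetch, seeded first
    fetched = set()
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        if url in fetched:
            continue
        page = fetch_page(url)         # fetch (i.e., download) and parse
        fetched.add(url)
        heatmap = has_heatmap(page)    # global visual evidence of the parent
        for link in extract_links(page):
            s = text_score(link)       # local textual evidence of the link
            if s >= t1 or (s >= t2 and heatmap):
                frontier.append(link)
    return fetched
```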
Next, the text and heatmap classifiers employed in this work are described.

3.1 Text-Based Link Classification

Text-based link classification is performed using a supervised machine learning approach based on SVMs and a variety of textual features extracted from the hyperlink's local context. SVMs are applied due to their demonstrated effectiveness in similar applications [15].

Each hyperlink is represented using textual features extracted from the following fields:

- a: the anchor text of the hyperlink;
- h: the terms extracted from the URL of the hyperlink; string sequences are split on punctuation marks, and common URL extensions (e.g., com) and prefixes (e.g., www) are removed;
- s: the terms extracted from a text window of 50 characters surrounding the hyperlink; this text window does not contain the anchor text of adjacent links (i.e., the window stops as soon as it encounters another link);
- so: the terms extracted from a text window of 50 characters surrounding the hyperlink when overlap with the anchor text of adjacent links is allowed.

Combinations of the above lead to the following five representations corresponding to concatenations of the respective fields: a+s, a+so, a+h, a+h+s, and a+h+so.
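These field definitions translate directly into simple term-extraction routines. The following Python sketch shows one plausible reading of them; the stop-part list and the span-based handling of adjacent links are our assumptions, not the paper's exact implementation.

```python
import re

# Assumed list of common URL prefixes/extensions to drop for field h.
STOP_PARTS = {"www", "com", "org", "net", "html", "htm", "php", "aspx",
              "http", "https"}

def url_terms(url):
    """Field h: split the URL on punctuation and drop common
    extensions and prefixes (e.g., 'com', 'www')."""
    parts = re.split(r"[^A-Za-z0-9]+", url.lower())
    return [p for p in parts if p and p not in STOP_PARTS]

def window_terms(page_text, anchor_span, size=50, overlap=False, link_spans=()):
    """Fields s / so: terms from a 50-character window around the hyperlink;
    with overlap=False the window is truncated at adjacent links."""
    start = max(0, anchor_span[0] - size)
    end = min(len(page_text), anchor_span[1] + size)
    if not overlap:
        for a, b in link_spans:          # stop at neighbouring anchors
            if b <= anchor_span[0]:
                start = max(start, b)
            if a >= anchor_span[1]:
                end = min(end, a)
    return re.findall(r"[a-z0-9]+", page_text[start:end].lower())

def link_representation(anchor_text, url, page_text, anchor_span,
                        link_spans=(), scheme="a+so"):
    """Concatenate the fields requested by the scheme (a, a+s, a+so,
    a+h, a+h+s, a+h+so) into a single bag of terms."""
    terms = re.findall(r"[a-z0-9]+", anchor_text.lower())   # field a
    if "h" in scheme:
        terms += url_terms(url)
    if "so" in scheme:
        terms += window_terms(page_text, anchor_span, overlap=True)
    elif "s" in scheme:
        terms += window_terms(page_text, anchor_span, link_spans=link_spans)
    return terms
```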
In the training phase, a list of positive and negative samples is collected first, so as to build a vocabulary for representing the samples in the textual feature space and also for training the model. Each sample corresponds to a hyperlink pointing to a Web page on air quality measurements and forecasts and its associated a+so representation. The vocabulary is built by accumulating all the terms from the a+so representations of the samples and eliminating all stopwords. This representation was selected so as to lead to a richer feature space, compared to the sparser a, s, and a+s representations, while also remaining relatively noise free compared to the a+h+s and a+h+so representations, which are likely to contain more noise given the difficulties in successfully parsing URLs.

Each sample is represented in the textual feature space spanned by the created vocabulary using a tf.idf weighting scheme, tf.idf(t, d) = tf(t, d) × log(n / df(t)), where tf(t, d) is the frequency of term t in sample d, and log(n / df(t)) is the inverse document frequency idf(t) of term t in the collection of n samples, with df(t) being the number of samples containing that term. Furthermore, a feature representing the number of geographical terms in the sample's a+so representation is added, given the importance of such terms in the environmental domain [12]. To avoid overestimating their effect, such geographical terms were previously removed from the vocabulary that was built. The SVM classifier is built using an RBF kernel, and 5-fold cross-validation is performed on the training set to select the class weight parameters.

In the testing phase, each sample is represented as a feature vector based on the tf.idf of the terms extracted from one of the proposed representation schemes (a, a+s, a+so, a+h, a+h+s, or a+h+so) and the number of geographical terms within the same representation. The text-based classification score of each hyperlink is then obtained by employing the classifier on the feature vector and corresponds to a confidence value that reflects the distance of the testing sample to the hyperplane.

Our model was trained using 711 samples (100 positive, 611 negative). Each sample corresponds to a hyperlink pointing to a page providing air quality measurements and forecasts; these hyperlinks were extracted from 26 pages about air quality obtained from ODP and previous empirical studies conducted by domain experts in the context of the project PESCaDO (Personalised Environmental Service Configuration and Delivery Orchestration, http://www.pescado-project.eu/). It should be noted that both the hyperlinks and their parent pages are different from the seed set used in the evaluation of the focussed crawler (see Section 4). The generated lexicon consists of 207 terms, with the following being the 10 most frequent in the training corpus: days, ozone, air, data, quality, today, forecast, yesterday, raw, and current. The geographical lexicon consists of 3,625 terms obtained from a geographical database.
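A compact sketch of this feature construction follows, assuming term lists produced as in Section 3.1; the helper names are ours, and the resulting vectors would feed an RBF-kernel SVM (the paper's classifiers are built with Weka and LIBSVM, see Section 4.4).

```python
import math
from collections import Counter

def document_frequencies(samples):
    """df(t): number of training samples containing term t."""
    df = Counter()
    for terms in samples:
        df.update(set(terms))
    return df

def build_vocabulary(samples, stopwords, geo_terms):
    """Vocabulary over the a+so representations, with stopwords removed;
    geographical terms are excluded and counted as a separate feature."""
    vocab = set()
    for terms in samples:
        vocab.update(t for t in terms
                     if t not in stopwords and t not in geo_terms)
    return sorted(vocab)

def feature_vector(terms, vocab, df, n, geo_terms):
    """tf.idf(t, d) = tf(t, d) * log(n / df(t)) over the vocabulary, plus
    one extra feature counting the geographical terms in the sample."""
    tf = Counter(terms)
    vec = [tf[t] * math.log(n / df[t]) if df.get(t) else 0.0 for t in vocab]
    vec.append(sum(1 for t in terms if t in geo_terms))   # geo-term count
    return vec
```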
3.2 Heatmap Recognition

Heatmap recognition is performed by applying an approach recently developed by our research group [10]. That investigation on heatmap binary classification using SVMs and a variety of visual features indicated that, overall, the MPEG-7 [3] descriptors demonstrated a slightly better performance than the other tested visual features (SIFT [4] and the Adaptive Hierarchical Density Histogram (AHDH) [18]).

In particular, the following three extracted MPEG-7 features that capture color and texture aspects of human perception were the most effective:

- Scalable Color Descriptor (SC): a Haar-transform based encoding scheme that measures the color distribution over an entire image, quantized uniformly to 256 bins;
- Edge Histogram Descriptor (EH): a scale invariant visual texture descriptor that captures the spatial distribution of edges; it involves the division of the image into 16 non-overlapping blocks, with edge information calculated for each block in five edge categories;
- Homogeneous Texture Descriptor (HT): a descriptor of the directionality, coarseness, and regularity of patterns in images, based on a filter bank approach that employs scale and orientation sensitive filters.

Their early fusion (SC-EH-HT), as well as the feature EH on its own, produced the best results when employing an SVM classifier with an RBF kernel. The evaluation was performed by training the classifier on a dataset of 2,200 images (600 relevant, i.e., heatmaps) and testing it on a dataset of 2,860 images (1,170 heatmaps); both datasets are available at http://mklab.iti.gr/project/heatmaps.

In this work, both the EH and the SC-EH-HT models trained on the first dataset are employed. An image is classified as a heatmap if at least one of these classifiers considers it to be a heatmap, i.e., a late fusion approach based on a logical OR is applied.
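The OR-based late fusion of the two heatmap models can be expressed in a few lines. This is a sketch assuming scikit-learn-style binary classifiers and a hypothetical descriptors() helper returning the three MPEG-7 vectors; the paper's actual visual classifier is based on LIBSVM [2].

```python
def early_fusion(sc, eh, ht):
    """Early fusion SC-EH-HT: concatenate the three MPEG-7 descriptor
    vectors into a single feature vector."""
    return list(sc) + list(eh) + list(ht)

def is_heatmap(image, eh_clf, fused_clf, descriptors):
    """Late fusion by logical OR: the image is a heatmap if either the
    EH model or the SC-EH-HT model classifies it as one."""
    sc, eh, ht = descriptors(image)   # MPEG-7 SC, EH, HT feature vectors
    return bool(eh_clf.predict([eh])[0]) or \
           bool(fused_clf.predict([early_fusion(sc, eh, ht)])[0])
```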
Table 1: List of seed URLs ('X' marks the seeds containing at least one heatmap).

     URL                                                                heatmap present
  1. http://aircarecolorado.com/
  2. http://airnow.gov/                                                 X
  3. http://db.eurad.uni-koeln.de/en/                                   X
  4. http://gems.ecmwf.int/                                             X
  5. http://maps.co.mecklenburg.nc.us/website/airquality/default.php    X
  6. http://uk-air.defra.gov.uk/
  7. http://www.baaqmd.gov/The-Air-District.aspx
  8. http://www.eea.europa.eu/
  9. http://www.gmes-atmosphere.eu/
 10. http://www.londonair.org.uk/LondonAir/Default.aspx                 X

4. EVALUATION

A pilot study is performed for evaluating the performance of the proposed focussed crawling approach. A set of 10 seeds (listed in Table 1), different from the URLs used when training the classifiers, was selected similarly to before, i.e., using ODP and the outcomes of empirical studies conducted by domain experts in the context of the project PESCaDO. Half of them contain at least one heatmap. Starting from these 10 seeds, a crawl at depth 1 is performed. A total of 807 hyperlinks are extracted from these 10 seeds, and several focussed crawling approaches are applied for deciding which ones to fetch. These are evaluated in the following two sets of experiments.

4.1 Experiments

Experiment 1: This experiment examines the relative merits of the different text-based representations of hyperlinks (i.e., a, a+s, a+so, a+h, a+h+s, and a+h+so). In this case, a text-based classifier-guided focussed crawl is applied for each representation, and a page is fetched if its text-based confidence score is above a threshold t1. Experiments are performed for t1 values ranging from 0.0 to 0.9 at step 0.1. When t1 = 0.0, the crawl corresponds to a breadth-first search where all hyperlinks are fetched and no focussed crawling is performed.

Experiment 2: This experiment investigates the effectiveness of incorporating multimedia evidence in the form of heatmaps in the crawling process. In this case, a page pointed to by a hyperlink is fetched if the hyperlink's text-based confidence score is above t1, or if its text-based confidence score is above t2 (t2 < t1) and its parent page contains at least one heatmap. The text-based confidence scores are obtained from the best performing classifier in Experiment 1. Experiments are performed for t1 and t2 values ranging from 0.0 to 0.9 at step 0.1, while maintaining t2 < t1. These experimental results are compared against two baselines: (i) the results of the corresponding text-based focussed crawler for threshold t1, and (ii) the results of the corresponding text-based focussed crawler for threshold t2.

To determine the presence of a heatmap in the parent page of a hyperlink, the page is parsed (since it is already downloaded) and the hyperlinks pointing to images are compiled into a list. The crawler iteratively downloads each image in the list, extracts its visual features, and applies the heatmap classification until a heatmap is recognised or a maximum number of images has been downloaded from the page (set to 20 in our experiments); a sketch of this procedure is given below.

In both experiments, when a hyperlink appears more than once within a seed page, only the one with the highest score is taken into consideration for evaluation purposes.
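The per-page heatmap check described above might look as follows; a minimal sketch, assuming a download_image() fetcher and the is_heatmap() fusion from Section 3.2 are available.

```python
def parent_has_heatmap(image_urls, download_image, is_heatmap, max_images=20):
    """Scan the image links of an already downloaded parent page,
    classifying each image in turn until a heatmap is recognised or
    max_images (20 in our experiments) have been downloaded."""
    for url in image_urls[:max_images]:
        image = download_image(url)          # may return None on failure
        if image is not None and is_heatmap(image):
            return True
    return False
```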
4.2 Performance Metrics

The standard retrieval evaluation metrics of precision and recall are typically applied for assessing the effectiveness of a focussed crawler. Precision corresponds to the proportion of fetched pages that are relevant, and recall to the proportion of all relevant pages that are fetched. The latter requires knowledge of all relevant pages on a given topic, an impossible task in the context of the Web. To address this limitation, two recall-oriented evaluation techniques have been proposed [13]: (i) manually designate a few representative pages on the topic and measure what fraction of them are discovered by the crawler, and (ii) measure the overlap among independent crawls initiated from different seeds to see whether they converge on the same set of pages. Given the small scope of our study (i.e., a crawl at depth 1), these approaches are not applicable, and therefore recall is not considered in our evaluation. In addition to precision, the accuracy of the classification of the crawled outlinks is also reported.

4.3 Relevance Assessments

All 807 extracted hyperlinks were manually assessed. After applying some light URL normalisation (e.g., deleting trailing slashes) and removing duplicates, 689 unique URLs remain. These correspond both to internal (within-site) and to external links that were assessed using the following three-point relevance scale:

- (highly) relevant: Web resources that provide air quality measurements and forecasts. These data should either be visible on the page or should appear after selecting a particular value from the options (e.g., region, pollutant, time of day, etc.) provided in drop-down menus.

- partially relevant: Web resources that are about air quality measurements and forecasts, but do not provide actual data. Examples include Web resources that list monitoring sites and the pollutants being measured, explain what such measurements mean, describe methods, approaches, and research for measuring, validating, and forecasting air quality data, or provide links to components, systems, and applications that measure air quality.

- non-relevant: Web resources that are not relevant to air quality measurements and forecasts, including resources that are about air quality and pollution in general, discussing, for instance, its causes and effects.

Overall, our crawled dataset contains 232 (33.7%) highly relevant pages, 51 (7.4%) partially relevant, and 406 (58.9%) non-relevant ones.

A closer inspection revealed that 162 (69.8%) of the highly relevant pages were all crawled from seed no. 2 in Table 1 (http://airnow.gov/). These correspond to internal links pointing to pages with air quality measurements/forecasts, each regarding a different U.S. region. This, in conjunction with the fact that all these links obtained really high scores (over 0.9) from our text classifier, led us to remove them from further consideration, as they would significantly skew the evaluation results. Therefore, the evaluation was performed only for the pages crawled from the nine remaining seeds, and these are the results reported in Section 5. (It should be noted that http://airnow.gov/ appears in the list of our crawled pages even when removed from the seed list, since it is linked from other seed pages. However, since crawling is performed at depth 1, its own outlinks are not considered any further.) Starting from the 9 seeds, our crawled dataset contains 526 URLs: 70 (13.3%) highly relevant pages, 50 (9.5%) partially relevant, and 406 (77.2%) non-relevant ones.

To apply the performance metrics presented above, these multiple grade relevance assessments are mapped into binary relevance judgements in two different ways, depending on whether we are strictly interested in discovering resources containing air quality data, or whether we would also be interested in information about air quality measurements and forecasts. In particular, two mappings are considered:

- strict: only highly relevant Web resources are considered relevant, and the rest (partially relevant and non-relevant) as non-relevant;

- lenient: both highly relevant and partially relevant Web resources are considered relevant.

The distributions of relevance assessments in these two cases are listed in Table 2.

Table 2: Relevance assessment distributions when the 3-point scale judgements are mapped to binary.

                  Strict          Lenient
  Relevant        70 (13.3%)      120 (22.8%)
  Non-Relevant    456 (86.7%)     406 (77.2%)
  All             526 (100.0%)    526 (100.0%)

4.4 Implementation

Our implementation is based on Apache Nutch (http://nutch.apache.org/), a highly extensible and scalable open source Web crawler software project. To convert it to a focussed crawler, its parser was modified so as to filter the links being fetched based on our proposed approach. The text-based classifier was implemented using the libraries of the Weka machine learning software (http://www.cs.waikato.ac.nz/ml/weka/), while the implementation of the visual classifier was based on the LIBSVM [2] library.
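For reference, the metrics of Section 4.2 combined with the strict/lenient mappings of Section 4.3 amount to the following small computation; a sketch in which the 3-point grades are encoded as 2/1/0 by assumption.

```python
def binarise(grade, mapping="strict"):
    """Map the 3-point scale (2 = highly relevant, 1 = partially relevant,
    0 = non-relevant) to a binary judgement."""
    return grade == 2 if mapping == "strict" else grade >= 1

def precision(fetched_grades, mapping="strict"):
    """Proportion of fetched pages that are relevant."""
    rel = [binarise(g, mapping) for g in fetched_grades]
    return sum(rel) / len(rel) if rel else 0.0

def accuracy(fetch_decisions, grades, mapping="strict"):
    """Proportion of crawled outlinks whose fetch/skip decision agrees
    with their binarised relevance assessment."""
    correct = sum(d == binarise(g, mapping)
                  for d, g in zip(fetch_decisions, grades))
    return correct / len(grades) if grades else 0.0
```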
5. RESULTS

Experiment 1: The results of the first experiment, which evaluates the effectiveness of the different textual representations employed by the text-based focussed crawler, are depicted in Figures 4 and 5, when applying strict and lenient relevance assessments, respectively.

Figure 4: Precision and accuracy of the focussed crawl for each text-based link classification method (a+h, a+s, a+so, a+h+s, a+h+so) for threshold t1 ∈ {0, 0.1, ..., 0.9} when strict relevance assessments are employed.

Figure 5: Precision and accuracy of the focussed crawl for each text-based link classification method (a+h, a+s, a+so, a+h+s, a+h+so) for threshold t1 ∈ {0, 0.1, ..., 0.9} when lenient relevance assessments are employed.

The a+s classifier-guided focussed crawler achieves the highest overall precision, both for the strict and the lenient cases, and for t1 = 0.4, indicating the benefits of combining the anchor text with the terms obtained from a non-overlapping text window. It also achieves the highest accuracy, which is equal to that of the a+h and a+h+s classifiers; these two classifiers, though, have slightly lower precision compared to that of a+s. This indicates that the URL is potentially a useful source of evidence and that the application of more advanced techniques for extracting terms from a URL is probably required for reaching its full potential. The a+so and a+h+so classifiers are the least effective for lower t1 values, indicating that the additional terms present in the overlapping text window introduce noise that leads to the misclassification of non-relevant hyperlinks.
Furthermore, all focussed crawlers improve upon the precision obtained for t1 = 0.0, which corresponds to general-purpose crawling. As expected, the absolute values of precision are much higher in the lenient case compared to the strict one.

Experiment 2: The second experiment aims to provide insights into the feasibility and potential benefits of incorporating multimedia evidence in the form of heatmaps in the crawling process. To this end, it combines a+s, the best performing text-based classifier from the first experiment, with the results of the heatmap classifier. First, the results of the heatmap classification are presented.

Each of the nine seeds contains 15 images on average, as identified by our parser. On average, 8 images are downloaded from each seed before a heatmap is found or the image list ends. Out of the 75 downloaded images, 74 were correctly classified, with 3 being heatmaps. This means that 8 of the 9 seeds were classified accurately for the presence of heatmaps in them (all apart from seed no. 10 in Table 1). This is probably due to the difficulty in parsing this specific Web resource and also in recognising its images as heatmaps, as they correspond to non-typical heatmaps, different to the ones in our training set. On average, 10 seconds were required per Web resource for the downloading, feature extraction, and classification of its images; however, this overhead could be reduced by applying parallelisation.

Tables 3 and 4 present the results of the second experiment, when applying strict and lenient relevance assessments, respectively, for t1 and t2 values ranging from 0.0 to 0.9 at step 0.1, while maintaining t2 < t1. The results are compared against the two baselines listed in each table's last column and last row, respectively; values exceeding both baselines correspond to improvements.

Table 3: Precision of the focussed crawler that combines the a+s text-based link classifier with the heatmap classifier for thresholds t1 ∈ {0.1, ..., 0.9} and t2 ∈ {0.0, ..., 0.8}, when strict relevance assessments are employed. The last column is the text-based baseline a+s (fetch if s ≥ t1); the last row is the text-based baseline a+s (fetch if s ≥ t2).

  t1 \ t2:  0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8   | baseline (s ≥ t1)
  0.1       0.215                                                         | 0.294
  0.2       0.213  0.314                                                  | 0.296
  0.3       0.206  0.299  0.284                                           | 0.292
  0.4       0.214  0.346  0.340  0.354                                    | 0.327
  0.5       0.215  0.353  0.347  0.362  0.333                             | 0.333
  0.6       0.214  0.362  0.356  0.372  0.341  0.333                      | 0.324
  0.7       0.222  0.405  0.400  0.421  0.385  0.382  0.364               | 0.310
  0.8       0.222  0.405  0.400  0.421  0.385  0.382  0.364  0.310        | 0.308
  0.9       0.221  0.421  0.417  0.441  0.400  0.400  0.379  0.320  0.318 | 0.300
  baseline
  (s ≥ t2)  0.137  0.294  0.296  0.292  0.327  0.333  0.324  0.310  0.308

Table 4: Precision of the focussed crawler that combines the a+s text-based link classifier with the heatmap classifier for thresholds t1 ∈ {0.1, ..., 0.9} and t2 ∈ {0.0, ..., 0.8}, when lenient relevance assessments are employed. The last column is the text-based baseline a+s (fetch if s ≥ t1); the last row is the text-based baseline a+s (fetch if s ≥ t2).

  t1 \ t2:  0.0    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8   | baseline (s ≥ t1)
  0.1       0.360                                                         | 0.518
  0.2       0.360  0.571                                                  | 0.549
  0.3       0.350  0.552  0.537                                           | 0.554
  0.4       0.352  0.615  0.620  0.646                                    | 0.612
  0.5       0.347  0.608  0.612  0.638  0.604                             | 0.548
  0.6       0.343  0.617  0.622  0.651  0.614  0.538                      | 0.541
  0.7       0.333  0.619  0.625  0.658  0.615  0.529  0.515               | 0.483
  0.8       0.333  0.619  0.625  0.658  0.615  0.529  0.515  0.483        | 0.500
  0.9       0.328  0.632  0.639  0.676  0.629  0.533  0.517  0.480  0.500 | 0.500
  baseline
  (s ≥ t2)  0.217  0.518  0.549  0.554  0.612  0.548  0.541  0.483  0.500

The observed substantial improvements for multiple threshold values provide an indication of the benefits of incorporating visual evidence as global evidence in a focussed crawler. Consider the best performing configuration when strict relevance assessments are employed: it achieves a precision of 0.44 for t1 = 0.9 and t2 = 0.3, while the text-based focussed crawler for the same t1 = 0.9 achieves a precision of 0.30. An examination of the results shows that the improvements are due to the fact that 65% of the newly added hyperlinks, i.e., those with a text-based classification score between 0.3 and 0.9, are relevant.

6. CONCLUSIONS

This work proposed a novel classifier-guided focussed crawling approach for the discovery of environmental Web resources providing air quality measurements and forecasts that combines multimedia (textual + visual) evidence for predicting the benefit of fetching an unvisited Web resource. The results of our pilot study provide a first indication of the effectiveness of incorporating visual evidence in the focussed crawling process over the use of textual features alone.

Large-scale experiments are currently planned for fully assessing the potential benefits of the proposed multimedia focussed crawling approach, including experiments for improving the effectiveness of the textual classification by also taking into account the textual content of the entire parent page, similar to previous research [16]. Further future work includes the consideration of other types of images common in environmental Web resources, such as diagrams, simple filtering mechanisms for removing, prior to classification, small-size images that are unlikely to contain useful information (e.g., logos and layout elements), and the incorporation of additional local evidence, such as the distance of the hyperlink to the heatmap image.
Discovery of in other domains where information is commonly encoded environmental nodes in the web. In Multidisciplinary in multimedia form, such as food recipes. Information Retrieval, Proceedings of the 5th International Retrieval Facility Conference (IRFC 7. ACKNOWLEDGMENTS 2012), volume 7356 of LNCS, pages 58–72, 2012. This work was supported by MULTISENSOR (contract [13] C. Olston and M. Najork. Web crawling. Foundations no. FP7–610411) and HOMER (contract no. FP7–312388) and Trends in Information Retrieval, 4(3):175–246, projects, partially funded by the European Commission. 2010. [14] S. Oyama, T. Kokubo, and T. Ishida. Domain-specific web search with keyword spices. IEEE Transactions 8. REFERENCES on Knowledge and Data Engineering, 16(1):17–27, [1] S. Chakrabarti, M. van den Berg, and B. Dom. Jan. 2004. Focused crawling: A new approach to topic-specific [15] G. Pant and P. Srinivasan. Learning to crawl: web resource discovery. In Proceedings of the 8th Comparing classification schemes. ACM Transactions International Conference on World Wide Web, on Information Systems, 23(4):430–462, 2005. (WWW 1999), pages 1623–1640, 1999. [16] G. Pant and P. Srinivasan. Link contexts in [2] C. C. Chang and C. J. Lin. LIBSVM: a library for classifier-guided topical crawlers. IEEE Transactions support vector machines. ACM Transactions on on Knowledge and Data Engineering, 18(1):107–122, Intelligent Systems and Technology (TIST), 2(3):27, 2006. 2011. [17] R. San José, A. Baklanov, R. Sokhi, K. Karatzas, and [3] S. F. Chang, T. Sikora, and A. Puri. Overview of the J. Pérez. Computational air quality modelling. MPEG-7 standard. IEEE Transactions on Circuits Developments in Integrated Environmental and Systems for Video Technology, 11(6):688–695, Assessment, 3:247–267, 2008. 2001. [18] P. Sidiropoulos, S. Vrochidis, and I. Kompatsiaris. [4] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and Content-based binary image retrieval using the A. Zisserman. The devil is in the details: an adaptive hierarchical density histogram. Pattern evaluation of recent feature encoding methods. In Recognition, 44(4):739 – 750, 2011. Proceedings of the British Machine Vision Conference [19] T. T. Tang, D. Hawking, N. Craswell, and (BMVC 2011), pages 1–12, 2011. K. Griffiths. Focused crawling for both topical [5] J. Cho, H. Garcia-Molina, and L. Page. Efficient relevance and quality of medical information. In crawling through URL ordering. Computer Networks, Proceedings of the 14th ACM International Conference 30(1-7):161–172, 1998. on Information and Knowledge Management, (CIKM [6] B. D. Davison. Topical locality in the web. In 2005), pages 147–154, 2005. Proceedings of the 23rd Annual International ACM [20] T. T. Tang, D. Hawking, N. Craswell, and R. S. SIGIR Conference on Research and Development in Sankaranarayana. Focused crawling in depression Information Retrieval, (SIGIR 2000), pages 272–279, portal search: A feasibility study. In Proceedings of the 2000. 9th Australasian Document Computing Symposium [7] P. De Bra and R. D. J. Post. Information retrieval in (ADCS 2004), pages 1–9, 2004. the world-wide web: Making client-based searching feasible. Computer Networks and ISDN Systems, 27(2):183–192, 1994. [8] V. Epitropou, K. Karatzas, and A. Bassoukos. A method for the inverse reconstruction of environmental data applicable at the chemical weather portal. In Proceedings of the GI-Forum Symposium and Exhibit on Applied Geoinformatics, pages 58–68, 2010. 68