   Focussed Crawling of Environmental Web Resources:
  A Pilot Study on the Combination of Multimedia Evidence

                                        Theodora Tsikrika                          Anastasia Moumtzidou

                                        Stefanos Vrochidis                          Ioannis Kompatsiaris
                                                    Information Technologies Institute
                                                Centre for Research and Technology Hellas
                                                           Thessaloniki, Greece
                                    {theodora.tsikrika, moumtzid, stefanos, ikom}@iti.gr

ABSTRACT
This work investigates the use of focussed crawling techniques for the discovery of environmental multimedia Web resources that provide air quality measurements and forecasts. Focussed crawlers automatically navigate the hyperlinked structure of the Web and select the hyperlinks to follow by estimating their relevance to a given topic, based on evidence obtained from the already downloaded pages. Given that air quality measurements and particularly air quality forecasts are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps, we propose the combination of textual and visual evidence for predicting the benefit of fetching an unvisited Web resource. First, text classification is applied to select the relevant hyperlinks based on their anchor text, a surrounding text window, and URL terms. Further hyperlinks are selected by combining their text classification score with an image classification score that indicates the presence of heatmaps in their source page. A pilot evaluation indicates that the combination of textual and visual evidence results in improvements in the crawling precision over the use of textual features alone.

Categories and Subject Descriptors
H.3 [Information Systems]: Information Storage and Retrieval

General Terms
Algorithms, Performance, Design, Experimentation

Keywords
focussed crawling, environmental data, link context, image classification, heatmaps

Copyright © by the paper’s authors. Copying permitted only for private and academic purposes.
In: S. Vrochidis, K. Karatzas, A. Karpinnen, A. Joly (eds.): Proceedings of the International Workshop on Environmental Multimedia Retrieval (EMR 2014), Glasgow, UK, April 1, 2014, published at http://ceur-ws.org

1.    INTRODUCTION
   Environmental conditions, such as the weather, air quality, and pollen concentration, are considered among the factors with a strong impact on the quality of life, since they directly affect human health (e.g., allergies and asthma), a variety of human outdoor activities (ranging from agriculture to sports and travel planning), as well as major environmental issues (such as the greenhouse effect). In order to support both scientists in forecasting environmental phenomena and also people in everyday action planning, there is a need for services that provide access to information related to environmental conditions that is gathered from several sources, with a view to obtaining reliable data. Monitoring stations established by environmental organisations and agencies typically perform such measurements and make them available, most commonly, through Web resources, such as pages, sites, and portals. Assembling and integrating information from several such providers is a major challenge, which requires, as a first step, the automatic discovery of Web resources that contain environmental measurement data; this can be cast as a domain–specific search problem.
   Domain–specific search is mainly addressed by techniques that fall into two categories: (i) the submission of domain–specific queries to a general–purpose search engine followed by post–retrieval filtering, and (ii) focussed crawling. Past research in the environmental domain (e.g., [12]) has mainly applied techniques from the first category, while the effectiveness of focussed crawlers for environmental Web resources has not been previously investigated.
   Focussed (or topical) crawlers exploit the graph structure of the Web for the discovery of resources about a given topic. Starting from one or more seed URLs on the topic, they download the Web pages addressed by them and mine their content so as to extract the hyperlinks contained therein and select the ones that would lead them to pages relevant to the topic. This process is iteratively repeated until a sufficient number of pages is fetched (i.e., downloaded). Predicting the benefit of fetching an unvisited Web resource is a major challenge, since crawlers need to estimate its relevance to the topic at hand based solely on evidence obtained from the already downloaded pages. To this end, state–of–the–art approaches (see [13] for a review) adopt classifier–guided crawling strategies based on supervised machine learning; the hyperlinks are classified based on their local context, such as their anchor text and the textual content surrounding them in the parent page from which they were extracted, as well as on global evidence associated with the entire parent


Figure 1: Examples of environmental Web resources providing air quality measurements and forecasts (left–to–right): http://gems.ecmwf.int/, http://www.colorado.gov/airquality/air_quality.aspx, http://www.sparetheair.org/Stay-Informed/Todays-Air-Quality/Five-Day-Forecast.aspx, http://airnow.gov.


page, such as its textual content or its hyperlink structure.
   This work investigates focussed crawling for the automatic discovery of environmental Web resources, in particular those providing air quality measurements and forecasts; see Figure 1 for some characteristic examples. Such resources report the concentration values of several air pollutants, such as sulphur dioxide (SO2), nitrogen oxides and dioxide (NO+NO2), thoracic particles (PM10), fine particles (PM2.5), and ozone (O3), measured or forecast for specific regions [9]. Empirical studies [8, 17, 11] have revealed that such measurements and particularly air quality forecasts are presented not only in textual form, but are most commonly encoded as multimedia, mainly in the form of heatmaps (i.e., graphical representations of matrix data with colors representing pollutant concentrations over geographically bounded regions); see Figure 2 for an example.

Figure 2: Heatmap example extracted from http://silam.fmi.fi/.

   This motivates us to form the hypothesis that the presence of a heatmap in a page already estimated to be an air quality resource indicates that it is indeed highly relevant to the topic. Therefore, if such a page has already been downloaded by a crawler focussed on air quality, it would be a useful source of global evidence for the selections to be subsequently performed by such a focussed crawler. To this end, this work proposes a classifier–guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of (i) textual evidence from its local context and (ii) global visual evidence indicating the presence of a heatmap in its parent page. This is achieved by the late fusion of text and image classification confidence scores obtained by supervised machine learning methods based on Support Vector Machines (SVMs).
   The main contribution of this work is a novel focussed crawling approach that takes into account multimedia (textual + visual) evidence for predicting the benefit of fetching an unvisited Web resource, based on the combination of text and image classifiers. State–of–the–art classifier–guided focussed crawlers rely mainly on textual evidence [13] and, to the best of our knowledge, visual evidence has not been previously considered in this context. The proposed classifier–guided focussed crawler is evaluated in the domain of air quality environmental Web resources, and the experimental results of our pilot study indicate improvements in the crawling precision when incorporating visual evidence, over the use of textual features alone.
   The remainder of this paper is structured as follows. Section 2 discusses related work. Section 3 presents the proposed focussed crawling approach, Section 4 describes the evaluation setup, and Section 5 reports and analyses the experimental results. Section 6 concludes this work and outlines future research directions.

2.    RELATED WORK
   Focussed crawling techniques have been researched since the early days of the Web [7]. Based on the ‘topical locality’ observation that most Web pages link to other pages that are similar in content [6], focussed crawlers attempt to estimate the benefit of following a hyperlink extracted from an already downloaded page by mainly exploiting the (i) local context of the hyperlink and (ii) global evidence associated with its parent page.
   Previous research has defined local context in textual terms as the lexical content that appears around a given hyperlink in its parent page. It may correspond to the anchor text of the hyperlink, a text window surrounding it, the words appearing in its URL, and combinations thereof. Virtually all focussed crawlers [7, 1, 20, 19, 15, 16, 13] use such textual evidence in one form or another. Global evidence, on the other hand, corresponds either to textual evidence, typically the lexical content of the parent page [16], or to hyperlink evidence, such as the centrality of the parent page within its neighbouring subgraph [1]. A systematic study of the


effectiveness of various definitions of link context has found
that crawling techniques that exploit terms both in the im-
mediate vicinity of a hyperlink, as well as in its entire parent
page, perform significantly better than those depending on
just one of those cues [16].
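As an illustration, the window-based link context described above can be sketched in a few lines. The helper below is hypothetical (its name and defaults are ours, not from any of the cited crawlers) and assumes the parent page has already been reduced to plain text:

```python
def link_context(page_text: str, anchor: str, window: int = 50) -> str:
    """Anchor text of a hyperlink plus up to `window` characters of the
    text that surrounds it on each side in the parent page."""
    pos = page_text.find(anchor)
    if pos < 0:
        return anchor  # anchor not found: fall back to the anchor alone
    start = max(0, pos - window)
    end = min(len(page_text), pos + len(anchor) + window)
    return page_text[start:end]
```

A crawler would feed such context strings, rather than whole pages, to whatever scoring mechanism it uses.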
   Earlier focussed crawlers (e.g., [5]) estimated the relevance
of the hyperlinks pointing to unvisited pages by comput-
ing the textual similarity of the hyperlinks’ local context
to a query corresponding to a textual representation of the
topic at hand; this relevance score could also be smoothed
by the textual similarity of the parent page to the same
query. State–of–the–art focussed crawlers, though, use su-
pervised machine learning methods to decide whether a hy-
perlink is likely to lead to a Web page on the topic or
not [13]. Classifier–guided focussed crawlers, introduced by
Chakrabarti et al. [1], rely on models typically trained using
the content of Web pages relevant to the topic; positive sam-
ples are usually obtained from existing topic directories such
as the Open Directory Project1 (ODP). A systematic evalu-
ation on the relative merits of various classification schemes
has shown that SVMs and Neural Network–based classifiers
perform equally well in a focussed crawling application, with
the former being more efficient, while Naive Bayes is a weak
choice in this context [15]. This makes SVMs the classifica-
tion scheme of choice in guiding focussed crawlers.
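For contrast with the trained-classifier approach, the similarity-based scoring of the earlier crawlers described above can be sketched as a bag-of-words cosine similarity between a hyperlink's local context and a textual topic query. This is a minimal stdlib-only sketch; the helper names are ours:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words frequency vectors."""
    dot = sum(cnt * b[t] for t, cnt in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def link_relevance(context: str, topic: str) -> float:
    """Relevance of a hyperlink's local context to a textual topic query."""
    return cosine(Counter(context.lower().split()),
                  Counter(topic.lower().split()))
```

A classifier-guided crawler replaces `link_relevance` with the confidence score of a model trained on labelled link contexts.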
   Focussed crawling has not really been previously explored in the environmental domain. The discovery of environmental Web resources has previously been addressed mainly through the submission of domain–specific queries to general–purpose search engines, followed by the application of a post–retrieval classification step for improving precision [12, 10]. The queries were generated using empirical information, including the incorporation of geographical terms [10], and were expanded using ‘keyword spices’ [14], i.e., a Boolean expression of domain–specific terms corresponding to the output of a decision tree trained on an appropriate corpus [12]. Post–retrieval classification was performed using SVMs trained on textual features extracted from a training corpus [12]. Such approaches are complementary to the discovery of Web resources using focussed crawlers, and hybrid approaches that combine the two techniques in a common framework are a promising research direction [11].

1 http://www.dmoz.org/.

3.    MULTIMEDIA FOCUSSED CRAWLING
   This work proposes a classifier–guided focussed crawling approach for the discovery of environmental Web resources providing air quality measurements and forecasts. To this end, it estimates the relevance of a hyperlink to an unvisited resource based on the combination of its local context with global evidence associated with its parent page. Local context refers to the textual content appearing in the vicinity of the hyperlink in the parent page. Motivated by the frequent occurrence of heatmaps in such Web resources, we consider the presence of a heatmap in a parent page as global evidence for its high relevance to the topic.
   An overview of the proposed focussed crawling approach is depicted in Figure 3. First, the seed pages are added to the list of URLs to fetch. In each iteration, a URL is picked from the list and the page corresponding to this URL is fetched (i.e., downloaded) and parsed to extract its hyperlinks. In the simple case that the focussed crawler estimates the relevance of a hyperlink pointing to an unvisited page p based only on its local context, the decision to fetch p depends solely on the output of an appropriately trained text classifier. Therefore, a page is fetched if the confidence score s of the text–based classifier is above an experimentally set threshold t1.

Figure 3: Multimedia focussed crawling.

   However, there are cases in which the local context is not sufficient to effectively represent relevant hyperlinks, leading them to obtain low confidence scores below the set threshold t1, and thus to not be fetched by the focussed crawler. In this case, global evidence can be used for adjusting the estimate of the hyperlink’s relevance. This is motivated by the ‘topical locality’ phenomenon of Web pages linking to other pages that are similar in content; therefore, if there is strong evidence of the parent page’s relevance, then the relevance estimates of its children pages should be adjusted accordingly.
   As mentioned before, the presence of heatmaps in a Web resource already assumed to be an air quality resource is a strong indication that it is indeed highly relevant to the topic. Therefore, we propose the consideration of heatmap presence in the parent page as global evidence to be used for adjusting the relevance estimate of hyperlinks with text–based confidence scores below the required threshold t1 (in practice, a lower bound threshold t2 is also set; this threshold is also experimentally tuned). In particular, the relevance estimate of each hyperlink is adjusted to correspond to the late fusion of a text and a heatmap classifier: score = f(text classifier, heatmap classifier), and the page is fetched if its score ≥ t1. In our case, a binary heatmap classifier is considered and the fusion function f is set to correspond to max. This results in a page being fetched if either its text–based confidence score is above t1 or if its text–based confidence score is above t2 (t2 < t1) and its


parent page contains at least one heatmap. Next, the text and heatmap classifiers employed in this work are described.

3.1    Text–Based Link Classification
   Text–based link classification is performed using a supervised machine learning approach based on SVMs and a variety of textual features extracted from the hyperlink’s local context. SVMs are applied due to their demonstrated effectiveness in similar applications [15].
   Each hyperlink is represented using textual features extracted from the following fields:

   • a: anchor text of the hyperlink,

   • h: the terms extracted from the URL of the hyperlink; string sequences are split at punctuation marks, and common URL extensions (e.g., com) and prefixes (e.g., www) are removed;

   • s: the terms extracted from a text window of 50 characters surrounding the hyperlink; this text window does not contain the anchor text of adjacent links (i.e., the window stops as soon as it encounters another link),

   • so: the terms extracted from a text window of 50 characters surrounding the hyperlink when overlap with the adjacent links is allowed.

Combinations of the above lead to the following five representations corresponding to concatenations of the respective fields: a+s, a+so, a+h, a+h+s, and a+h+so.
   In the training phase, a list of positive and negative samples is collected first, so as to build a vocabulary for representing the samples in the textual feature space and also for training the model. Each sample corresponds to a hyperlink pointing to a Web page on air quality measurements and forecasts and its associated a+so representation. The vocabulary is built by accumulating all the terms from the a+so representations of the samples and eliminating all stopwords. This representation was selected so as to lead to a richer feature space, compared to the sparser a, s, and a+s representations, while also remaining relatively noise free compared to the a+h+s and a+h+so representations, which are likely to contain more noise given the difficulties in successfully parsing URLs.
   Each sample is represented in the textual feature space spanned by the created vocabulary using a tf.idf = tf(t, d) × log(n/df(t)) weighting scheme, where tf(t, d) is the frequency of term t in sample d and idf(t) is the inverse document frequency of term t in the collection of n samples, where df(t) is the number of samples containing that term. Furthermore, a feature representing the number of geographical terms in the sample’s a+so representation is added, given the importance of such terms in the environmental domain [12]. To avoid overestimation of their effect, such geographical terms were previously removed from the vocabulary that was built. The SVM classifier is built using an RBF kernel, and 5–fold cross–validation is performed on the training set to select the class weight parameters.
   In the testing phase, each sample is represented as a feature vector based on the tf.idf of the terms extracted from one of the proposed representation schemes (a, a+s, a+so, a+h, a+h+s, or a+h+so) and the number of geographical terms within the same representation. The text–based classification score of each hyperlink is then obtained by employing the classifier on the feature vector, and corresponds to a confidence value that reflects the distance of the testing sample to the hyperplane.
   Our model was trained using 711 samples (100 positive, 611 negative). Each sample corresponds to a hyperlink pointing to a page providing air quality measurements and forecasts; these hyperlinks were extracted from 26 pages about air quality obtained from ODP and previous empirical studies conducted by domain experts in the context of the project PESCaDO2. It should be noted that both the hyperlinks and their parent pages are different from the seed set used in the evaluation of the focussed crawler (see Section 4). The generated lexicon consists of 207 terms, with the following being the 10 most frequent in the training corpus: days, ozone, air, data, quality, today, forecast, yesterday, raw, and current. The geographical lexicon consists of 3,625 terms obtained from a geographical database.

3.2    Heatmap Recognition
   Heatmap recognition is performed by applying a recently developed approach by our research group [10]. That investigation on heatmap binary classification using SVMs and a variety of visual features indicated that, overall, the MPEG–7 [3] descriptors demonstrated a slightly better performance than the other tested visual features (SIFT [4] and AHDH3 [18]).
   In particular, the following three extracted MPEG–7 features that capture color and texture aspects of human perception were the most effective:

   • Scalable Color Descriptor (SC): a Haar–transform based encoding scheme that measures color distribution over an entire image, quantized uniformly to 256 bins,

   • Edge Histogram Descriptor (EH): a scale invariant visual texture descriptor that captures the spatial distribution of edges; it involves the division of the image into 16 non–overlapping blocks, with edge information calculated for each block in five edge categories, and

   • Homogeneous Texture Descriptor (HT): describing directionality, coarseness, and regularity of patterns in images based on a filter bank approach that employs scale and orientation sensitive filters.

Their early fusion (SC–EH–HT), as well as the feature EH on its own, produced the best results when employing an SVM classifier with an RBF kernel. The evaluation was performed by training the classifier on a dataset of 2,200 images (600 relevant, i.e., heatmaps) and testing it on a dataset of 2,860 images (1,170 heatmaps)4.
   In this work, both the EH and the SC–EH–HT models trained on the first dataset are employed. An image is classified as a heatmap if at least one of these classifiers considers it to be a heatmap, i.e., a late fusion approach based on a logical OR is applied.

2 Personalised Environmental Service Configuration and Delivery Orchestration (http://www.pescado-project.eu/).
3 Adaptive Hierarchical Density Histogram.
4 Both datasets are available at: http://mklab.iti.gr/project/heatmaps.


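Taken together, the fetch decision described in Section 3 (max-fusion of the text classifier score with the binary heatmap signal) reduces to a few lines. The threshold defaults below are placeholders, since t1 and t2 are tuned experimentally; the function name is ours:

```python
def should_fetch(text_score: float, parent_has_heatmap: bool,
                 t1: float = 0.6, t2: float = 0.3) -> bool:
    """Late fusion with f = max over a text classifier score and a binary
    heatmap classifier: fetch if the text score alone clears t1, or if it
    clears the lower bound t2 (t2 < t1) and the parent page contains at
    least one heatmap."""
    if text_score >= t1:
        return True
    return parent_has_heatmap and text_score >= t2
```

The `parent_has_heatmap` flag itself would be the logical OR of the EH and SC–EH–HT classifier outputs, as described in Section 3.2.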
                                                 Table 1: List of seed URLs.
                                                       URL                                     heatmap present
                        1.   http://aircarecolorado.com/
                        2.   http://airnow.gov/                                                      X
                        3.   http://db.eurad.uni-koeln.de/en/                                        X
                        4.   http://gems.ecmwf.int/                                                  X
                        5.   http://maps.co.mecklenburg.nc.us/website/airquality/default.php         X
                        6.   http://uk-air.defra.gov.uk/
                        7.   http://www.baaqmd.gov/The-Air-District.aspx
                        8.   http://www.eea.europa.eu/
                        9.   http://www.gmes-atmosphere.eu/
                       10.   http://www.londonair.org.uk/LondonAir/Default.aspx                      X
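A depth-1 crawl over seeds such as those in Table 1 needs only to fetch each seed page and extract its outgoing hyperlinks. A minimal sketch of the extraction step using Python's standard HTML parser (page fetching and link scoring are omitted; the class and function names are ours):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def outlinks(html: str) -> list:
    """Hyperlinks extracted from the HTML of one downloaded page."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Running `outlinks` over the ten seed pages yields the candidate hyperlinks on which the crawling approaches below are compared.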



4.    EVALUATION
   A pilot study is performed to evaluate the performance of the proposed focussed crawling approach.
   A set of 10 seeds (listed in Table 1) was selected, similarly to before, i.e., using ODP and the outcomes of empirical studies conducted by domain experts in the context of the project PESCaDO; these URLs are different from the ones used when training the classifiers. Half of them contain at least one heatmap. Starting from these 10 seeds, a crawl at depth 1 is performed. A total of 807 hyperlinks are extracted from these 10 seeds, and several focussed crawling approaches are applied for deciding which ones to fetch. These are evaluated in the following two sets of experiments.

4.1    Experiments
   Experiment 1: This experiment examines the relative merits of the different text–based representations of hyperlinks (i.e., a, a+s, a+so, a+h, a+h+s, and a+h+so). In this case, text–based classifier–guided focussed crawling is applied for each representation, and a page is fetched if its text–based confidence score is above a threshold t1. Experiments are performed for t1 values ranging from 0.0 to 0.9 at step 0.1. When t1 = 0.0, the crawl corresponds to a breadth–first search where all hyperlinks are fetched and no focussed crawling is performed.
   Experiment 2: This experiment investigates the effectiveness of incorporating multimedia evidence in the form of heatmaps in the crawling process. In this case, a page pointed to by a hyperlink is fetched if the hyperlink's text–based confidence score is above t1, or if its text–based confidence score is above t2 (t2 < t1) and its parent page contains at least one heatmap. The text–based confidence scores are obtained from the best performing classifier in Experiment 1. Experiments are performed for t1 and t2 values ranging from 0.0 to 0.9 at step 0.1, while maintaining t2 < t1. These experimental results are compared against two baselines: (i) the results of the corresponding text–based focussed crawler for threshold t1, and (ii) the results of the corresponding text–based focussed crawler for threshold t2.
   To determine the presence of a heatmap in the parent page of a hyperlink, the page is parsed (since it is already downloaded) and the hyperlinks pointing to images are compiled into a list. The crawler iteratively downloads each image in the list, extracts its visual features, and applies the heatmap classification until a heatmap is recognised or a maximum number of images is downloaded from each page (set to 20 in our experiments).
   In both experiments, when a hyperlink appears more than once within a seed page, only the one with the highest score is taken into consideration for evaluation purposes.

4.2    Performance Metrics
   The standard retrieval evaluation metrics of precision and recall are typically applied for assessing the effectiveness of a focussed crawler. Precision corresponds to the proportion of fetched pages that are relevant, and recall to the proportion of all relevant pages that are fetched. The latter requires knowledge of all relevant pages on a given topic, an impossible task in the context of the Web. To address this limitation, two recall–oriented evaluation techniques have been proposed [13]: (i) manually designate a few representative pages on the topic and measure what fraction of them are discovered by the crawler, and (ii) measure the overlap among independent crawls initiated from different seeds to see whether they converge on the same set of pages. Given the small scope of our study (i.e., a crawl at depth 1), these approaches are not applicable and therefore recall is not considered in our evaluation. In addition to precision, the accuracy of the classification of the crawled outlinks is also reported.

4.3    Relevance Assessments
   All 807 extracted hyperlinks were manually assessed. After applying some light URL normalisation (e.g., deleting trailing slashes) and removing duplicates, 689 unique URLs remain. These correspond both to internal (within–site) and to external links that were assessed using the following three–point relevance scale:

   • (highly) relevant: Web resources that provide air quality measurements and forecasts. These data should either be visible on the page or should appear after selecting a particular value from options (e.g., region, pollutant, time of day, etc.) provided by drop–down menus.

   • partially relevant: Web resources that are about air quality measurements and forecasts, but do not provide actual data. Examples include Web resources that list monitoring sites and the pollutants being measured, explain what such measurements mean, describe methods, approaches, and research for measuring, validating, and forecasting air quality data, or provide links to components, systems, and applications that measure air quality.

   • non–relevant: Web resources that are not relevant to air quality measurements and forecasts, including resources that are about air quality and pollution in general, discussing, for instance, its causes and effects.

Overall, our crawled dataset contains 232 (33.7%) highly relevant pages, 51 (7.4%) partially relevant, and 406 (58.9%) non–relevant ones.


Figure 4: Precision and accuracy of the focussed crawl for each text–based link classification method (a+h,
a+s, a+so, a+h+s, a+h+so) for threshold t1 ∈ {0, 0.1, ..., 0.9} when strict relevance assessments are employed.
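The precision and accuracy reported in these figures can be computed as sketched below. This is an illustrative restatement of the metrics in Section 4.2, not the evaluation code itself: each extracted hyperlink is paired with its classifier confidence score and a (hypothetical) binary relevance label, and a link is fetched when its score reaches t1.

```python
def crawl_metrics(links, t1):
    """Evaluate a classifier-guided crawl at threshold t1.

    `links` is a list of (confidence_score, is_relevant) pairs, one per
    extracted hyperlink.  Precision is computed over the fetched links
    only, while accuracy treats the fetch decision as a binary
    classification of all extracted links."""
    fetched = [(s, rel) for s, rel in links if s >= t1]
    precision = (sum(rel for _, rel in fetched) / len(fetched)) if fetched else 0.0
    correct = sum(1 for s, rel in links if (s >= t1) == rel)
    accuracy = correct / len(links)
    return precision, accuracy
```

With t1 = 0.0 every link is fetched, so precision degenerates to the fraction of relevant links in the whole crawl, matching the breadth–first baseline of Experiment 1.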
Figure 5: Precision and accuracy of the focussed crawl for each text–based link classification method (a+h,
a+s, a+so, a+h+s, a+h+so) for threshold t1 ∈ {0, 0.1, ..., 0.9} when lenient relevance assessments are employed.
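The heatmap check described in Section 4.1 (download the parent page's images one by one, stopping at the first recognised heatmap or after 20 downloads) can be sketched as follows. `download` and `classify` are hypothetical stand-ins for the image fetcher and the feature-extraction plus LIBSVM-based visual classifier; the function name and signature are ours.

```python
def page_contains_heatmap(image_urls, download, classify, max_images=20):
    """Check a parent page for at least one heatmap.

    `image_urls` are the image links compiled while parsing the page.
    Images are processed one at a time and the loop stops at the first
    recognised heatmap, or after `max_images` downloads."""
    for url in image_urls[:max_images]:
        image = download(url)
        if classify(image):
            return True
    return False
```

The early exit is what keeps the average cost at about 8 image downloads per seed in our pilot study, rather than the full image list.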


   A closer inspection revealed that 162 (69.8%) of the highly relevant pages were all crawled from seed no. 2 in Table 1 (http://airnow.gov/). These correspond to internal links pointing to pages with air quality measurements/forecasts, each regarding a different U.S. region. This, in conjunction with the fact that all these links obtained very high scores (over 0.9) from our text classifier, led us to remove them from further consideration, as they would significantly skew the evaluation results. Therefore, the evaluation was performed only for the pages crawled from the nine remaining seeds, and these are the results reported in Section 5. (It should be noted that http://airnow.gov/ appears in the list of our crawled pages even when removed from the seed list, since it is linked from other seed pages. However, since crawling is performed at depth 1, its own outlinks are not considered any further.) Starting from the 9 seeds, our crawled dataset contains 526 URLs: 70 (13.3%) highly relevant pages, 50 (9.5%) partially relevant, and 406 (77.2%) non–relevant ones.
   To apply the performance metrics presented above, these multiple grade relevance assessments are mapped into binary relevance judgements in two different ways, depending on whether we are strictly interested in discovering resources containing air quality data, or whether we would also be interested in information about air quality measurements and forecasts. In particular, two mappings are considered:

   • strict: when considering only highly relevant Web resources as relevant and the rest (partially relevant and non–relevant) as non–relevant, and

   • lenient: when considering both highly relevant and partially relevant Web resources as relevant.

The distributions of relevance assessments in these two cases are listed in Table 2.

Table 2: Relevance assessment distributions when the 3–point scale judgements are mapped to binary.

                      Strict          Lenient
        Relevant      70  (13.3%)     120 (22.8%)
    Non–Relevant      456 (86.7%)     406 (77.2%)
             All      526 (100.0%)    526 (100.0%)

4.4    Implementation
   Our implementation is based on Apache Nutch (http://nutch.apache.org/), a highly extensible and scalable open source Web crawler software project. To convert it to a focussed crawler, its parser was modified so as to filter the links being fetched based on our proposed approach. The text–based classifier was implemented using the libraries of the Weka machine learning software (http://www.cs.waikato.ac.nz/ml/weka/), while the implementation of the visual classifier was based on the LIBSVM [2] library.

5.   RESULTS
   Experiment 1: The results of this first experiment, which evaluates the effectiveness of the different textual represen-


Table 3: Precision of the focussed crawler that combines the a+s text–based link classifier with the heatmap
classifier for thresholds t1 ∈ {0.1, ..., 0.9} and t2 ∈ {0, 0.1, ..., 0.8} when strict relevance assessments are employed.
                                                                      t2                                            Text–based baseline a+s
               t1
                                 0.0      0.1      0.2      0.3      0.4          0.5      0.6     0.7      0.8        (fetch if s >= t1 )
                0.1             0.215                                                                                        0.294
                0.2             0.213    0.314                                                                               0.296
                0.3             0.206    0.299    0.284                                                                      0.292
                0.4             0.214    0.346    0.340    0.354                                                             0.327
                0.5             0.215    0.353    0.347    0.362    0.333                                                    0.333
                0.6             0.214    0.362    0.356    0.372    0.341     0.333                                          0.324
                0.7             0.222    0.405    0.400    0.421    0.385     0.382       0.364                              0.310
                0.8             0.222    0.405    0.400    0.421    0.385     0.382       0.364   0.310                      0.308
                0.9             0.221    0.421    0.417    0.441    0.400     0.400       0.379   0.320   0.318              0.300
     Text–based baseline a+s
                                0.137    0.294    0.296    0.292    0.327      0.333      0.324   0.310    0.308
        (fetch if s >= t2 )



Table 4: Precision of the focussed crawler that combines the a+s text–based link classifier with the heatmap
classifier for thresholds t1 ∈ {0.1, ..., 0.9} and t2 ∈ {0, 0.1, ..., 0.8} when lenient relevance assessments are employed.
                                                                     t2                                            Text–based baseline a+s
                t1
                                  0.0      0.1      0.2      0.3      0.4          0.5     0.6     0.7     0.8        (fetch if s >= t1 )
                 0.1             0.360                                                                                      0.518
                 0.2             0.360    0.571                                                                             0.549
                 0.3             0.350    0.552    0.537                                                                    0.554
                 0.4             0.352    0.615    0.620    0.646                                                           0.612
                 0.5             0.347    0.608    0.612    0.638    0.604                                                  0.548
                 0.6             0.343    0.617    0.622    0.651    0.614        0.538                                     0.541
                 0.7             0.333    0.619    0.625    0.658    0.615        0.529   0.515                             0.483
                 0.8             0.333    0.619    0.625    0.658    0.615        0.529   0.515   0.483                     0.500
                 0.9             0.328    0.632    0.639    0.676    0.629        0.533   0.517   0.480   0.500             0.500
      Text–based baseline a+s
                                 0.217    0.518    0.549    0.554    0.612        0.548   0.541   0.483   0.500
         (fetch if s >= t2 )



tations employed by the text–based focussed crawler are depicted in Figures 4 and 5, when applying strict and lenient relevance assessments, respectively.
   The a+s classifier–guided focussed crawler achieves the highest overall precision, both for the strict and the lenient cases, and for t1 = 0.4, indicating the benefits of combining the anchor text with the terms obtained from a non–overlapping text window. It also achieves the highest accuracy, which is equal to that of the a+h and a+h+s classifiers; these two classifiers, though, have slightly lower precision than a+s. This indicates that the URL is potentially a useful source of evidence and that the application of more advanced techniques for extracting terms from a URL is probably required for reaching its full potential. The a+so and a+h+so classifiers are the least effective for lower t1 values, indicating that the additional terms present in the overlapping text window introduce noise that leads to the misclassification of non–relevant hyperlinks. Furthermore, all focussed crawlers improve upon the precision for t1 = 0.0, which corresponds to general–purpose crawling. As expected, the absolute values of precision are much higher in the lenient case than in the strict one.
   Experiment 2: The second experiment aims to provide insights into the feasibility and potential benefits of incorporating multimedia in the form of heatmaps in the crawling process. To this end, it combines a+s, the best performing text–based classifier from the first experiment, with results from the heatmap classifier. First, the results of the heatmap classification are presented.
   Each of the nine seeds contains 15 images on average, as identified by our parser. On average, 8 images are downloaded from each seed before a heatmap is found or the image list ends. Out of the 75 downloaded images, 74 were correctly classified, with 3 being heatmaps. This means that 8 of the 9 seeds were classified accurately for the presence of heatmaps in them (all apart from seed no. 10 in Table 1). This is probably due to the difficulty in parsing the specific Web resource and also in recognising its images as heatmaps, as they correspond to non–typical heatmaps, different from the ones in our training set. On average, 10 seconds were required per Web resource for the downloading, feature extraction, and classification of its images; however, this overhead could be reduced by applying parallelisation.
   Tables 3 and 4 present the results of the second experiment, when applying strict and lenient relevance assessments, respectively, for t1 and t2 values ranging from 0.0 to 0.9 at step 0.1, while maintaining t2 < t1. The results are compared against the two baselines listed in the tables' last column and last row, respectively. The values in bold correspond to improvements over both baselines.
   The observed substantial improvements for multiple threshold values provide an indication of the benefits of incorporating visual evidence as global evidence in a focussed crawler. Consider the best performing classifier when strict relevance assessments are employed: it achieves a precision of 0.44 for t1 = 0.9 and t2 = 0.3, while the text–based focussed crawler for the same t1 = 0.9 achieves a precision of 0.30. An examination of the results shows that the improvements are due to the fact that 65% of the newly added hyperlinks, i.e., those with a text–based classification score between 0.3 and 0.9, are relevant.

6.   CONCLUSIONS
   This work proposed a novel classifier–guided focussed crawling approach for the discovery of environmental Web resources providing air quality measurements and forecasts that combines multimedia (textual + visual) evidence for predicting the benefit of fetching an unvisited Web resource.


The results of our pilot study provide a first indication of the effectiveness of incorporating visual evidence in the focussed crawling process over the use of textual features alone.
   Large–scale experiments are currently planned for fully assessing the potential benefits of the proposed multimedia focussed crawling approach, including experiments for improving the effectiveness of the textual classification by also taking into account the textual content of the entire parent page, similar to previous research [16]. Further future work includes the consideration of other types of images common in environmental Web resources, such as diagrams, simple filtering mechanisms for removing, prior to classification, small–size images that are unlikely to contain useful information (e.g., logos and layout elements), and the incorporation of additional local evidence, such as the distance of the hyperlink to the heatmap image. Finally, we aim to investigate the application of the proposed focussed crawler in other domains where information is commonly encoded in multimedia form, such as food recipes.

7.   ACKNOWLEDGMENTS
   This work was supported by the MULTISENSOR (contract no. FP7–610411) and HOMER (contract no. FP7–312388) projects, partially funded by the European Commission.

8.   REFERENCES
 [1] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: A new approach to topic-specific web resource discovery. In Proceedings of the 8th International Conference on World Wide Web (WWW 1999), pages 1623–1640, 1999.
 [2] C. C. Chang and C. J. Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
 [3] S. F. Chang, T. Sikora, and A. Puri. Overview of the MPEG-7 standard. IEEE Transactions on Circuits and Systems for Video Technology, 11(6):688–695, 2001.
 [4] K. Chatfield, V. S. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: An evaluation of recent feature encoding methods. In Proceedings of the British Machine Vision Conference (BMVC 2011), pages 1–12, 2011.
 [5] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks, 30(1-7):161–172, 1998.
 [6] B. D. Davison. Topical locality in the web. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), pages 272–279, 2000.
 [7] P. De Bra and R. D. J. Post. Information retrieval in the world-wide web: Making client-based searching feasible. Computer Networks and ISDN Systems, 27(2):183–192, 1994.
 [8] V. Epitropou, K. Karatzas, and A. Bassoukos. A method for the inverse reconstruction of environmental data applicable at the chemical weather portal. In Proceedings of the GI-Forum Symposium and Exhibit on Applied Geoinformatics, pages 58–68, 2010.
 [9] K. Karatzas and N. Moussiopoulos. Urban air quality management and information systems in Europe: Legal framework and information access. Journal of Environmental Assessment Policy and Management, 2(02):263–272, 2000.
[10] A. Moumtzidou, S. Vrochidis, E. Chatzilari, and I. Kompatsiaris. Discovery of environmental nodes based on heatmap recognition. In Proceedings of the 20th IEEE International Conference on Image Processing (ICIP 2013), 2013.
[11] A. Moumtzidou, S. Vrochidis, and I. Kompatsiaris. Discovery, analysis and retrieval of multimodal environmental information. In Encyclopedia of Information Science and Technology (in press). IGI Global, 2013.
[12] A. Moumtzidou, S. Vrochidis, S. Tonelli, I. Kompatsiaris, and E. Pianta. Discovery of environmental nodes in the web. In Multidisciplinary Information Retrieval, Proceedings of the 5th International Retrieval Facility Conference (IRFC 2012), volume 7356 of LNCS, pages 58–72, 2012.
[13] C. Olston and M. Najork. Web crawling. Foundations and Trends in Information Retrieval, 4(3):175–246, 2010.
[14] S. Oyama, T. Kokubo, and T. Ishida. Domain-specific web search with keyword spices. IEEE Transactions on Knowledge and Data Engineering, 16(1):17–27, Jan. 2004.
[15] G. Pant and P. Srinivasan. Learning to crawl: Comparing classification schemes. ACM Transactions on Information Systems, 23(4):430–462, 2005.
[16] G. Pant and P. Srinivasan. Link contexts in classifier-guided topical crawlers. IEEE Transactions on Knowledge and Data Engineering, 18(1):107–122, 2006.
[17] R. San José, A. Baklanov, R. Sokhi, K. Karatzas, and J. Pérez. Computational air quality modelling. Developments in Integrated Environmental Assessment, 3:247–267, 2008.
[18] P. Sidiropoulos, S. Vrochidis, and I. Kompatsiaris. Content-based binary image retrieval using the adaptive hierarchical density histogram. Pattern Recognition, 44(4):739–750, 2011.
[19] T. T. Tang, D. Hawking, N. Craswell, and K. Griffiths. Focused crawling for both topical relevance and quality of medical information. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM 2005), pages 147–154, 2005.
[20] T. T. Tang, D. Hawking, N. Craswell, and R. S. Sankaranarayana. Focused crawling in depression portal search: A feasibility study. In Proceedings of the 9th Australasian Document Computing Symposium (ADCS 2004), pages 1–9, 2004.