<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Designing A General Deep Web Harvester by Harvestability Factor</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohammadreza Khelghati</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maurice van Keulen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Djoerd Hiemstra</string-name>
          <email>d.hiemstra@utwente.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Databases Group, University of Twente</institution>
          ,
          <addr-line>Netherlands</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>To make deep web data accessible, harvesters have a crucial role. Targeting different domains and websites enhances the need for a general-purpose harvester which can be applied to different settings and situations. To develop such a harvester, a large number of issues should be addressed. To have all influential elements in one big picture, a new concept, called the harvestability factor (HF), is introduced in this paper. The HF is defined as an attribute of a website (HF_W) or a harvester (HF_H) representing the extent to which the website can be harvested or the harvester can harvest. The comprising elements of these factors are different websites' or harvesters' features. These elements are gathered from the literature or introduced through the authors' experiments. In addition to enabling designers to evaluate where their products stand from the harvesting perspective, the HF can act as a framework for designing harvesters: designers can define the list of features and prioritize their implementations. To validate the effectiveness of the HF in practice, it is shown how the HF's elements can be applied in categorizing deep websites and how this is useful in designing a harvester. To validate the HF_H as an evaluation metric, it is shown how it can be calculated for the harvester implemented by the authors. The results show that the developed harvester works well for the targeted test set, with a score of 14.783 out of 15.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Web</kwd>
        <kwd>Deep Web Harvester</kwd>
        <kwd>Harvester Performance Evaluation</kwd>
        <kwd>Harvestability Factor</kwd>
        <kwd>Harvester Design Framework</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Nowadays, in an information-thirsty environment, the deep web concept receives
lots of attention. The content hidden behind web forms, which is invisible or
hidden to general search engines like Google or Yahoo, is defined as the deep web
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (also known as the hidden or invisible web [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). Whether the goal of an access
approach is indexing more representative content of a website (referred to as
the Surfacing approach [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) or extracting the whole content, harvesters have a crucial role.
Covering different domains and websites increases the need for a
general-purpose harvester which can be applied to different settings and situations. To
develop such a harvester, a number of issues like the business domain, the targeted
websites, and the harvesting goals should be considered. Different business domains
and goals could impose diverse requirements on deep web access approaches. In
some domains, a few big databases are the main sources of data, while
in others, data is scattered over a large number of websites. The latter makes
it more desirable to have an approach that needs no extra configuration, or at
least minimal configuration effort, for each website. The goal of the harvesting
task is also important [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. If the goal is to extract all the data and the harvester
downloads it only partially, the harvesting task is not considered successful. However,
this might be a success story if the goal is just to obtain a representative set of
data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In addition to the domain and harvesting goal, the features of deep websites
can have a great impact on a deep web access approach. Different website
features, from the graphical interface to back-end design and development techniques,
could play an important role. Whether a website is built with Flash [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], an Applet, or as simple
HTML pages makes a big difference to the design of the access approach. Without
a well-defined list of elements affecting harvesting tasks, having a general deep
web access approach seems far from reach.
      </p>
      <p>Contributions As the main contribution, a new concept, called the Harvestability
Factor (HF), is introduced. This concept enables website and harvester
designers to evaluate where their products stand from the harvesting point of view.
Through this concept, we also put all the important elements in harvesting deep
websites in one big picture. This defines a framework for designing general deep
web harvesters, trying to cover all the relevant features from the different aspects of designing
a harvester. Some of these factors are mentioned in the literature and the others
are discovered through the experiments by the authors. Having all these
important features in one big picture, we evaluate the influence of each of them on
harvesting. This helps in creating an implementation strategy for a harvester:
defining the importance of each feature helps in prioritizing feature implementations.
Sections In Section 2, the harvestability factor (HF) is introduced. Section
4 introduces the elements of HF_W, which are also applied for categorizing deep
websites. In this section, all the features of deep websites affecting the harvesting
process are introduced and deep websites are categorized accordingly. In Section
5, the HF elements for a harvester are defined. All the requirements for designing
a general-purpose deep web harvester, from general requirements to detailed ones,
are also discussed. The different approaches applied in the literature to meet these
requirements are also explored. Having mentioned all the necessary requirements,
in Section 6, as a sample of such a general deep web harvester, the harvester
designed by the authors of this paper is introduced, and the HF is validated both as a design
framework and as an evaluation metric. Finally, in Section 7, the
conclusions drawn from this work are discussed and future work is suggested.</p>
    </sec>
    <sec id="sec-2">
      <title>Harvestability Factor</title>
      <p>To formalize all important issues in accessing deep web data, a new factor is
introduced in this paper, called the Harvestability Factor (HF). Although, in a harvesting
process, the roles of the harvester and the website are intertwined, separate
definitions are required by website and harvester designers for a better understanding of
harvesting processes. Hence, the HF is defined as an attribute of a website or a
harvester representing the extent to which the website can be harvested (HF_W)
or the harvester can harvest (HF_H).</p>
      <p>As shown in Formula 2.1, the HF_H of a given harvester (h) is defined
as a sum over the websites' features discussed in Section 4: for each feature, the
harvester's failure percentage on that feature (denoted hFa_f) is multiplied
by the importance of the feature, and the result is subtracted from one. In this formula, Cr_f and Co_f together represent
the importance of a feature: n is the number of features and k is the
number of general features. Cr_f represents how critical the feature is for
harvesting and Co_f represents how often the feature occurs in the targeted
domain. General requirements are represented by GF_j. This is
discussed further in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>Related Work</title>
      <p>
        In this paper, two issues are targeted by introducing the HF: the HF as a harvester
design framework and the HF as an evaluation metric for websites and harvesters.
Since the introduction of the deep web, there have been several attempts to give access
to this part of the web and to improve the existing approaches [
        <xref ref-type="bibr" rid="ref3 ref5 ref6 ref7 ref8 ref9">5,6,7,8,9,10,3</xref>
        ]. In all
these approaches, the focus is on harvesters rather than on the websites. They try
to improve harvester performance by applying new techniques. Although this
is essential, it is not enough. In this paper, it is argued that improving the
efficiency, scalability, robustness, and other requirements of harvesters is not
possible without having all the affecting factors in one big picture. Introducing
the HF is the first step in this direction. The HF helps to study not only the
harvesters but also the targeted domain and the websites' features in the design
process.
      </p>
      <p>HF_H(h) = sum_{i=1}^{n} (1 - (hFa_{f_i} * Cr_{f_i} * Co_{f_i})) + sum_{j=1}^{k} hGF_j    (2.1)</p>
      <p>In Formula 2.2, the HF_W is defined for a website by considering its features
discussed in Section 4. In this formula, given a website, for each one of its
features, the average performance of harvesters and the importance of the feature
are multiplied. n is the number of features and m is the number of harvesters
considered, which can also be one. In this formula, wp_{f_i} represents the absence
or presence of the feature in the website.</p>
      <p>HF_W(w) = sum_{i=1}^{n} (1 - (wp_{f_i} * (1/m * sum_{j=1}^{m} Fa_{f_j}) * Cr_{f_i}))    (2.2)</p>
      <p>Assigning accurate values to the weights and features mentioned in these
two formulas is beyond the scope of this paper and is considered future
work. However, in Section 6, using simple approaches to assign values to these
parameters, it is shown how these formulas can help in evaluating harvesters
and websites. In this paper, we try to cover all aspects of the introduced
HF elements (business domain, harvesting goal, harvester features and website
features) to give a harvester design guideline.</p>
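      <p>As an illustration only (not code from the paper), the two formulas above could be computed as follows; the function names, feature lists, and weight values are assumptions made for this sketch:</p>
      <preformat>
```python
# Illustrative sketch of Formulas 2.1 and 2.2.
# All feature names and weights below are hypothetical.

def hf_harvester(failure, cr, co, general):
    """HF_H(h) = sum_i (1 - Fa_i * Cr_i * Co_i) + sum_j GF_j."""
    per_feature = sum(1 - fa * c * o for fa, c, o in zip(failure, cr, co))
    return per_feature + sum(general)

def hf_website(failures_per_harvester, presence, cr):
    """HF_W(w) = sum_i (1 - wp_i * mean_j(Fa_ij) * Cr_i)."""
    m = len(failures_per_harvester)
    total = 0.0
    for i, (wp, c) in enumerate(zip(presence, cr)):
        mean_failure = sum(h[i] for h in failures_per_harvester) / m
        total += 1 - wp * mean_failure * c
    return total

# Example: 3 features and 1 general requirement (all values assumed).
print(hf_harvester(failure=[0.0, 0.5, 1.0], cr=[1.0, 0.8, 0.5],
                   co=[1.0, 1.0, 0.2], general=[1.0]))  # 3.5
```
      </preformat>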
      <p>Evaluate/Compare harvesters Little work has been done on
comparing and analyzing web harvesting tools. The related studies in the literature, such as
the work in [11,12], focus mainly on a limited number of aspects, such as the
capability of dealing with different data formats, the capability to record the extracted
data, user friendliness, market price, export formats, and the ability to manage
anonymous scraping and multi-threading. However, in this study, in addition
to these features, a more detailed set of features is introduced. Unlike other
works, this paper provides a mechanism to produce an evaluation number for
each harvester, considering a wide range of features from general requirements
to detailed capabilities.</p>
    </sec>
    <sec id="sec-4">
      <title>Elements of Website Harvestability Factor</title>
      <p>In this section, different features of websites which are related to harvesting
processes are studied. The role of each feature as a defining element of a website's HF
is also mentioned. Each of these features can also be applied for categorizing
deep websites from the harvesting perspective. In the extended version of this
paper [13], more detailed descriptions of these elements can be found.</p>
      <sec id="sec-4-1">
        <title>Web Development Techniques</title>
        <p>A number of techniques applied in developing and designing websites and web
pages create challenges for harvesters. These techniques are usually applied to
add interactivity to web pages as well as to improve site navigation. In the
following, a list of such techniques is given.</p>
        <p>
          Embedded Scripting Languages in HTML Pages Embedded scripts in
HTML pages can show or hide content in layers based on a user
action or a change in state. They can even build HTTP requests to fill out and
submit a form dynamically. Managing HTML layers, performing redirections,
dynamically generating navigation elements like pop-up menus, and creating hidden
anchors are some of the issues which can be caused by client-side scripts [
          <xref ref-type="bibr" rid="ref4">4,14</xref>
          ].
This prevents harvesters from seeing the page as it is shown to the user.
Session Management Mechanisms With session management mechanisms
(see http://en.wikipedia.org/wiki/Session_(computer_science)), the server keeps track of the transactions made with a client. Based on this history and
information on client resources, the server can provide different services to the
client. For harvesters, this creates problems in later access to documents or in
distributed crawling [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], as the client environment changes or the session expires.
        </p>
      </sec>
      <sec id="sec-4-2">
        <p>
          Complex URL Redirections For reasons such as similar or moved domains,
manipulating search engines or visitors, or URL shortening, URLs are redirected.
This means that a different response is given to the browser's request, which results in
the browser showing a different page. These redirections can happen automatically
or manually. It is easier for harvesters to deal with redirections handled on the server
side, unless it is a redirection loop, which does not load any page at the end, or a
redirect chain, which may take a longer time before the final page is returned
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Handling the redirections initiated by scripts embedded in the page content is a
completely different story. The refresh meta tag in HTML, JavaScript redirections,
and frame redirections are examples of these redirections.
        </p>
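        <p>As a minimal sketch of detecting one of the client-side redirections mentioned above (the refresh meta tag), the following could be used; the regular expression is a simplified assumption and does not cover every way such redirects can be written:</p>
        <preformat>
```python
# Sketch: detect a client-side redirect declared via an HTML meta
# refresh tag. Simplified: assumes http-equiv appears before content.
import re

META_REFRESH = re.compile(
    r'<meta[^>]*http-equiv=["\']?refresh["\']?[^>]*'
    r'content=["\']?\s*\d+\s*;\s*url=([^"\'>]+)', re.IGNORECASE)

def meta_refresh_target(html):
    """Return the redirect target URL, or None if no meta refresh found."""
    match = META_REFRESH.search(html)
    return match.group(1).strip() if match else None

page = ('<html><head><meta http-equiv="refresh" '
        'content="0; url=http://example.com/next"></head></html>')
print(meta_refresh_target(page))  # http://example.com/next
```
        </preformat>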
        <p>Applets or Flash code If Flash or an Applet is used for designing the whole website,
it is almost impossible for harvesters to access its content without running
expensive analysis over each item. Nowadays, website designers avoid these practices
in order to make sure their sites are on good terms with crawlers. If only the
welcome page is designed in Flash or as an Applet, it becomes easier for harvesters.
If they are used for advertisements, they can be ignored.</p>
        <p>Frames There are also some issues, such as frames, which can create
difficulties for harvesting processes. Detecting the right frame containing the page
content in a multi-frame page is one of the problems created by such issues.
HTML Coding Practices For harvesters which rely on tags, attributes,
and also presentation features, HTML coding practices become highly important.
Badly written HTML code (like unclosed tags) may cause problems in
analyzing the page's HTML tree and therefore the inability of the harvester to extract
data. A lack of well-defined id, class, and other explanatory attributes for items
can also create difficulties for harvesters and make them prone to mistakes.
Being consistent in coding practices for all pages and data items is also
important. For example, if there are IDs for items, this should be the case for all of them,
or at least for a defined set of items (like different categories). In some cases, data
from the same category, even with the same presentation template, have small
differences in the HTML code behind them. This can mislead harvesters.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Website Policies</title>
        <p>Search Policies
Query Interfaces There are a number of different web interfaces, classified as
keyword-based, form-like, browsing, and combinations of them [15]. Each one
of these interfaces creates a different set of requirements for a harvester. For
example, in a form-like search interface, information on attribute-value bindings
or access to predefined lists of values for attributes can be of great help for
harvesters in deciding which queries to send to the search engine. Detecting
interfaces and recognizing different features of web forms can also help harvesters.
Knowing that the query interface provides different search options, like searching by
keyword, industry domain, region, or time, helps the harvester to act efficiently.
Indexing Policies If a website has a search feature, it becomes
important to know about its indexing policies. For example, with stop-words
indexed in a website, sending a stop-word query is one of the most reliable
options for getting a response. Also, if there is no limitation on browsing through
search results, sending only one stop-word results in a high coverage. In addition
to indexing policies regarding stop-words, it is important to know which parts of the
data are indexed. For example, having only titles indexed makes a great difference
for defining the next queries, compared with having the whole text of the detailed pages indexed. This
is the case in generating queries based on the most frequent words in visited pages.
Search Queries and Algorithms In response to a query posed to a search engine,
websites do not necessarily follow the same principles. In some cases, stop-words
are removed from search queries, query phrases are treated in different ways
(considered as an AND phrase or an OR phrase), or the number of returned results shown
to the user is limited. There can even be differences in the additional information
provided in reply to a query, such as statistics on the search results and the number of
related answers found. There are also websites which put a limitation on the
number of queries a client can send.</p>
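        <p>The query-generation idea mentioned above (picking the next query from the most frequent words in visited pages) can be sketched as follows; the stop-word list and variable names are assumptions for this example:</p>
        <preformat>
```python
# Sketch: choose the next query as the most frequent not-yet-submitted
# non-stop-word in the pages harvested so far.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in"}

def next_query(harvested_pages, already_sent):
    """Return the most frequent unseen non-stop-word, or None."""
    counts = Counter()
    for text in harvested_pages:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS and word not in already_sent:
                counts[word] += 1
    return counts.most_common(1)[0][0] if counts else None

pages = ["senior java developer in Amsterdam",
         "java engineer and python developer"]
print(next_query(pages, already_sent={"developer"}))  # java
```
        </preformat>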
        <p>Navigation On most websites, a query is sent, search results are displayed, and
by following each one of those returned results, a detailed page is reached.
However, there are situations in which this is not the case. On some websites,
in response to a query, a list of categories related to that query is displayed.
Following each one of those categories might end up in another subcategory.
This makes it difficult for the harvester to tell whether a returned page is a category page
or actually a detailed page.</p>
        <p>Security, Privacy and Legal Policies Answering this question should be
one of the first steps in a harvesting process: is it legal to access the data, store it,
and present it to users? It is also important to note whether login is required
by the website to access the data. Considering the website's terms of service to follow its
privacy policies is also important. Some websites apply the Robots Exclusion
Protocol, which gives instructions about the site to web robots in a file
named robots.txt. If such a file exists, then depending on how
strictly it must be followed, the necessary concerns should be taken into account. Not
all websites welcome bots (harvesters, or crawlers) with open arms. Having
recognized bots through traffic monitoring, bot identity declaration, or real-person
declaration techniques like a CAPTCHA, websites can use various measures
to stop or slow them down. Blocking an IP address, disabling a web service API,
commercial anti-bot services, or using application firewalls are some of these
measures. It is also important to note other privacy policies of the website, like the
policy on disclosing aggregate information for analytical purposes by the owners of the
website.</p>
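        <p>Checking the Robots Exclusion Protocol mentioned above can be done with Python's standard urllib.robotparser; the robots.txt content below is a made-up example:</p>
        <preformat>
```python
# Sketch: consult a robots.txt policy before harvesting a URL.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("my-harvester", "http://example.com/jobs"))       # True
print(parser.can_fetch("my-harvester", "http://example.com/private/x"))  # False
```
        </preformat>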
      </sec>
      <sec id="sec-4-4">
        <title>Data and Content</title>
        <p>
          Type and Format of Residing Data in Data Sources The content of a
deep website can be categorized into two different groups [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: structured data,
found in almost all shopping websites (products as entities), movie sites, and job
listings, and unstructured data, like articles and papers. Each of these
data types has different features which could be helpful for harvester
performance. For example, on a website representing structured data, using the
features of a data item, like a company name, could help in defining the next queries,
resulting in a more efficient crawl. It is also of great importance for harvesters
to know about different data file formats, such as PDF, image, or video files. Different
data formats need different handlers to download them.
        </p>
        <p>Data Layout How data is represented in web pages affects harvesters
relying on the presentation features of the data. Different data types in a website can
be presented in different ways. Even data items of the same category can be
presented differently based on their features. If these differences in presentation are
not known to the harvester, it will use the same algorithm to extract all data. This
might result in extracting no information or undesired information. Structural variations
in data presentation must also be tolerated by harvesters and treated
accordingly. Whether the data is represented in a structured way, like lists or tables, in
plain text, or in a combination of both, the harvester should treat it
differently. It also matters whether a data item has fields represented as nested data
on pages, for example, comments or scores. This poses different
requirements on the extraction and storage of information.</p>
        <p>Data Type Formats Including ontologies and text patterns in the process of
extracting data from detailed pages makes it important to investigate how they
can affect the harvesting process. Committing to one ontology and following the same
patterns for the same concepts, like a dd-mm-yyyy format for all dates mentioned on
the website, can affect the configuration and design of the harvester. For
example, if the address format is not the same for all
addresses mentioned on the website, this can have a great effect on the harvester's
configuration.</p>
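        <p>Handling one concrete data type format, the dd-mm-yyyy date pattern mentioned above, can be sketched as follows; a real harvester would need one such rule per pattern used on the site:</p>
        <preformat>
```python
# Sketch: normalize dates written as dd-mm-yyyy to ISO yyyy-mm-dd.
import re

DDMMYYYY = re.compile(r"\b(\d{2})-(\d{2})-(\d{4})\b")

def normalize_dates(text):
    """Rewrite every dd-mm-yyyy date in the text as yyyy-mm-dd."""
    return DDMMYYYY.sub(
        lambda m: f"{m.group(3)}-{m.group(2)}-{m.group(1)}", text)

print(normalize_dates("posted 05-11-2013, expires 20-12-2013"))
# posted 2013-11-05, expires 2013-12-20
```
        </preformat>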
        <p>Information of a Data Item is Scattered over Different Pages Usually,
queries are sent to the search engine, the returned results are followed, and data
about the desired items is extracted. However, this is not always the case. In some
cases, the data of an interesting data item is scattered over the website. Most commonly,
general data is presented on the page navigated to through the search results,
while more detailed information is provided behind other links which are
accessible (only) through this detailed page (you need to go to the detailed page
and then browse through its tabs or links to access the information you want).
Finding these links and extracting information from them can be a challenge
for harvesters.</p>
        <p>Providing Semantic Annotations (Metadata) The pages may include
metadata or semantic markup and annotations. The annotations might be
embedded in the pages or organized into a semantic layer [16] stored and managed
separately from the web pages. The data schema and instructions from this layer can
be retrieved by harvesters before scraping the pages.</p>
        <p>Website Content Language Dealing with the language of the targeted
website is one of the abilities that harvesters should have. Some of the approaches
applied in harvesters are based on parsing the content of web pages, for example using data
patterns. With this in mind, it should be noted that dealing with the Chinese
language needs different configurations than English or Farsi. Having
different languages in the targeted websites will cause difficulties for these
harvesters.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Elements of Harvester Harvestability Factor</title>
      <p>As mentioned in the Introduction Section, designing a deep web access approach
is highly affected by the business domain, the websites, and the harvesting goals. In
the previous section, the features of a website affecting the HF were described.
In defining the harvestability factor for a harvester, in addition to the ability of
the harvester to deal with each one of the website features mentioned in the
previous section, there are also a number of general requirements which should
be met by the harvester. These two sets of features help us in defining the HF
for a harvester. Knowing about the techniques and methods applied in each
harvester helps in deciding how the harvester performs on each one of the
elements. Therefore, in this section, a subsection is dedicated to defining these
methods and techniques.</p>
      <sec id="sec-5-1">
        <title>High Level Requirements</title>
        <p>Every general-purpose harvester, regardless of the differences in its goal and the
domain and websites it targets, should meet a set of requirements important
in all harvesting processes. Being automatic, or running with minimal
configuration, is one of these requirements. Being scalable (applicable to a large number of
websites), independent (of business domain, technology, etc.), efficient (harvesting
the largest possible amount of data with the fewest possible queries),
easy to use (configuration and run settings should be easy for users to perform),
and resilient to changes in both website content and presentation are other
general requirements which should be met by a harvester. With these features,
a harvester should be able to fill in forms efficiently and automatically, extract
data/entities from the returned results pages, and store the extracted data 2 [15].
2 In [15], two more steps are also considered for a harvester: discovering deep web
sources of interest, and presenting the extracted data to users and providing them with
query-posing mechanisms.</p>
        <p>For all these steps considered in a deep web harvesting process, having
automatic error/change detection helps to improve the harvesting process. This
enables the harvester to do an uninterrupted harvest, as it becomes capable of
detecting and resolving issues like IP-based blocking, website failures, etc.
The harvester should be capable of providing firm guarantees about the
exhaustive coverage of the harvested part of the web. Size estimation of deep websites
[?] and also defining a stop condition for the harvesting process can help in
reaching this goal. When monitoring entities on the web, it becomes highly important that
the harvester be able to keep the data up-to-date. This requires harvesters
to be capable of detecting new and deleted entities on the web. While fulfilling
these high-level requirements, the harvester should also be capable of harvesting
the different categories of websites mentioned in Section 4. There are different
approaches in the literature to meet these requirements. In the following section,
these approaches are introduced.</p>
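        <p>The size-estimation idea mentioned above can be illustrated with a simple capture-recapture (Lincoln-Petersen) estimate over two document samples; this is a generic statistical sketch, not necessarily the estimator used in the cited work:</p>
        <preformat>
```python
# Sketch: estimate a deep website's size from two random result samples.
# Lincoln-Petersen: N ~= (n1 * n2) / overlap, where overlap is the
# number of documents seen in both samples.

def estimate_size(sample1, sample2):
    """Estimate collection size from two overlapping document samples."""
    overlap = len(set(sample1) & set(sample2))
    if overlap == 0:
        raise ValueError("samples do not overlap; cannot estimate size")
    return len(set(sample1)) * len(set(sample2)) / overlap

# 60 and 50 distinct documents with 30 in common -> about 100 documents.
print(estimate_size(range(0, 60), range(30, 80)))  # 100.0
```
        </preformat>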
      </sec>
      <sec id="sec-5-2">
        <title>Harvesting Techniques</title>
        <p>
          To access data behind web forms, a wide range of harvesters has been suggested in
the literature [
          <xref ref-type="bibr" rid="ref3 ref5 ref6 ref7 ref8 ref9">5,6,7,8,9,10,3</xref>
          ]. The differences among these harvesters stem from
different sources, from the techniques applied in each step of the harvesting process to the
main goal behind the harvester's design. In this paper, the focus in categorizing
harvesters is on the techniques and tools applied by harvesters to meet the
requirements introduced in Subsection 5.1 and Section 4. This categorization helps in
reasoning about why a harvester can work well for one website and not for another.
It also helps in judging a harvester's performance on a website with known
features before even applying it in practice. If it is known that harvesters from a
category have problems with websites with a specific feature, then, with the features
of the website and the harvester at hand, it is possible to predict the outcome of the
harvesting process. In the following, this classification is presented [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] 3.
        </p>
      </sec>
      <sec id="sec-5-3">
        <title>1. HTML-based harvesters</title>
        <p>
          In HTML-based harvesters, the harvester relies on a set of different features
of the document's HTML code [17]. To analyze the HTML structure of the pages,
the document is translated into a parsing tree. This can be done by using
browser controls like Internet Explorer to parse web pages into Document Object
Model (DOM) trees. Then, by running a number of pre-defined extraction
rules on the tree, the data is extracted.
2. Harvesters based on Natural Language Processing (NLP) techniques
In these harvesters [18,19], NLP techniques such as filtering, part-of-speech
tagging, and lexical semantic tagging are applied to build relationships
between phrases and sentences. From these extracted relationships, a number
of extraction rules can be derived. These rules are based on syntactic and
semantic constraints and help to identify the relevant information within a
document.
3 This categorization is introduced in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], except numbers 6 and 7, which are added by the
authors of this paper.
3. Machine learning based harvesters
        </p>
        <p>These harvesters [20] rely on a given set of training examples to derive a
number of extraction rules. In these techniques, rather than relying on linguistic
constraints found in the page, the rules are based on features of the structure of
the pieces of data.
4. Modeling-based harvesters</p>
        <p>In modeling-based harvesters [21,22], a data model is defined. In this data
model, a number of objects, their properties, and their relationships are defined.
Based on this data model and its modeling primitives, points of interest are
located in web pages.
5. Ontology-based harvesters</p>
        <p>In these harvesters [23], the extraction process is based on the data and not
the presentation structure. These harvesters need a specific domain
ontology. Through domain ontologies, concepts relevant to a particular topic or
area of interest are defined and made available to harvesters. Ontology-based
harvesters use these ontologies to locate the ontology's constants present in the
page and to construct objects associated with them.
6. Computer vision based harvesters</p>
        <p>These harvesters use computer vision techniques, in addition to techniques
from machine learning, to analyze web pages. In these harvesters, the main
goal is to identify and extract information from web pages by interpreting
them visually, as a human being does. Some of these approaches also use the
visual features of deep web pages [24].
7. Harvesters based on a combination of the previous categories. For example, in a
harvester based on HTML structure, applying machine learning techniques
could help in obtaining more precise extraction results.</p>
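        <p>A minimal sketch of the first category above (HTML-based harvesting): the page is parsed into a tree of events and one pre-defined extraction rule is applied. The sample page and the rule ("collect the text of links inside the results list") are invented for illustration; real harvesters in this category use far richer rule sets:</p>
        <preformat>
```python
# Sketch of an HTML-based extraction rule using Python's stdlib parser.
from html.parser import HTMLParser

class ResultLinkExtractor(HTMLParser):
    """Rule: collect the text of <a> elements inside <ul class="results">."""
    def __init__(self):
        super().__init__()
        self.in_results = False
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul" and ("class", "results") in attrs:
            self.in_results = True
        elif tag == "a" and self.in_results:
            self.in_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "ul":
            self.in_results = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1] += data

page = """<html><body>
<ul class="results">
  <li><a href="/job/1">Data Engineer</a></li>
  <li><a href="/job/2">Web Developer</a></li>
</ul></body></html>"""

extractor = ResultLinkExtractor()
extractor.feed(page)
print(extractor.links)  # ['Data Engineer', 'Web Developer']
```
        </preformat>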
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Harvestability Factor Validation</title>
      <p>As mentioned in Section 1, the HF can be used for evaluating a harvester in
harvesting and a website in being harvested. It was also discussed that this
factor can work as a design framework. To validate these claims, a collection of
deep websites is studied considering the HF elements. The harvester developed
by the authors of this paper, as an example effort at developing a general-purpose
harvester, is applied to the test set.</p>
      <sec id="sec-6-1">
        <title>Test Set</title>
        <p>To create the test set for illustrating how deep websites can be categorized based
on the HF elements and how this can be used in designing a general-purpose
harvester, a set of websites is selected from the list of top-100 job vacancy websites by
[25]. In the selection of websites from this list, the ones including job boards are
considered. To extend this test set, a set of Dutch job vacancy websites is also
considered. For each of these websites, all the elements of the HF are studied. To
examine the harvester's performance on each one of the categories, the harvester
is applied to the websites.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Developed Harvester</title>
        <p>The developed harvester is an HTML-based harvester which automates the loading
of pages in a browser. These features help to resolve the challenges caused by
some of the websites’ features mentioned in Section 4. For example, to enable
the harvester to execute scripts embedded in HTML pages, techniques
for automating browsers are applied. Also, for selecting the points of interest,
HTML-based techniques are used. These features also help the harvester
to meet the general requirements mentioned in Subsection 5.1, such as automation,
scalability, independency, efficiency, and ease of use. For efficiency
purposes, different query generation mechanisms can be applied to harvest the largest
possible amount of data with the fewest possible posed queries. The
configuration is limited to entering the template and the XPaths for the points of
interest. There is also no need to enter a data model for data storage. Given these
configurations for each website, a high level of scalability can be achieved.
Domain independency is also largely achieved by using only HTML-based techniques,
which also makes the harvester language-independent.</p>
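        <p>The per-site configuration described above (a page template plus XPaths for the points of interest) can be sketched as follows. The field names, URL template, and XPaths are hypothetical, and for brevity the sketch parses a well-formed page with the Python standard library rather than driving a real browser:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-site configuration: the only inputs the harvester needs
# are a URL template and XPaths for the points of interest.
SITE_CONFIG = {
    "url_template": "http://example.com/search?q={query}",
    "xpaths": {
        "title": ".//div[@class='vacancy']/h2",
        "location": ".//div[@class='vacancy']/span",
    },
}

def extract(page, xpaths):
    """Pull the configured points of interest out of one (well-formed) page."""
    root = ET.fromstring(page)
    record = {}
    for field, xpath in xpaths.items():
        node = root.find(xpath)
        record[field] = node.text if node is not None else None
    return record

page = ("<html><body>"
        "<div class='vacancy'><h2>Data Engineer</h2><span>Enschede</span></div>"
        "</body></html>")
print(extract(page, SITE_CONFIG["xpaths"]))
# {'title': 'Data Engineer', 'location': 'Enschede'}
```

Because all extraction hooks live in the configuration, adding a new website only means supplying a new template and XPath set, which is what keeps the approach scalable and domain-independent.</p>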
      </sec>
      <sec id="sec-6-3">
        <title>HF Validation as a Framework</title>
        <p>HF as a Design Framework Through this study, it is shown how these
websites are categorized by applying the HF elements and how this can guide
the design and implementation of a harvester. Having studied the set of deep
websites and prioritized their features, we apply the harvester to this set of
websites to show how these features affect harvesting processes in practice.</p>
        <p>Results Studying this set of websites from the job vacancy domain brings
a number of facts to light. If we assume that this set of websites represents the
job vacancy domain, the results can guide the design process by emphasizing
the elements of the HF that are faced more frequently. As can be seen in Table 6.1,
embedded scripts, query interfaces, data layouts, and non-persistent data patterns
need further attention during the harvester design process.</p>
        <p>Being based on browsers enables our harvester to overcome some of the
challenges caused by scripting languages embedded in HTML pages. This works well
as long as the content does not change based on user interaction with the page.
However, the presence of scripts that change the content based on user interaction,
browser, or time makes harvesting more difficult. To be capable of harvesting the
page as presented to users, the harvester should simulate user actions or changes
in the page environment and compare the generated result page with the previous
version of the page. This part is not included in the current version of our
harvester. However, it is worth mentioning that this type of script was not
encountered in our test collection, so it was reasonable to postpone the
implementation of this feature.</p>
        <p>The second most common HF element in the test set is detecting query
interfaces. In all the cases, our harvester could detect the template and query the
search engines. The other commonly faced feature is having different data layouts.
This is resolved in our harvester by making it possible to define different page
templates for each website. However, this may require considerable time and effort
during the configuration phase if a large number of page templates is used for
showing data. In HTML-based approaches, data can also be extracted based
on the content. Therefore, if the data patterns are consistent, high-quality
data extraction is still possible through the techniques applied by our harvester.
Among the websites in the test set, 15 percent limit the number of search results
that can be browsed. To resolve this problem, different query generation mechanisms
are applied which allow efficient harvesting of the deep website. The harvester can
also detect whether stopwords are indexed or not and send the next queries
accordingly. These meet two other common HF elements mentioned in Table 6.1.</p>
        <p>Table 6.1 lists the harvestability factor’s elements, each with the section of
this paper in which it is discussed: Embedded Script in HTML (4.1); Applet / Flash
(4.1); Data Layout, i.e. different layouts (4.3); Navigation, not straightforward
(4.2); Search Policies, i.e. limited search results (4.2); Indexing Policies, i.e.
stopwords not indexed (4.2); HTML Coding Practices, not persistent (4.1); Security /
Privacy / Legal Policies (4.2); URL Redirection (4.1); Residing Data, i.e. text with
no structure (4.3); Session Management (4.1); Query Interface Type (4.2); Persistent
Data Patterns, not persistent (4.3); Multi-frames (4.1).</p>
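        <p>One way to realize such a query generation mechanism is a greedy keyword-selection heuristic in the spirit of keyword-based deep-web harvesting [5, 6]; this sketch is illustrative and not necessarily the exact mechanism used by our harvester. At each step it poses terms that occur in many already-harvested documents but have not yet been issued:

```python
from collections import Counter

def next_queries(harvested_docs, posed, k=3):
    """Greedy follow-up keywords: terms that occur in many harvested
    documents but have not been posed yet (ties broken alphabetically)."""
    df = Counter()                      # document frequency per term
    for doc in harvested_docs:
        df.update(set(doc.lower().split()))
    candidates = [(term, n) for term, n in df.items() if term not in posed]
    candidates.sort(key=lambda tn: (-tn[1], tn[0]))
    return [term for term, _ in candidates[:k]]

docs = ["java developer amsterdam", "python developer utrecht", "java engineer"]
print(next_queries(docs, posed={"developer"}))
# ['java', 'amsterdam', 'engineer']
```

Frequent terms from the harvested content tend to match many still-unseen documents, which is what keeps the number of posed queries low relative to the amount of data retrieved.</p>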
        <p>Among the samples, it was observed that users are asked to enable cookies
for the website. This technique is being used more and more frequently by web
developers; therefore, harvesters should be able to recognize and
resolve it. To resolve other session management techniques, keeping the session
information and tracking the navigation path to the page are useful. On a website
with non-straightforward search navigation, which takes more steps than
going through search, browsing the results page, and viewing the detailed page, the
developed harvester could work successfully, provided that there are
only two types of page templates: search results pages and detailed pages.
The harvester can distinguish only these two types.</p>
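        <p>Recognizing and honoring such cookie requirements can be handled with a cookie-aware HTTP client. A minimal Python standard-library sketch (the URL is a placeholder, not a site from our test set):

```python
import http.cookiejar
import urllib.request

# Keep session state across requests: cookies set by the site (e.g. a
# session id) are stored in the jar and replayed on subsequent fetches.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# opener.open("http://example.com/search?q=java")  # placeholder URL:
# the response's Set-Cookie headers land in `jar`, and later requests
# made through `opener` send them back automatically.
```

The same opener can then be used for every page along the navigation path, so the session information survives from the search page to the detailed pages.</p>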
        <p>As can be seen in Table 6.1, for some of the HF elements no websites in the
test set were found. This might be due to the specifications of the test domain.
For example, the application of techniques like Applets or Flash can be seen
more frequently in domains like the graphics or music industries, and not so often
in the job vacancy domain. The same applies to requiring credentials to view job
vacancies, which is unacceptable in the business models of these companies. It is also
worth mentioning that determining some of these HF elements for a website is
time-consuming and sometimes hard. Persistent coding practices is one of those
elements: it is time-consuming to establish whether a website follows a persistent
coding paradigm unless an exception is encountered.</p>
        <p>Validation of Harvestability Factor as a Website/Harvester Evaluator
In this part, the ability of the HF to evaluate the harvestability of a website or the
capability of a harvester is discussed. As mentioned in Section 2, assigning values
to the weights and features in the HF_W formula is beyond the scope of this paper.
However, to show how this could be beneficial, we use a simple method to assign
these numbers. All the values are assigned as probabilities. We assign the percentage
of feature occurrence in the test set as the Cof. The Faf values are calculated through
the formula Faf = 1 − successrate_f. The successrate_f values are assigned by the
authors of this paper based on the capability of the developed harvester in resolving
the problems caused by each feature. Introducing a more accurate method for these
values will be studied as future work. The Crf values are assigned based on the
authors’ experience and the results observed during the experiments. They represent
how influential the corresponding feature is in the whole harvesting process. We
assign them one of five values: Highly Critical (1), Very Critical (0.75), Critical
(0.5), Effective but not Critical (0.25), and No Effect (0). Having assigned values
to all the parameters present in the formula, the HF_H can be calculated. The
best score results from having all the features equal to 1, which yields 15.
Our developed harvester scores 14.783. This high number tells us that the harvester
works well for the targeted domain. Having nine features absent in the websites
tested in this domain gives a high advantage to this harvester. Of course, by using
average numbers for each parameter of the HF formula, the harvester’s performance
can be evaluated in general. In this part, we only show the HF of the harvester for a
limited test set.
HF_H(Developed Harvester) = 14.783, out of a maximum score of 15.</p>
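        <p>As an illustration of how such a score can be computed, the sketch below uses one plausible per-feature reading of the HF_H formula from Section 2, with toy values; it is not the paper’s verbatim definition or measured data:

```python
def hf_h(features):
    """Sum one term per HF element. Each feature carries:
    cof  - occurrence rate of the feature in the test set (0..1),
    crf  - criticality weight (0, 0.25, 0.5, 0.75, or 1),
    succ - the harvester's success rate in resolving the feature (0..1).
    An absent (cof = 0) or fully resolved (succ = 1) feature contributes
    the maximum of 1, so a perfect harvester scores len(features)."""
    return sum(1 - f["crf"] * f["cof"] * (1 - f["succ"]) for f in features)

# Toy values for three features (illustrative, not the measured ones):
features = [
    {"cof": 0.60, "crf": 1.00, "succ": 0.90},  # embedded scripts
    {"cof": 0.15, "crf": 0.50, "succ": 1.00},  # limited search results
    {"cof": 0.00, "crf": 0.75, "succ": 0.00},  # applet/flash: absent
]
print(round(hf_h(features), 3))
```

Under this reading, the nine features absent from the test set each contribute their maximum term of 1, which is why their absence pushes the harvester’s score so close to the ceiling of 15.</p>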
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and Future Work</title>
      <p>As discussed in Section 6, the elements of the introduced harvestability factor
can categorize deep websites based on the features that are important in the
harvesting process. This enables the owners and designers of deep websites
to evaluate where their products stand from a harvesting point of view, and
helps them decide which measures to take in order to follow their
policies, whether that is increasing access or limiting it. For harvester designers, the
harvestability factor acts not only as an evaluation metric of how well the
harvester can behave in practice when dealing with different websites, but also as
a framework for designing deep web harvesters. The HF provides designers with
a thorough list of requirements they should meet. Categorizing deep
websites based on their harvestability factors makes it feasible to understand the
importance of different websites’ features, which helps to prioritize the features
to be addressed and included in the harvester.</p>
      <p>Having the HF as a comparison metric for different deep web harvesters is
another advantage of this introduced concept. To show how this can be applied,
we tested the formula on our own developed harvester over a predefined set of
job vacancy websites. To enable the measurement, we applied simple methods for
assigning values. The importance of each element was judged by combining the
authors’ experience and expertise with the frequency of usage of that element
among the test set websites. Having more than half of the elements absent among
the websites gave the harvester an advantage in obtaining a high score. Of course,
this shows that the harvester would work very well for this set. However, judging its
performance for a bigger or different domain requires new values for each parameter
in the formula. This is realized if average values for each parameter in the
formula are assigned. This enables developers to decide what to include in a
harvester and to predict the outcome in accurate numbers even before running the
harvester on the target websites.</p>
      <p>For the next step, we will study how to assign more accurate values
automatically. This means having websites classified based on the introduced elements
and judging the importance of each element in a more automatic approach.
Calculating the performance of a harvester in more general domains using average
values is also on the list of our future work. As another future work, we aim
to use the HF to guide us in developing a more general deep web harvester.
Using the studies performed in this paper and extending them to a bigger test
set will help us decide on the features our deep web harvester should include
and prioritize their development.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>We thank the WCC Company for hosting the first author, and Jan de Vos, Eliska
Went, and Marko Smiljanić for their support, discussions, and valuable input.
This publication is supported by the Dutch national program COMMIT.</p>
      <p>10. J. L. Hong, Deep web data extraction, in Systems Man and Cybernetics (SMC),
2010 IEEE International Conference on, pp. 3420–3427, 2010.
11. A. Herrouz, C. Khentout, and M. Djoudi, Overview of web content mining tools,
CoRR, vol. abs/1307.1024, 2013.
12. F. Johnson and S. K. Gupta, Web content mining techniques: A survey,
International Journal of Computer Applications, vol. 47, pp. 44–50, June 2012.
13. S. M. Khelghati, M. van Keulen, and D. Hiemstra, Designing a general deep
web access approach based on a newly introduced factor; harvestability factor
(HF), Technical Report TR-CTIT-14-08, Centre for Telematics and Information
Technology, University of Twente, Enschede, June 2014.
14. Dynamic web page. http://en.wikipedia.org/wiki/Dynamic_web_page, 2013.
15. N. Zhang and G. Das, Exploration of deep web repositories, PVLDB, vol. 4,
no. 12, pp. 1506–1507, 2011.
16. What is freeformat. http://www.gooseeker.com/en/node/knowledgebase/
freeformat, 2013.
17. V. Crescenzi, G. Mecca, and P. Merialdo, Roadrunner: Towards automatic data
extraction from large web sites, in Proceedings of the 27th International Conference
on Very Large Data Bases, VLDB ’01, (San Francisco, CA, USA), pp. 109–118,
Morgan Kaufmann Publishers Inc., 2001.
18. D. Freitag, Machine learning for information extraction in informal domains,
Mach. Learn., vol. 39, pp. 169–202, May 2000.
19. I. Muslea, S. Minton, and C. A. Knoblock, Hierarchical wrapper induction for
semistructured information sources, Autonomous Agents and Multi-Agent
Systems, vol. 4, pp. 93–114, Mar. 2001.
20. N. Kushmerick, Wrapper induction: Efficiency and expressiveness, Artif. Intell.,
vol. 118, pp. 15–68, Apr. 2000.
21. B. Adelberg, NoDoSE - a tool for semi-automatically extracting structured and
semistructured data from text documents, in SIGMOD Record, pp. 283–294, 1998.
22. B. Ribeiro-Neto, A. H. F. Laender, and A. S. da Silva, Extracting semi-structured
data through examples, in Proceedings of the Eighth International Conference
on Information and Knowledge Management, CIKM ’99, (New York, NY, USA),
pp. 94–101, ACM, 1999.
23. D. W. Embley, D. M. Campbell, Y. S. Jiang, S. W. Liddle, D. W. Lonsdale, Y.-K.
Ng, and R. D. Smith, Conceptual-model-based data extraction from
multiple-record web pages, Data Knowl. Eng., vol. 31, pp. 227–251, Nov. 1999.
24. W. Liu, X. Meng, and W. Meng, ViDE: A vision-based approach for deep web
data extraction, IEEE Transactions on Knowledge and Data Engineering, vol. 22,
no. 3, pp. 447–460, 2010.
25. The top 100 websites for your career. http://www.forbes.com/sites/
jacquelynsmith/2013/09/18/the-top-100-websites-for-your-career/, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajaraman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <article-title>Crawling deep web entity pages</article-title>
          ,
          <source>in Proceedings of the Sixth ACM International Conference on Web Search and Data Mining</source>
          , WSDM '13, (New York, NY, USA), pp.
          <fpage>355</fpage>-<lpage>364</lpage>
          ,
          <issue>ACM</issue>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Cafarella</surname>
          </string-name>
          ,
          <article-title>Extracting and Querying a Comprehensive Web Database</article-title>
          ,
          <source>in Proceedings of the Conference on Innovative Data Systems Research (CIDR)</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J.</given-names>
            <surname>Madhavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ganapathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rasmussen</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Halevy</surname>
          </string-name>
          ,
          <article-title>Google's Deep Web crawl</article-title>
          ,
          <source>Proc. VLDB Endow.</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>1241</fpage>-<lpage>1252</lpage>
          ,
          Aug.
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>M.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Raposo</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Vina</surname>
          </string-name>
          ,
          <article-title>Client-side deep web data extraction</article-title>
          ,
          <source>in Proceedings of the E-Commerce Technology for Dynamic E-Business</source>
          , IEEE International Conference, CEC-EAST '04, (Washington, DC, USA), pp.
          <fpage>158</fpage>-<lpage>161</lpage>
          , IEEE Computer Society,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>L.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          ,
          <article-title>Siphoning hidden-web data through keyword-based interfaces</article-title>
          ,
          <source>in SBBD</source>
          , pp.
          <fpage>309</fpage>-<lpage>321</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Ntoulas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zerfos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Downloading textual hidden web content through keyword queries</article-title>
          ,
          <source>in Proceedings of the 5th ACM/IEEE-CS joint conference on Digital libraries , JCDL '05</source>
          , (New York, NY, USA), pp.
          <fpage>100</fpage>-<lpage>109</lpage>
          ,
          <issue>ACM</issue>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <article-title>Crawling the hidden web</article-title>
          ,
          <source>in Proceedings of the 27th International Conference on Very Large Data Bases , VLDB '01</source>
          , (San Francisco, CA, USA), pp.
          <fpage>129</fpage>-<lpage>138</lpage>
          , Morgan Kaufmann Publishers Inc.,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A. H. F.</given-names>
            <surname>Laender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Ribeiro-Neto</surname>
          </string-name>
          ,
          A. S. da Silva, and
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Teixeira</surname>
          </string-name>
          ,
          <article-title>A brief survey of web data extraction tools</article-title>
          ,
          <source>SIGMOD Rec.</source>
          , vol.
          <volume>31</volume>
          , pp.
          <fpage>84</fpage>-<lpage>93</lpage>
          ,
          <year>June 2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Theobald</surname>
          </string-name>
          ,
          <article-title>From information to knowledge: harvesting entities and relationships from web sources</article-title>
          , in PODS (J. Paredaens and
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Gucht</surname>
          </string-name>
          , eds.), pp.
          <fpage>65</fpage>-<lpage>76</lpage>
          ,
          <issue>ACM</issue>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>