<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Wei Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaofeng Meng</string-name>
          <email>xfmeng@ruc.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information, Renmin University of China</institution>
          ,
          <addr-line>Beijing, 100872</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>More and more accessible databases are available on the Web. To give people unified access to these Web databases and to obtain information from them automatically, this paper proposes a comprehensive solution for Web database integration. After summarizing the research status in this area, it presents the work that is the focus of my PhD thesis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        According to the survey [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] released by UIUC in 2004, there
were more than 300,000 Web databases and 450,000 query
interfaces available at that time, and the two figures are still
increasing quickly. Besides their scale, the
contents of Web databases span all
topics. Some Deep Web portal services provide Deep Web
directories that classify Web databases into taxonomies.
For example, CompletePlanet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the biggest Deep Web
directory, has collected more than 7,000 Web databases and
classified them into 42 topics. Combining the above two
aspects, we can conclude that these Web databases are
like a huge repository and give people a great
opportunity to get their desired information.
      </p>
      <p>Proceedings of the VLDB2006 Ph.D. Workshop,
Seoul, Rep. of Korea, 2006.
(c) 2006 for the individual paper by the paper's authors.
Copying permitted for private and scientific purposes.
Republication of material on this page requires permission by
the copyright owners.</p>
      <p>The proliferation of Web databases is not only an
opportunity but also a challenge. At present, people
access Web databases mainly by hand, and
this incurs considerable overhead.</p>
      <p>Here is an example to explain the problem. Suppose Jane
wants to buy a book on Java. There are several tasks she
has to complete. First, she must find the Web sites that
sell books. If she wants to save money, she needs several Web sites
to compare. Second, she fills in the query interfaces
with an appropriate query (for example, filling in the book title with
"think in java") and submits them. Third, when the Web
pages containing the query results are returned (these Web pages are
generally called response pages), she browses them in turn
and chooses the best book. The whole process is
time-consuming; Jane may spend half a day on it.
Therefore, the challenge of the manual approach is that people
often have difficulties first in finding the right sources and
then in querying them.</p>
      <p>It is urgent and necessary to integrate Web databases,
to give people unified access to them, and to obtain
information automatically. Web database integration can
be considered as heterogeneous data source integration
in the Web context. Traditional heterogeneous data source
integration generally focuses on the heterogeneity and
autonomy of data sources. According to my investigation, Web
databases also have four distinct characteristics that
set them apart from other heterogeneous data sources:</p>
      <p>[Figure 2: architecture of the Web database integration solution. Top-level labels: Integrated Interface, RDB.]</p>
    </sec>
    <sec id="sec-2">
      <title>Result Process Module</title>
      <p>[Figure 2 labels: Results Extraction, Results Annotation, Data Merging.]</p>
    </sec>
    <sec id="sec-3">
      <title>Query Process Module</title>
      <p>[Figure 2 labels: WDB Selection, Query Translation, Query Submission; underneath, the Web databases (WebDB) on the Web.]</p>
    </sec>
    <sec id="sec-4">
      <title>Interface Integration Module</title>
      <p>[Figure 2 labels: WDB Discovery, Interface Schema Extraction, WDB Clustering, Interface Integration.]</p>
      <p>• Scale: There are myriads of Web databases on the Web,
and even under a specific topic the quantity of Web
databases is still striking.
• Dynamic: First, Web databases are very sparsely
distributed on the Web, and they appear and disappear
constantly, so searching for appropriate Web databases
on the Web is like looking for a few needles in a
haystack. Second, the contents of Web databases are
usually updated frequently. Especially in some
topics, such as airline and job, a batch of new
contents is added to Web databases every day and the
outdated part is removed. So the information in Web
databases is "ever", not "forever", to you.
• Access through query interfaces: Because of this
peculiar access approach, the schema of a Web database cannot
be captured directly. We can only infer the schema
from its query interfaces and response pages.
• Heterogeneity: The query interfaces and response
pages are designed by different people and there are
no design standards to follow. Even in the same topic,
the query interfaces and response pages are often very
dissimilar.</p>
      <p>In short, research on Web database integration aims to
help people use the abundant information in Web
databases effectively and efficiently. But because of the distinct
characteristics of Web databases, there are many challenging
research issues in this area.</p>
      <p>My PhD thesis focuses on building a Web database
integration system and addressing several challenging issues in
this area. This paper presents a comprehensive solution for Web
database integration and outlines my current and future
research work in this area.</p>
      <p>One fact should not be neglected. Some Web
sites provide Web Services for their Web databases, and
people can use a customized program to access them.
But this approach has two limitations: first, only a small
portion of Web sites provide Web Services for their Web
databases; second, it depends on a
customized program, which is not easy for common
users. So in this paper we focus on the popular approach
of accessing Web databases through the query interfaces in
Web pages.</p>
      <p>The rest of this paper is organized as follows. Section 2 gives
the solution for Web database integration; Section 3
summarizes the research status in this area; Section 4 presents
the work we are focusing on now and will focus on in the future;
Section 5 concludes.</p>
      <sec id="sec-4-1">
        <title>2. A SOLUTION FOR WDB INTEGRATION</title>
        <p>This section proposes a comprehensive solution for Web database
integration, which is the goal of my PhD
work. Figure 2 shows the architecture of the solution. The
solution includes three primary modules: the integrated
interface generation module, the query processing module, and the result
processing module.</p>
        <p>Integrated interface generation module: Produces an
integrated interface over the query interfaces of the Web
databases to be integrated. This module has four components,
whose functions are as follows:
• Web database discovery: Search for Web sites that have
Web databases behind them, and identify the query
interfaces among the Web pages of these Web sites.
• Query interface schema extraction: Extract the
attributes in query interfaces (such as "Title" and
"Author" in Figure 1), and the meta-information about
each attribute (such as value type, default value, etc.).
• Web database clustering by topic: Cluster all
discovered Web databases into different groups. The Web
databases in each group belong to the same topic.
• Interface integration: Given the Web databases in the
same topic, merge the semantically equivalent attributes in
different query interfaces into a global attribute, and
finally form an integrated interface.</p>
        <p>Query processing module: Processes a user's query filled
in on the integrated interface, and submits the query to each Web
database. This module has three components, whose
functions are as follows:
• Web database selection: Select appropriate Web databases
for a user's query in order to get satisfying results
at minimal cost.
• Query translation: Translate the query on the
integrated interface, as equivalently as possible, into a set of local queries
on the query interfaces of Web databases.
• Query submission: Analyze the submission approaches
of local query interfaces, and submit each local query
automatically.</p>
        <p>Result processing module: Extracts the query results
obtained from Web databases, and merges the results
under a global schema. This module has three components,
whose functions are as follows:
• Result extraction: Identify and extract the pure results
from the response pages returned by Web databases.
• Result annotation: Attach the proper semantics to
the extracted results.
• Result merging: Merge the results extracted from
different Web databases under a global schema.
These components work together and make up a
comprehensive solution for Web database integration. It is not
difficult to see that there are dependency relationships
between them, which Figure 2 shows.
For example, the query processing module depends
on the integrated interface generation module (high level), and
interface integration depends on Web database clustering (low
level). So the quality of the implementation of a component
greatly affects the next component.</p>
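        <p>The dependency relationships described above can be sketched as a small graph whose topological order gives a valid build order for the components. This is only an illustration: the text states just the edge from interface integration to Web database clustering and the module-level order, so the remaining edges, the component identifiers, and the helper name build_order are assumptions made here from the order in which the components are listed.</p>
        <preformat><![CDATA[
```python
from graphlib import TopologicalSorter

# Each component maps to the components it depends on.
# Only "interface_integration -> wdb_clustering" and the module-level order
# come from the text; all other edges are assumed from the listing order.
DEPENDS_ON = {
    "wdb_discovery": set(),
    "interface_schema_extraction": {"wdb_discovery"},
    "wdb_clustering": {"interface_schema_extraction"},
    "interface_integration": {"wdb_clustering"},
    "wdb_selection": {"interface_integration"},
    "query_translation": {"wdb_selection"},
    "query_submission": {"query_translation"},
    "result_extraction": {"query_submission"},
    "result_annotation": {"result_extraction"},
    "result_merging": {"result_annotation"},
}


def build_order(depends_on):
    """Return the components in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


ORDER = build_order(DEPENDS_ON)
```
]]></preformat>
        <p>Because every component appears after the components it depends on, a weak implementation early in ORDER affects everything that follows it.</p>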
        <p>In fact, each component can be considered as a research issue
in itself. To build a practical Web database integration
system, these issues must first be solved well in theory.
Section 3 discusses the research status in this area.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3. RESEARCH STATUS IN THIS AREA</title>
        <p>A large amount of effort has been devoted to this area.
Because of the space limit, the related work cannot be
discussed comprehensively and in detail. We only summarize it
according to the issues addressed, and
give the representative works.</p>
        <p>Unfortunately, despite these efforts, research in this area has
developed very unevenly.
Several issues have already been addressed well, and the solutions are
mature enough to rely on (developed issues); some issues
are still developing and need deeper research (developing
issues); and some issues have not been touched yet
(undeveloped issues). We summarize the research status according
to the development of these issues.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.1 Developed Issues</title>
        <p>
          Interface integration: This issue has received much attention, and
several effective approaches [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ][
          <xref ref-type="bibr" rid="ref4">4</xref>
          ][
          <xref ref-type="bibr" rid="ref5">5</xref>
          ][
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] have been proposed to solve this
problem. These approaches match attributes of query
interfaces by exploiting the semantic similarity between labels as
well as that between data instances.
        </p>
        <p>
          Query interface schema extraction: To understand the
query capabilities a query interface supports, [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] transforms
query interfaces into a visual language, and develops a 2P
grammar and a best-effort parser to realize a parsing
mechanism.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>3.2 Developing Issues</title>
        <p>For each developing issue, we introduce the current
approaches and point out their
shortcomings.</p>
        <p>
          Web database discovery: [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] proposed a strategy that
focuses the crawl on a given topic and chooses, within the topic, the links
that are more likely to lead to pages
containing query interfaces. It cannot guarantee the quantity
of discovered Web databases. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] uses automatic feature
generation to describe candidates and C4.5 decision trees to
detect query interfaces. It cannot differentiate the query
interfaces of search engines from those of Web databases.
        </p>
        <p>
          Web database clustering: [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] performs the clustering based
on the features available on the interface page. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] proposed
an objective function, model-differentiation, to compute the
probability that a query interface belongs to a topic. The
accuracy of both depends on the schema information of query
interfaces, so they are not good at dealing with query
interfaces that have a simple schema.
        </p>
        <p>
          Result extraction: Many approaches have been proposed to
address this issue. Most of them [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ][
          <xref ref-type="bibr" rid="ref14">14</xref>
          ][
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] first transform
the response page into an HTML tag tree, then identify and
extract data records or data items by analyzing the tree
structure and tag information. They can only deal with Web
pages written in HTML, which is a latent
shortcoming as the Web develops.
        </p>
        <p>
          Result annotation: This problem is often solved during the
process of result extraction. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] finds the proper
annotation of an extracted data item in the response page using
some heuristic rules. This is very effective if a data item
really has its annotation in the response page, but it cannot
ensure that all data items get their annotations.
        </p>
        <p>
          Entity identification: Entity identification is one of the key
components of data merging. Several approaches have been
proposed to solve this problem. For example, [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] applies
a set of domain-independent string transformations to
compare the entities' shared attributes in order to identify
matching entities. All current approaches assume that a
well-built schema match between Web databases is already available,
but schema matching in the Web context has not been solved yet.
        </p>
      </sec>
      <sec id="sec-4-5">
        <title>3.3 Undeveloped Issues</title>
        <p>The undeveloped issues include Web database selection, query
translation, and data merging. These issues have been well
studied in some contexts (such as data warehouses), but no
approaches have been proposed to address them
in the context of Web database integration, where they are
indispensable.</p>
        <p>Among these developing and undeveloped issues, entity
identification, result extraction, and Web database selection are
part of my present and future PhD work, and are
discussed in Section 4.</p>
      </sec>
      <sec id="sec-4-6">
        <title>4. SEVERAL RESEARCH WORKS</title>
        <p>This section presents for discussion several research works
that are in progress or planned for the
future.</p>
      </sec>
      <sec id="sec-4-7">
        <title>4.1 Entity Identification among Web Databases</title>
        <p>Entity identification is a key operation in integrating data
from multiple sources, and it has been well studied for
years. As discussed in Subsection 3.2, though several
solutions have already been proposed for Web databases, all of
them are based on the assumption that the schema match
between Web databases has already been built well. As is well known,
because of the poor structure of Web pages, schema matching in the
Web context is very hard, and there is still no
automatic solution for it.</p>
        <p>So we are trying to find a way to implement entity
identification between Web databases without the help of schema
matching. Our basic consideration is as follows.
We do not try to analyze the structure (or schema) of data
records in response pages. Instead, given two Web databases
A and B, each data record from A or B is considered as a
text document. We judge whether data record a (from A)
and data record b (from B) refer to the same entity by comparing their text
similarity. Obviously, computing the text
similarity of two data records directly is naive, and its accuracy was
not satisfying in our test. The reason is that the
parts of a data record differ in importance, and there
is much noise information in a data record (for example, the
words "author" and "price" often appear in book data
records). To make the similarity of a and b more
reasonable (ideally, if a and b refer to the same entity and a
and c do not, then the similarity of a and b must be
greater than that of a and c), our approach is implemented as
follows:
1. filter the noise information from a and b as far as possible;
2. segment a into several blocks, and formulate each block of a
into a query for b;
3. compute the similarity of each block and b;
4. assign an appropriate weight to the similarity of each
block and b, and sum them up;
5. judge whether a and b refer to the same entity according
to the overall similarity.</p>
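        <p>The five steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of the thesis: the stop list NOISE_WORDS, the "|" block separator, the word-containment score used for a block query, and the hand-picked weights and threshold are all hypothetical stand-ins (the text says that the weights and threshold are to be trained from sample record pairs).</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical stop list: label words that show up in almost every book record.
NOISE_WORDS = {"title", "author", "price", "by"}


def tokenize(text):
    """Step 1: lowercase the text, split it into words, drop noise words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in NOISE_WORDS]


def block_score(block, record_tokens):
    """Step 3: treat the block as a query; return the fraction of its words found in b."""
    words = tokenize(block)
    if not words:
        return 0.0
    return sum(1 for w in words if w in record_tokens) / len(words)


def record_similarity(a, b, weights, sep="|"):
    """Steps 2 and 4: segment a into blocks and sum the weighted block scores."""
    blocks = [blk for blk in a.split(sep) if blk.strip()]
    record_tokens = set(tokenize(b))
    return sum(w * block_score(blk, record_tokens) for w, blk in zip(weights, blocks))


def same_entity(a, b, weights, threshold):
    """Step 5: a and b match when the overall similarity reaches the threshold."""
    return record_similarity(a, b, weights) >= threshold
```
]]></preformat>
        <p>With weights such as 0.6, 0.3, and 0.1 for the title, author, and price blocks, two records of the same book that differ only in price keep a high overall similarity, while a different book that happens to share the price does not.</p>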
        <p>At present, we are working to find an effective algorithm
to train the weights and the threshold of the overall similarity
from a small set of sample data record pairs. A data record
pair consists of two data records from different Web databases that
refer to the same entity. The algorithm is now being
detailed. The preliminary experimental result is very satisfying
under the book topic. Further experiments under other
topics (car, estate, etc.) will follow.</p>
      </sec>
      <sec id="sec-4-8">
        <title>4.2 Vision Based Result Extraction</title>
        <p>Most current approaches extract the results from response
pages based on HTML. But they have several
inherent limitations. First, besides HTML, other
languages, such as XML and XHTML, have been introduced
to design Web pages. Second, HTML is still evolving. New
versions of HTML will be proposed in the future, and new
tags may continue to appear and be applied. Third, as more
and more Web pages use complex JavaScript and CSS
to influence the structure of Web pages, the applicability of
the existing solutions will decrease. Fourth, if HTML
is replaced by a new language in the future, previous
solutions will have to be revised greatly or even abandoned,
and other approaches must be proposed to accommodate
the new language.</p>
        <p>
          Given these motivations, it is important to find an
approach which is vision based and language independent. In the
current phase, we only aim at response pages with
multiple data records. Our basic idea is that, though the data
records in a response page differ in content, they
are similar in appearance. The following is the
implementation we are working on:
1. obtain the vision information (such as the font of a
text, the size of an image, and their location in the
Web page) by accessing the programming interface of the Web
browser;
2. build a vision based block tree with the VIPS [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] algorithm.
        </p>
        <p>A data record is composed of one or more blocks in
the vision based block tree. So result extraction here
means finding these blocks and judging which blocks compose
a data record.
3. locate the data region (the region that contains all data
records in a response page) in the vision based block
tree;
4. find the boundaries of all data records by computing
the vision similarity of blocks in the vision based block
tree.</p>
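        <p>Step 4 can be illustrated with a small sketch. It assumes that the data region has already been located and flattened into a sequence of leaf blocks, and it uses one simple heuristic chosen here for illustration: a new data record starts at every block that looks like the first block of the region. The Block fields, the feature set, and the 0.75 threshold are hypothetical, and the sketch does not implement the VIPS algorithm itself.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A leaf block of the data region with a few hypothetical visual features."""
    text: str
    font: str
    font_size: int
    bold: bool
    left: int  # horizontal offset in pixels


def visual_similarity(p, q):
    """Fraction of the visual features on which two blocks agree."""
    agree = [
        p.font == q.font,
        p.font_size == q.font_size,
        p.bold == q.bold,
        abs(p.left - q.left) <= 2,
    ]
    return sum(agree) / len(agree)


def split_records(blocks, threshold=0.75):
    """Start a new record at every block that looks like the first block."""
    records = []
    for blk in blocks:
        if not records or visual_similarity(blocks[0], blk) >= threshold:
            records.append([blk])
        else:
            records[-1].append(blk)
    return records
```
]]></preformat>
        <p>The sketch never inspects tags or text content, only appearance, which is the property that makes a vision based approach independent of the page language.</p>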
        <p>The preliminary experiment indicates that this approach
is not only independent of the HTML language, but also well suited
to extracting information-rich data records.</p>
      </sec>
      <sec id="sec-4-9">
        <title>4.3 Web Database Selection</title>
        <p>There are myriads of Web databases on the Web, so many
Web databases may be integrated under one topic. If a user
submits a query on the integrated interface and the query
is dispatched to all the integrated Web databases,
processing all the returned
results, especially data cleaning and deduplication, will be
time-consuming and costly. In most
cases, we only need to select several of them to get
satisfying results. So Web database selection aims to
select appropriate Web databases for a given user's query on the
integrated interface, which helps users get their desired
results at the lowest cost.</p>
        <p>To judge whether a Web database should be selected
to answer a given query, two aspects must be
considered. One is the pertinence of the Web database to the
given query; the other is the query capability of the query
interface of the Web database. The following gives some of our
considerations about the two aspects.</p>
        <p>The prerequisite for selecting a Web database is that it is
pertinent to the given query. In the extreme case, it is meaningless to
query a Web database if it does not have any useful
information for the query. Figure 3 gives an example to illustrate
this. Suppose A, B, C, and D are four Web databases, and
q is a query to them. The size of A, B, C, and D is
the quantity of data records in them, and the size of q is the
quantity of data records that satisfy q. Intuitively, C does
not satisfy q at all, B satisfies q partly, and A and D can satisfy
q completely, but in the end D is a better selection than
A. So we need to obtain the features of Web databases
in advance. The features of a Web database include the
size, the update ratio, the distribution on each attribute,
etc. Because we can only access a Web database through
its query interface, it is impossible to examine a Web
database directly. The challenge is how to obtain the
features through the query interface only. In the future, we want to
design a sample records retriever to address this problem.
The sample records retriever is a tool that obtains a small
set of data records which are distributed evenly in the Web
database. We can profile the Web database by analyzing the
obtained data records. The sample records retriever should have
two components: a query interface analyzer and a query
generator. The query interface analyzer obtains the necessary
information about each attribute; the query generator produces a
set of smart queries according to the information obtained
by the query interface analyzer.</p>
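        <p>How a profile built from sample records could drive the selection is sketched below. The sketch assumes that each Web database is profiled by an estimated size and an evenly distributed sample, estimates the number of matching records by scaling the hit rate in the sample, and breaks ties in favor of the smaller database. That tie-break is an assumption made here about why D is preferred to A in the example; the function names and the profile layout are hypothetical.</p>
        <preformat><![CDATA[
```python
def estimate_matches(size, sample, predicate):
    """Scale the hit rate seen in the sample up to the whole database."""
    if not sample:
        return 0.0
    hits = sum(1 for record in sample if predicate(record))
    return size * hits / len(sample)


def rank_databases(profiles, predicate):
    """Rank by estimated matching records (most first), then by size (smallest first).

    profiles maps a database name to (estimated size, list of sampled records).
    Databases whose sample contains no matching record are dropped.
    """
    scored = []
    for name, (size, sample) in profiles.items():
        matches = estimate_matches(size, sample, predicate)
        if matches > 0:
            scored.append((name, matches, size))
    scored.sort(key=lambda item: (-item[1], item[2]))
    return [name for name, _, _ in scored]
```
]]></preformat>
        <p>The quality of the ranking depends entirely on how evenly the sample covers the database, which is why the query generator of the sample records retriever matters.</p>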
        <p>The query capabilities of query interfaces often differ
among Web databases, and this influences the
accuracy of a query. For example, in the book topic, a query
on the integrated interface is "title=java and price&lt;20$". If
the query interface of a Web database contains both
attributes, it can answer the query accurately. But
if it only contains the attribute "title" or "price", then the
results returned from the Web database will contain quite
many data records which do not satisfy the query. So the
challenging task is how to make the returned results
satisfying (for example, the minimal superset or maximal
subset of the query).</p>
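        <p>One way to read the "minimal superset" option is sketched below: the query is relaxed to the conditions that the local interface supports, and the returned superset is then filtered locally with the full query. This is an illustration under assumptions, not a method proposed in the paper: a query is modeled as a mapping from attribute names to predicates, the remote call is simulated by filtering an in-memory list, and local filtering only works if the returned records expose the unsupported attribute (here, the price).</p>
        <preformat><![CDATA[
```python
def translate(query, supported):
    """Keep only the conditions the local query interface can express."""
    return {attr: cond for attr, cond in query.items() if attr in supported}


def matches(record, query):
    """True when the record satisfies every condition of the query."""
    return all(cond(record[attr]) for attr, cond in query.items())


def answer(query, supported, database):
    """Send the relaxed local query, then filter the returned superset locally."""
    local_query = translate(query, supported)
    # The list comprehension below stands in for the remote Web database call.
    returned = [r for r in database if matches(r, local_query)]
    return [r for r in returned if matches(r, query)]
```
]]></preformat>
        <p>The fewer conditions the local interface supports, the larger the returned superset and the higher the cost of the local filtering step, which ties query capability back to Web database selection.</p>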
      </sec>
      <sec id="sec-4-10">
        <title>5. CONCLUSIONS</title>
        <p>With the rapid increase of Web databases, it is urgent
to integrate these Web databases, to give people
unified access to them, and to obtain information automatically.
This paper proposes a comprehensive solution for Web database
integration. The solution has a number of components,
each of which is also a research issue in
this area. After summarizing the research status of
these issues, we introduce the issues that are
being focused on now and will be addressed in the future. In
conclusion, the focus of my PhD thesis is building a Web
database integration system and addressing several issues in
this area.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>Structured Databases on the Web: Observations and Implications</article-title>
          .
          <source>SIGMOD Record</source>
          <volume>33</volume>
          (
          <issue>3</issue>
          ):
          <fpage>61</fpage>
          -
          <lpage>70</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] http://www.completeplanet.com/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Statistical Schema Matching across Web Query Interfaces</article-title>
          .
          <source>SIGMOD Conference</source>
          <year>2003</year>
          :
          <fpage>217</fpage>
          -
          <lpage>228</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>WISE-Integrator: An Automatic Integrator of Web Search Interfaces for E-Commerce</article-title>
          .
          <source>VLDB Conference</source>
          <year>2003</year>
          :
          <fpage>117</fpage>
          -
          <lpage>128</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Doan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>WebIQ: Learning from the Web to Match Deep-Web Query Interfaces</article-title>
          .
          <source>ICDE Conference</source>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dragut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sistla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          .
          <article-title>Merging Source Query Interfaces on Web Databases</article-title>
          .
          <source>ICDE Conference</source>
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Understanding Web Query Interfaces: Best-Effort Parsing with Hidden Syntax</article-title>
          .
          <source>SIGMOD Conference</source>
          <year>2004</year>
          :
          <fpage>107</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Automatic extraction of web search interfaces for interface schema integration</article-title>
          .
          <source>WWW Conference</source>
          <year>2004</year>
          :
          <fpage>414</fpage>
          -
          <lpage>415</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Barbosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Freire</surname>
          </string-name>
          .
          <article-title>Searching for Hidden-Web Databases</article-title>
          .
          <source>WebDB</source>
          <year>2005</year>
          :
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cope</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hawking</surname>
          </string-name>
          .
          <article-title>Automated Discovery of Search Interfaces on the Web</article-title>
          .
          <source>ADC Conference</source>
          <year>2003</year>
          :
          <fpage>181</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>WISE-cluster: clustering e-commerce search engines automatically</article-title>
          .
          <source>WIDM</source>
          <year>2004</year>
          :
          <fpage>104</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. C.</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <article-title>Clustering Structured Web Sources: A Schema-Based, Model-Differentiation Approach</article-title>
          .
          <source>EDBT</source>
          <year>2004</year>
          :
          <fpage>536</fpage>
          -
          <lpage>546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Grossman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Mining data records in Web pages</article-title>
          .
          <source>KDD Conference</source>
          <year>2003</year>
          :
          <fpage>601</fpage>
          -
          <lpage>606</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Web data extraction based on partial tree alignment</article-title>
          .
          <source>WWW Conference</source>
          <year>2005</year>
          :
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Raghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. T.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>Fully automatic wrapper generation for search engines</article-title>
          .
          <source>WWW Conference</source>
          <year>2005</year>
          :
          <fpage>66</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tejada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Knoblock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Minton</surname>
          </string-name>
          .
          <article-title>Learning domain-independent string transformation weights for high accuracy object identification</article-title>
          .
          <source>KDD Conference</source>
          <year>2002</year>
          :
          <fpage>350</fpage>
          -
          <lpage>359</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. H.</given-names>
            <surname>Lochovsky</surname>
          </string-name>
          .
          <article-title>Data extraction and label assignment for web databases</article-title>
          .
          <source>WWW Conference</source>
          <year>2003</year>
          :
          <fpage>187</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wen</surname>
          </string-name>
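          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Extracting Content Structure for Web Pages Based on Visual Representation</article-title>
          .
          <source>APWeb Conference</source>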
          <year>2003</year>
          :
          <fpage>406</fpage>
          -
          <lpage>417</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>