<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Automatic Structured Web Data Extraction System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tomas Grigalis</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vilnius Gediminas Technical University</institution>
          ,
          <addr-line>Vilnius</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <fpage>197</fpage>
      <lpage>201</lpage>
      <abstract>
        <p>Automatic extraction of structured data from web pages is one of the key challenges for Web search engines to advance to a more expressive semantic level. Here we propose a novel data extraction method called ClustVX. It exploits visual as well as structural features of web page elements to group them into semantically similar clusters. The resulting clusters reflect the page structure and are used to derive data extraction rules. Preliminary evaluation results of the ClustVX system on three public benchmark datasets demonstrate high efficiency and indicate the need for a much larger, up-to-date benchmark data set that reflects contemporary Web 2.0 pages.</p>
      </abstract>
      <kwd-group>
        <kwd>Information extraction</kwd>
        <kwd>structured web data</kwd>
        <kwd>deep web</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        For Web search engines to advance to a more expressive semantic level, we need
tools that can extract information from the Web and represent it in a machine-readable
format such as RDF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Information extraction at Web scale is the first step in
pursuing this goal. However, current algorithmic approaches often fail to achieve
satisfactory performance in real-world scenarios due to the abundance of structurally
complex Web 2.0 pages.
      </p>
      <p>In this work we address the problem of automatically extracting structured Web data,
such as lists of products in online stores. We propose a novel approach,
called ClustVX, which is fully automatic, scalable, and domain-independent.</p>
      <p>ClustVX is based on two fundamental observations. First, a vast amount of
information on the Web is presented using fixed templates filled with data from underlying
databases. For example, Fig. 1(a) shows three Data Records (DRs) representing
information about three digital cameras in an online store. The three DRs are rendered
according to a style template that is unknown to us, and the information comes from a database. This
also means that each DR has almost the same XPath (the tag path from the root node of the HTML
tree to a particular web page element), in which only a few node indices differ.</p>
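      <p>To make the observation concrete: stripping the positional indices from the XPaths of sibling Data Records collapses them to one shared template path. A minimal sketch, using hypothetical XPaths rather than the paper's actual data:</p>
      <preformat>
```python
import re

def strip_indices(xpath):
    # Remove positional predicates such as [2] to obtain the tag-path template.
    return re.sub(r"\[\d+\]", "", xpath)

# Hypothetical XPaths of three Data Records on one result page.
paths = [
    "/html/body/div[2]/ul/li[1]/div",
    "/html/body/div[2]/ul/li[2]/div",
    "/html/body/div[2]/ul/li[3]/div",
]

templates = {strip_indices(p) for p in paths}
print(templates)  # all three collapse to the same template path
```
      </preformat>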
      <p>Second, although the templates and underlying data differ from site to site, humans
understand them easily by analyzing the repeating visual patterns on a given web page. We
hypothesize that data with the same semantic meaning is visualized using the
same style. Therefore humans viewing such a web page are able to comprehend its
structure quickly and effortlessly and to distinguish item photos, titles, prices, and
so on. For example, in Fig. 1(a) the prices are brownish red and bold, the titles are green and bold,
and the text "Online Price" is grey.</p>
      <p>ClustVX exploits both of these observations by representing each web page
element with a combination of its XPath and visual features such as font and color.
For each visible web page element we encode this combination into a string called an
Xstring. Clustering the Xstrings allows us to identify visually similar elements that are
located in the same region of a web page and thus carry the same semantic meaning.
See Fig. 1(b), where price elements are clustered together according to their Xstrings.
Subsequent data extraction yields machine-readable structured data; the result of
this extraction is shown in Fig. 1(c). Our preliminary evaluation on three public datasets
demonstrates that the new method consistently achieves high recall and precision
in extracting structured data from web pages.</p>
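      <p>The Xstring idea can be sketched as follows. This is a simplified illustration with hypothetical element records and an assumed feature encoding; the actual feature set and string format are those described in Sec. 2:</p>
      <preformat>
```python
from collections import defaultdict

def xstring(tag_path, style):
    # Concatenate the index-free tag path with the element's visual features.
    visual = ";".join(f"{k}={v}" for k, v in sorted(style.items()))
    return f"{tag_path}|{visual}"

# Hypothetical visible elements: (tag path without indices, computed style).
elements = [
    ("/html/body/div/ul/li/div/span", {"color": "rgb(153,51,0)", "font-weight": "bold"}),  # price
    ("/html/body/div/ul/li/div/span", {"color": "rgb(153,51,0)", "font-weight": "bold"}),  # price
    ("/html/body/div/ul/li/div/a",    {"color": "rgb(0,128,0)",  "font-weight": "bold"}),  # title
]

clusters = defaultdict(list)
for i, (path, style) in enumerate(elements):
    clusters[xstring(path, style)].append(i)

print(clusters)  # the two price elements share one cluster; the title is separate
```
      </preformat>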
      <p>(a) An example of three digital cameras (Data Records) in a web page</p>
      <p>(b) A cluster with visually similar price elements</p>
    </sec>
    <sec id="sec-2">
      <title>Figure 1(c): Desired extraction result</title>
      <p>Samsung ES80, $84.95; Fujifilm FinePix T300, $174.95; Vivitar ViviCam F529, $84.95.</p>
    </sec>
    <sec id="sec-3">
      <title />
      <p>In the following we present a brief review of related research. In Sec. 2 we outline
the ClustVX system, in Sec. 3 we present experimental results, and in Sec. 4 we outline
the necessary future research directions and further aspects of experimental
evaluation.</p>
      <sec id="sec-3-1">
        <title>1. Related Work</title>
        <p>Data extraction systems can be broadly divided into supervised and unsupervised
categories. Supervised approaches require manual human effort to derive the
extraction rules, while unsupervised systems extract data fully automatically,
with no manual intervention.</p>
        <p>In this work we focus on the latter, as we believe that only fully automatic systems
can be applied to web-scale data extraction. Our proposed ClustVX system belongs to
this category.</p>
        <p>
          One widely adopted technique to automatically detect and extract DRs is to search
for repetitive patterns in the HTML source code by computing the similarity of HTML tree
nodes. Variations of the simple tree matching algorithm [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] are employed for this task [
          <xref ref-type="bibr" rid="ref3 ref8">3,8</xref>
          ].
However, this technique struggles with structural irregularities amongst
DRs, such as lists inside DRs [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
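        <p>For reference, the simple tree matching algorithm computes the size of a maximum matching between two trees by dynamic programming over the children sequences of matched nodes. A compact sketch, using a minimal hypothetical Node type rather than a real HTML parser:</p>
        <preformat>
```python
class Node:
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, list(children)

def simple_tree_matching(a, b):
    # Roots with different tags cannot match.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # W[i][j]: best matching between the first i children of a and the first j of b.
    W = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            W[i][j] = max(W[i - 1][j], W[i][j - 1],
                          W[i - 1][j - 1]
                          + simple_tree_matching(a.children[i - 1], b.children[j - 1]))
    return W[m][n] + 1  # +1 for the matched roots
```
        </preformat>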
        <p>
          In contrast to the above, the recent VIDE system [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] tries not to tie itself to the HTML tree
at all and instead depends purely on the visual features of a web page. It builds a visual
containment tree of the page using the patented VIPS [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] algorithm and then uses it in place
of the HTML tree. However, if a web page has unloaded images or missing style information,
VIPS may fail to build the correct visual containment tree, which leads to data
extraction problems [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Combining the two previous approaches, the ViNTs [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and DEPTA [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] systems
exploit visual features of web pages to aid the structure-based data extraction process.
However, the ViNTs system does not extract data items, it only segments DRs, and an evaluation of
DEPTA demonstrated that it cannot handle contemporary pages efficiently [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
        <p>
          Systems like TextRunner [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] try to extract entities and their relationships from web
pages using natural language processing and machine learning approaches, but those
techniques usually work on regular text and are not suitable for detecting repetitive
patterns in web pages. By contrast, WebTables [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] extracts entities from structured web
tables enclosed in &lt;table&gt; HTML tags. However, it misses structured data
presented in other HTML tags.
        </p>
        <p>In summary, none of these systems can properly handle both the visual and the structural features
of web pages to effectively extract structured web data. The ClustVX system proposed in
this work fully exploits visual and structural information and achieves promising results.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2. The ClustVX Approach</title>
        <p>ClustVX processes a given web page in the following steps:</p>
        <list list-type="order">
          <list-item>
            <p>Preprocessing the page. The web page is cleaned of all HTML text formatting tags, such as &lt;b&gt; and &lt;em&gt;, which appear in the middle of text and may hinder the clustering process. Visual features of web page elements, acquired from the browser’s API, are embedded into the HTML tags for processing in the next step.</p>
          </list-item>
          <list-item>
            <p>Generating the Xstring representation of each HTML element. Each visible web page element is represented by an Xstring, by which the elements are later clustered. As shown in Fig. 1(b), an Xstring consists of a) the tag names from the element’s XPath and b) the visual features of that element (font style, color, weight, etc.). The structural features (the string of tag names) identify the element’s position in the HTML document; the visual features capture the semantic similarity between web page elements.</p>
          </list-item>
          <list-item>
            <p>Clustering of web page elements. All visible web page elements are clustered according to their Xstrings. The resulting clusters contain only semantically similar web page elements. In Fig. 1(b), at (#1), we see a cluster of price elements.</p>
          </list-item>
          <list-item>
            <p>Extraction of structured data. By analyzing the XPaths of the clustered, visually similar web page elements, extraction rules are induced and the data is extracted. In Fig. 1(b), (#2) is the XPath of the page region where the Data Records are located. Each Data Record is enclosed in a DIV tag (#3). The final path of the price elements inside a Data Record is (#4).</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-3-3">
        <title>3. Research Methodology</title>
        <p>
          To evaluate the ClustVX approach we use the following three publicly available benchmark
datasets, containing a total of 7098 data records: 1) TBDW Ver. 1.02 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], 2) ViNTs
dataset 2 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], 3) M. Alvarez et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. See Tab. 1 for details. These datasets contain
search result pages generated from databases. Following the work of other authors [
          <xref ref-type="bibr" rid="ref11 ref3 ref4 ref5 ref8">8,
11,4,3,5</xref>
          ] in structured data extraction, we use three evaluation metrics from the
information retrieval field: precision, recall, and F-score.
        </p>
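        <p>For reference, with hypothetical counts of correctly extracted records, extracted records, and ground-truth records, the three metrics are computed as:</p>
        <preformat>
```python
def precision_recall_f1(correct, extracted, actual):
    # precision: fraction of extracted records that are correct;
    # recall: fraction of ground-truth records that were extracted.
    precision = correct / extracted
    recall = correct / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical example: 98 of 100 extracted records are correct, page holds 99 records.
print(precision_recall_f1(98, 100, 99))
```
        </preformat>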
        <p>
          The positive preliminary results, showing that ClustVX can achieve an F-score above 0.98,
encourage further development and evaluation of ClustVX on real-world data.
We see a clear need for a data set containing thousands of pages from different web sites.
To create such a large data set we are planning to exploit the power of crowdsourcing
with the help of the Amazon Mechanical Turk service [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The service allows presenting
simple tasks that require human intelligence, such as labeling data or judging whether an extraction was
successful, to thousands of workers, who are paid on a per-hour or per-task
basis.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>4. Conclusions and Research Directions</title>
        <p>In this paper we presented the ClustVX system, which extracts structured data by
exploiting the visual and structural features of web page elements. The preliminary evaluation of
ClustVX on three publicly available benchmark data sets demonstrated that our method
can achieve very high precision and recall. Our future work will
concentrate on creating a new, large benchmark data set, dealing with extremely
malformed HTML source code, and comparing the ClustVX system to competing approaches.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Theobald</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>From information to knowledge: harvesting entities and relationships from web sources</article-title>
          . In:
          <article-title>Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems</article-title>
          ,
          <source>ACM</source>
          (
          <year>2010</year>
          ),
          <fpage>65</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Webtables: exploring the power of tables on the web</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          <volume>1</volume>
          (
          <issue>1</issue>
          ) (
          <year>2008</year>
          ),
          <fpage>538</fpage>
          -
          <lpage>549</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Zhai</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Web data extraction based on partial tree alignment</article-title>
          .
          <source>In: Proceedings of the 14th international conference on World Wide Web</source>
          , ACM (
          <year>2005</year>
          )
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Vide: A vision-based approach for deep web data extraction</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>22</volume>
          (
          <issue>3</issue>
          ) (
          <year>2010</year>
          ),
          <fpage>447</fpage>
          -
          <lpage>460</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Fully automatic wrapper generation for search engines</article-title>
          .
          <source>In: Proceedings of the 14th international conference on World Wide Web, ACM</source>
          (
          <year>2005</year>
          ),
          <fpage>66</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Banko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cafarella</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soderland</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Broadhead</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Etzioni</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Open information extraction for the web</article-title>
          , University of Washington (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Cai</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Vips: a vision-based page segmentation algorithm</article-title>
          .
          <source>Tech. rep., Microsoft Technical Report</source>
          , MSR-TR-
          <volume>2003-79</volume>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Alvarez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raposo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellas</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cacheda</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Extracting lists of data records from semistructured web pages</article-title>
          .
          <source>Data and Knowledge Engineering</source>
          <volume>64</volume>
          (
          <issue>2</issue>
          ) (
          <year>2008</year>
          ),
          <fpage>491</fpage>
          -
          <lpage>509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Yamada</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Craswell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nakatoh</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hirokawa</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Testbed for information extraction from deep web</article-title>
          .
          <source>In: Proceedings of the 13th international World Wide Web conference on Alternate track papers and posters</source>
          ,
          <source>ACM</source>
          (
          <year>2004</year>
          ),
          <fpage>346</fpage>
          -
          <lpage>347</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Identifying syntactic differences between two programs</article-title>
          .
          <source>Software: Practice and Experience</source>
          <volume>21</volume>
          (
          <issue>7</issue>
          ) (
          <year>1991</year>
          ),
          <fpage>739</fpage>
          -
          <lpage>755</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Jindal</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>A generalized tree matching algorithm considering nested lists for web data extraction</article-title>
          .
          <source>In: The SIAM International Conference on Data Mining</source>
          (
          <year>2010</year>
          ),
          <fpage>930</fpage>
          -
          <lpage>941</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rose</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stewart</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Crowdsourcing for relevance evaluation</article-title>
          .
          <source>In: ACM SIGIR Forum, ACM</source>
          , Vol.
          <volume>42</volume>
          (
          <year>2008</year>
          ),
          <fpage>9</fpage>
          -
          <lpage>15</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>