==Extracting Product Data from E-Shops==
https://ceur-ws.org/Vol-1214/40.pdf (CEUR Workshop Proceedings Vol-1214; DBLP: https://dblp.org/rec/conf/itat/GurskyCNVV14)
V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 40–45
http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 P. Gurský, V. Chabal', R. Novotný, M. Vaško, M. Vereščák



                                   Extracting Product Data from E-Shops

                     Peter Gurský, Vladimír Chabal’, Róbert Novotný, Michal Vaško, and Milan Vereščák

                                              Institute of Computer Science
                                             Univerzita Pavla Jozefa Šafárika
                                                         Jesenná 5,
                                                 040 01 Košice, Slovakia
                       {peter.gursky, robert.novotny, michal.vasko, milan.verescak}@upjs.sk,
                                                 v.chabal@gmail.com

Abstract: We present a method for extracting product data from e-shops based on an annotation tool embedded within a web browser. This tool simplifies the automatic detection of data presented in tabular and list form. The annotations serve as a basis for extraction rules for a particular web page, which are subsequently used in the product data extraction method.

1 Introduction and Motivation

Since the beginnings, web pages have served for the presentation of information to human readers. Unfortunately, not even the advent of the semantic web, which has been with us for more than ten years, was able to successfully solve the problem of structured data extraction from web pages. Currently, there are various approaches to web extraction methods for information that was not intended for machine processing.

The scope of the Kapsa.sk project is to retrieve information contained within e-shop products by crawling and extracting data and presenting it in a unified form which simplifies the user's choice of preferred products.

The result of crawling is a set of web pages that contain product details. As a subproblem, the crawler identifies pages that positively contain product details, and ignores other kinds of pages.

A typical e-shop contains various kinds of products. Our goal is to retrieve as much structured data about a product as possible. More specifically, this means retrieving their properties or attributes, including their values. We have observed that each kind of product, called a domain, has a different set of attributes. For example, the domain of television sets has attributes such as display size or refresh rate. On the other hand, these attributes will not appear in the domains of washing machines or bicycles. However, certain attributes are common to all domains, such as product name, price or quantity in stock. We call such attributes domain independent. Often, the names of domain-independent attributes are implicit or omitted in the HTML code of a web page (price being the most notorious example).

Since the number of product domains can be fairly large (tens, even hundreds), we have developed an extraction system in which it is not necessary to annotate each product domain separately. In this paper, we present a method which extracts product data from a particular e-shop and requires the annotation of just a single page. Furthermore, many annotation aspects are automatized within this process. The whole annotation proceeds within a web browser.

2 State of the Art

The area of web extraction systems is well-researched. There are many surveys and comparisons of the existing systems [1, 2, 3, 4]. The actual code that extracts relevant data from a web page and outputs it in a structured form is traditionally called a wrapper [1]. Wrappers can be classified according to the process of creation and method of use into the following categories:

   • manually constructed systems of web information extraction

   • automatically constructed systems requiring user assistance

   • automatically constructed systems with partial user assistance

   • fully automatized systems without user assistance

2.1 Manually Constructed Web Information Extraction Systems

Manually constructed systems generally require the use of a programming language or define a domain-specific language (DSL). Wrapper construction is then equivalent to wrapper programming. The main advantage lies in the easy customization for different domains, while the obvious drawback is the required programming skill (which may be made easier by the lesser complexity of a particular DSL). The well-known systems are MINERVA [5], TSIMMIS [6] and WEBOQL [7]. The OXPATH [20] language is a more recent extension of the XPath language specifically targeted at information extraction, crawling and web browser automation. It is possible to fill forms, follow hyperlinks and create iterative rules.


2.2 Automatically Constructed Web Extraction Systems Requiring User Assistance

These systems are based on various methods for automatic wrapper generation (also known as wrapper induction), mostly using machine learning. This approach usually requires an input set of manually annotated examples (i.e. web pages), from which additional annotated pages are automatically induced. A wrapper is created according to the presented pages. Such approaches do not require any programming skills. Very often, the actual annotation is realized within a GUI. On the other hand, the annotation process can be heavily domain-dependent and web-page-dependent, and may be very demanding. Tools in this category include WIEN [8], SOFTMEALY [9] and STALKER [9].

2.3 Automatically Constructed Web Extraction Systems with Partial User Assistance

These tools use automated wrapper generation methods. They tend to be more automated, and do not require users to fully annotate sample web pages. Instead, they work well with partial or incomplete pages. One approach is to induce wrappers from these samples. User assistance is required only during the actual extraction rule creation process. The most well-known tools are IEPAD [11], OLERA [12] and THRESHER [13].

2.4 Fully Automatized Systems Without User Assistance

A typical tool in this group aims to fully automate the extraction process with no or minimal user assistance. It searches for repeating patterns and structures within a web page or data records. Such structures are then used as a basis for a wrapper. Usually, they are designed for web pages with a fixed template format. This means that extracted information needs to be refined or further processed. Example tools in this category are ROADRUNNER [14], EXALG [15] or the approach used by Maruščák et al. [16].

Figure 1: Extracting data from template-based pages

3 Web Extraction System within the Kapsa.sk Project

Our design focuses on an automatically constructed web information extraction system with partial user assistance. We have designed an annotation tool, which is used to annotate the relevant product attributes occurring on a sample page from a single e-shop. Each annotated product attribute corresponds to an element within the HTML tree structure of the product page, and can be uniquely addressed by an XPath expression optionally enriched with regular expressions.

We have observed that many e-shops generate product pages from a server-side template engine. This means that in many cases, the XPath expressions that address relevant product attributes remain the same. Generally, this allows us to annotate the data only once, on a suitable web page (see Figure 1). To ease the effort of annotation, we discover the repeating data regions with the modified MDR algorithm [18] described in Section 5.1.

The result of the annotation process is an extractor (corresponding to the notion of a wrapper) represented as a set of extraction rules. In the implementation, we represent these rules in JSON, thus making them independent from the annotation tool (see Section 4 for more information).

This way, we are able to enrich the manual annotation approach with a certain degree of automation. Further improvements on ideas from other solutions are based on addressing HTML elements with product data not only with XPath (an approach used in OXPATH [20]), but also with regular expressions. It is known that some product attributes may occur in a single HTML element in a semi-structured form (for example, as a comma-delimited list). Since XPath expressions are unable to address such non-atomic values, we use regular expressions to reach below this level of coarseness. Although a similar approach is used in W4F [21], we have built upon similar ideas and present them in our web-browser-based annotation tool. Furthermore, we allow the use of the modified MDR algorithm to detect the repeating regions.

4 Extractors – The Fundament of Annotation

4.1 Extraction Rules

In the first step of annotation, an extractor is constructed. It is composed of one or multiple extraction rules, each corresponding to an object attribute. All extraction rules have two common properties:

 1. They address a single HTML element on a web page that contains the extracted value. The addressing is represented by an XPath expression.

 2. The default representation of the extraction rule in both the annotation and extraction tools is JSON.
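As a minimal sketch of these two properties, the following Python snippet evaluates a single hypothetical rule: an XPath addresses the element, and an optional regular expression (as motivated in Section 3) splits a non-atomic, comma-delimited value. The rule keys mirror the JSON style of Figure 2, but the exact schema here is illustrative, not the paper's implementation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule in the spirit of the paper's JSON representation:
# "xpath" addresses one element, "regex" optionally refines below
# element granularity (e.g. a comma-delimited list inside one element).
rule = {
    "type": "label-and-manual-value",
    "xpath": ".//span",
    "label": "Colors",
    "regex": r"\s*,\s*",
}

page = ET.fromstring("<div><span>red, green, blue</span></div>")

element = page.find(rule["xpath"])      # XPath-like addressing of the element
raw = element.text
# A plain XPath stops at the element; the regex splits its text further.
values = re.split(rule["regex"], raw) if "regex" in rule else [raw]
print(rule["label"], values)            # Colors ['red', 'green', 'blue']
```

Note that Python's `ElementTree` supports only a subset of XPath; a production extractor would use a full XPath 1.0 engine against the parsed HTML tree.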


 {
      "site": "http://www.kaktusbike.sk/terrano-lady-12097",
      "extractor": {
        "type": "complex",
        "items": [
          {
            /* -- rule with a fixed attribute name -- */
            "type": "label-and-manual-value",
            "xpath": "//*[contains(@class,\"name\")]/h1",
            "label": "Name"
          },
          {
            /* -- list (table rows) -- */
            "type": "list",
            "xpath": "//*[contains(@class,\"columns\")]/table/tbody/tr",
            "label": "N/A",
            "items": [
              {
                "type": "label-and-value",
                "labelXPath": "td[1]",
                "xpath": "td[2]"
              }]}]}}


                            Figure 2: Defining an extractor along with various extraction rules
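To make the semantics of such an extractor concrete, here is a small Python sketch that interprets a simplified version of Figure 2's rule set. It is illustrative only, not the Kapsa.sk implementation: the XPaths are reduced to what Python's `ElementTree` supports, and the sample HTML is invented.

```python
import xml.etree.ElementTree as ET

# Invented product-page fragment with a name and an attribute table.
HTML = """
<div>
  <h1>Terrano Lady</h1>
  <table><tbody>
    <tr><td>Frame</td><td>17 in</td></tr>
    <tr><td>Wheels</td><td>26 in</td></tr>
  </tbody></table>
</div>
"""

# Simplified extractor mirroring Figure 2: a complex rule nesting a
# fixed-name rule and a list of label-and-value rules. All XPaths are
# relative to the current context node.
extractor = {
    "type": "complex",
    "items": [
        {"type": "label-and-manual-value", "xpath": ".//h1", "label": "Name"},
        {"type": "list", "xpath": ".//table/tbody/tr",
         "items": [{"type": "label-and-value",
                    "labelXPath": "td[1]", "xpath": "td[2]"}]},
    ],
}

def apply_rule(rule, context):
    """Recursively evaluate a rule against a context element."""
    kind = rule["type"]
    if kind == "complex":               # composition: merge nested results
        out = {}
        for item in rule["items"]:
            out.update(apply_rule(item, context))
        return out
    if kind == "list":                  # each matched subtree is a new context
        out = {}
        for node in context.findall(rule["xpath"]):
            for item in rule["items"]:
                out.update(apply_rule(item, node))
        return out
    if kind == "label-and-manual-value":    # value with fixed name
        return {rule["label"]: context.find(rule["xpath"]).text}
    if kind == "label-and-value":           # value with extracted name
        return {context.find(rule["labelXPath"]).text:
                context.find(rule["xpath"]).text}
    raise ValueError("unknown rule type: " + kind)

print(apply_rule(extractor, ET.fromstring(HTML)))
# {'Name': 'Terrano Lady', 'Frame': '17 in', 'Wheels': '26 in'}
```

Because the label-and-value rule reads both the attribute name and its value from each row, the same extractor works regardless of which attributes a given product page lists, which is the permutation/omission tolerance discussed in Section 4.2.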


4.2 Types of Extraction Rules

The value with fixed name rule is used to extract one (atomic) value along with a predefined attribute name. Usually, this attribute is specified via the graphical user interface of the annotation tool. Alternatively, it is specified as the name of a well-known domain-independent attribute. See Figure 2, rule label-and-manual-value, for an example.

The value with extracted name rule expands upon the previous rule. The name of the extracted attribute is defined by an additional XPath expression that corresponds to an HTML element containing the attribute name (e.g. the string Price). The example in Figure 2 uses the label-and-value extraction rule.

The complex rule is a composition (nesting) of other rules. This allows us to define an extractor for multiple values, usually corresponding to the attributes of a particular product. Whenever the complex rule contains an XPath expression (addressing a single element), all nested rules use this element as a context node. In other words, nested rules can specify their XPath expressions relative to this element. The example uses the extraction rule declared as complex. Usually, a complex rule is the top-level rule in an extractor.

The list rule is used to extract multiple values with a common ancestor addressed by an XPath expression. This expression then corresponds to multiple HTML subtrees. This rule must contain one or more nested extraction rules. A typical use is to extract cells in table rows (by nesting a rule for an extracted name). In the example, we use the extraction rule declared as a list.

An extractor defined by rule composition (i.e. with the complex rule) is specifically suited for data extraction not only from a particular web page (as implemented in the user interface, see Section 6), but also from any other product pages of a particular e-shop. In this case, no additional cooperation with the annotation tool is required.

The annotation of domain-independent values is usually realized with the value with fixed name rule, since the attribute name is not explicitly available within the HTML source of the web page.

Domain-dependent attributes (which are more frequent than the domain-independent ones) usually occur in a visually structured "tabular" form. The annotation automatization process described in the next section allows us to infer a list rule along with a nested value-with-extracted-name rule. This combination of rules is sufficient to extract product data from multiple product detail web pages. Furthermore, this particular combination supports attribute permutation or variation. Therefore, we can successfully identify and extract attributes that are swapped, or even omitted, on some web pages. This feature allows us to create wrappers that are suitable for all product domains of an e-shop.

Moreover, this set of extraction rules may be further expanded. We may specify additional rules that support regular expressions along with the XPath, or we may possibly support the extraction of attribute values from web page metadata, e.g. the product identifier specified within the URL of the web page.
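The URL-metadata idea could be sketched as one such additional rule type. The rule shape and its name below are hypothetical; the paper does not specify this format, and the URL is the sample address from Figure 2.

```python
import re

# Hypothetical "url-metadata" rule: a regular expression with one
# capture group applied to the page URL instead of the HTML tree.
rule = {"type": "url-metadata",
        "label": "ProductId",
        "regex": r"-(\d+)$"}       # trailing numeric identifier

url = "http://www.kaktusbike.sk/terrano-lady-12097"
match = re.search(rule["regex"], url)
print({rule["label"]: match.group(1)})   # {'ProductId': '12097'}
```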


5 Automatizing the Annotation Process

As we have mentioned in the previous section, we aim to make the annotation process easier and quicker. A product page often uses either tabular or list forms, which visually clarify complex information about many product proper-

Supporting shallow trees. The MDR algorithm has limited use for shallow tree data regions. (The original authors state a minimal limit of four layers.) However, attributes or user comments very often occur in such shallow trees. For example, a user comment occurring in element