==Extracting Product Data from E-Shops==
https://ceur-ws.org/Vol-1214/40.pdf (CEUR Workshop Proceedings Vol-1214; DBLP: https://dblp.org/rec/conf/itat/GurskyCNVV14)
V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 40–45
http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 P. Gurský, V. Chabal', R. Novotný, M. Vaško, M. Vereščák



                                   Extracting Product Data from E-Shops

                     Peter Gurský, Vladimír Chabal’, Róbert Novotný, Michal Vaško, and Milan Vereščák

                                              Institute of Computer Science
                                             Univerzita Pavla Jozefa Šafárika
                                                         Jesenná 5,
                                                 040 01 Košice, Slovakia
                       {peter.gursky, robert.novotny, michal.vasko, milan.verescak}@upjs.sk,
                                                 v.chabal@gmail.com

Abstract: We present a method for extracting product data from e-shops based on an annotation tool embedded within a web browser. This tool simplifies the automatic detection of data presented in tabular and list form. The annotations serve as a basis for extraction rules for a particular web page, which are subsequently used in the product data extraction method.

1 Introduction and Motivation

Since the beginnings, web pages have served for the presentation of information to human readers. Unfortunately, not even the advent of the semantic web, which has been with us for more than ten years, was able to successfully solve the problem of structured data extraction from web pages. Currently, there are various approaches to web extraction methods for information that was not intended for machine processing.

The scope of the Kapsa.sk project is to retrieve information contained within e-shop products by crawling and extracting data and presenting it in a unified form which simplifies the user's choice of preferred products.

The result of crawling is a set of web pages that contain product details. As a subproblem, the crawler identifies pages that positively contain product details, and ignores other kinds of pages.

A typical e-shop contains various kinds of products. Our goal is to retrieve as much structured data about a product as possible. More specifically, this means retrieving their properties or attributes, including their values. We have observed that each kind of product, called a domain, has a different set of attributes. For example, the domain of television sets has attributes such as display size or refresh rate. On the other hand, these attributes will not appear in the domains of washing machines or bicycles. However, certain attributes are common to all domains, such as product name, price or quantity in stock. We call such attributes domain independent. Often, the names of domain-independent attributes are implicit or omitted in the HTML code of a web page (price being the most notorious example).

Since the number of product domains can be fairly large (tens, even hundreds), we have developed an extraction system in which it is not necessary to annotate each product domain separately. In this paper, we present a method which extracts product data from a particular e-shop and requires the annotation of just a single page. Furthermore, many annotation aspects are automatized within this process. The whole annotation proceeds within a web browser.

2 State of the Art

The area of web extraction systems is well-researched. There are many surveys and comparisons of the existing systems [1, 2, 3, 4]. The actual code that extracts relevant data from a web page and outputs it in a structured form is traditionally called a wrapper [1]. Wrappers can be classified according to the process of creation and method of use into the following categories:

   • manually constructed systems of web information extraction

   • automatically constructed systems requiring user assistance

   • automatically constructed systems with partial user assistance

   • fully automatized systems without user assistance

2.1 Manually Constructed Web Information Extraction Systems

Manually constructed systems generally require the use of a programming language or define a domain-specific language (DSL). Wrapper construction is then equivalent to wrapper programming. The main advantage lies in the easy customization for different domains, while the obvious drawback is the required programming skill (which may be made easier by the lesser complexity of a particular DSL). The well-known systems are MINERVA [5], TSIMMIS [6] and WEBOQL [7]. The OXPATH [20] language is a more recent extension of the XPath language specifically targeted at information extraction, crawling and web browser automation. It is possible to fill forms, follow hyperlinks and create iterative rules.


2.2 Automatically Constructed Web Extraction Systems Requiring User Assistance

These systems are based on various methods for automatic wrapper generation (also known as wrapper induction), mostly using machine learning. This approach usually requires an input set of manually annotated examples (i.e. web pages), from which additional annotated pages are automatically induced. A wrapper is created according to the presented pages. Such approaches do not require any programming skills. Very often, the actual annotation is realized within a GUI. On the other hand, the annotation process can be heavily domain-dependent and web-page-dependent, and may be very demanding. Tools in this category include WIEN [8], SOFTMEALY [9] and STALKER [9].

2.3 Automatically Constructed Web Extraction Systems with Partial User Assistance

These tools use automated wrapper generation methods. They tend to be more automated, and do not require users to fully annotate sample web pages. Instead, they work well with partial or incomplete pages. One approach is to induce wrappers from these samples. User assistance is required only during the actual extraction rule creation process. The most well-known tools are IEPAD [11], OLERA [12] and THRESHER [13].

2.4 Fully Automatized Systems Without User Assistance

A typical tool in this group aims to fully automate the extraction process with no or minimal user assistance. It searches for repeating patterns and structures within a web page or data records. Such structures are then used as a basis for a wrapper. Usually, they are designed for web pages with a fixed template format. This means that extracted information needs to be refined or further processed. Example tools in this category are ROADRUNNER [14], EXALG [15] or the approach used by Maruščák et al. [16].

Figure 1: Extracting data from template-based pages

3 Web Extraction System within the Kapsa.sk Project

Our design focuses on an automatically constructed web information extraction system with partial user assistance. We have designed an annotation tool, which is used to annotate the relevant product attributes occurring on a sample page from a single e-shop. Each annotated product attribute corresponds to an element within the HTML tree structure of the product page, and can be uniquely addressed by an XPath expression optionally enriched with regular expressions.

We have observed that many e-shops generate product pages from a server-side template engine. This means that in many cases, the XPath expressions that address relevant product attributes remain the same. Generally, this allows us to annotate the data only once, on a suitable web page (see Figure 1). To ease the effort of annotation, we discover the repeating data regions with the modified MDR algorithm [18] described in Section 5.1.

The result of the annotation process is an extractor (corresponding to the notion of a wrapper) represented as a set of extraction rules. In the implementation, we represent these rules in JSON, thus making them independent from the annotation tool (see Section 4 for more information).

This way, we are able to enrich the manual annotation approach with a certain degree of automation. Further improvements on ideas from other solutions are based on addressing HTML elements with product data not only with XPath (an approach used in OXPATH [20]), but also with regular expressions. It is known that some product attributes may occur in a single HTML element in a semi-structured form (for example, as a comma-delimited list). Since XPath expressions are unable to address such non-atomic values, we use regular expressions to reach below this level of coarseness. Although a similar approach is used in W4F [21], we have built upon similar ideas and present them in our web-browser-based annotation tool. Furthermore, we allow the use of the modified MDR algorithm to detect the repeating regions.

4 Extractors – The Fundament of Annotation

4.1 Extraction Rules

In the first step of annotation, an extractor is constructed. It is composed of one or multiple extraction rules, each corresponding to an object attribute. All extraction rules have two common properties:

 1. They address a single HTML element on a web page that contains the extracted value. The addressing is represented by an XPath expression.

 2. The default representation of the extraction rule in both the annotation and extraction tools is JSON.
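As a minimal sketch of these two properties, the following Python snippet evaluates a single hypothetical rule: an XPath addresses the element, and an optional regular expression (as motivated in Section 3) splits a non-atomic, comma-delimited value. The rule keys mirror the JSON style of Figure 2, but the exact schema here is illustrative, not the paper's implementation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule in the spirit of the paper's JSON representation:
# "xpath" addresses one element, "regex" optionally refines below
# element granularity (e.g. a comma-delimited list inside one element).
rule = {
    "type": "label-and-manual-value",
    "xpath": ".//span",
    "label": "Colors",
    "regex": r"\s*,\s*",
}

page = ET.fromstring("<div><span>red, green, blue</span></div>")

element = page.find(rule["xpath"])      # XPath-like addressing of the element
raw = element.text
# A plain XPath stops at the element; the regex splits its text further.
values = re.split(rule["regex"], raw) if "regex" in rule else [raw]
print(rule["label"], values)            # Colors ['red', 'green', 'blue']
```

Note that Python's `ElementTree` supports only a subset of XPath; a production extractor would use a full XPath 1.0 engine against the parsed HTML tree.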


 {
      "site": "http://www.kaktusbike.sk/terrano-lady-12097",
      "extractor": {
        "type": "complex",
        "items": [
          {
            /* -- rule with a fixed attribute name -- */
            "type": "label-and-manual-value",
            "xpath": "//*[contains(@class,\"name\")]/h1",
            "label": "Name"
          },
          {
            /* -- list (table rows) -- */
            "type": "list",
            "xpath": "//*[contains(@class,\"columns\")]/table/tbody/tr",
            "label": "N/A",
            "items": [
              {
                "type": "label-and-value",
                "labelXPath": "td[1]",
                "xpath": "td[2]"
              }]}]}}


                            Figure 2: Defining an extractor along with various extraction rules
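To make the semantics of such an extractor concrete, here is a small Python sketch that interprets a simplified version of Figure 2's rule set. It is illustrative only, not the Kapsa.sk implementation: the XPaths are reduced to what Python's `ElementTree` supports, and the sample HTML is invented.

```python
import xml.etree.ElementTree as ET

# Invented product-page fragment with a name and an attribute table.
HTML = """
<div>
  <h1>Terrano Lady</h1>
  <table><tbody>
    <tr><td>Frame</td><td>17 in</td></tr>
    <tr><td>Wheels</td><td>26 in</td></tr>
  </tbody></table>
</div>
"""

# Simplified extractor mirroring Figure 2: a complex rule nesting a
# fixed-name rule and a list of label-and-value rules. All XPaths are
# relative to the current context node.
extractor = {
    "type": "complex",
    "items": [
        {"type": "label-and-manual-value", "xpath": ".//h1", "label": "Name"},
        {"type": "list", "xpath": ".//table/tbody/tr",
         "items": [{"type": "label-and-value",
                    "labelXPath": "td[1]", "xpath": "td[2]"}]},
    ],
}

def apply_rule(rule, context):
    """Recursively evaluate a rule against a context element."""
    kind = rule["type"]
    if kind == "complex":               # composition: merge nested results
        out = {}
        for item in rule["items"]:
            out.update(apply_rule(item, context))
        return out
    if kind == "list":                  # each matched subtree is a new context
        out = {}
        for node in context.findall(rule["xpath"]):
            for item in rule["items"]:
                out.update(apply_rule(item, node))
        return out
    if kind == "label-and-manual-value":    # value with fixed name
        return {rule["label"]: context.find(rule["xpath"]).text}
    if kind == "label-and-value":           # value with extracted name
        return {context.find(rule["labelXPath"]).text:
                context.find(rule["xpath"]).text}
    raise ValueError("unknown rule type: " + kind)

print(apply_rule(extractor, ET.fromstring(HTML)))
# {'Name': 'Terrano Lady', 'Frame': '17 in', 'Wheels': '26 in'}
```

Because the label-and-value rule reads both the attribute name and its value from each row, the same extractor works regardless of which attributes a given product page lists, which is the permutation/omission tolerance discussed in Section 4.2.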


4.2 Types of Extraction Rules

The value with fixed name rule is used to extract one (atomic) value along with a predefined attribute name. Usually, this attribute is specified via the graphical user interface of the annotation tool. Alternatively, it is specified as the name of a well-known domain-independent attribute. See Figure 2, rule label-and-manual-value, for an example.

The value with extracted name rule expands upon the previous rule. The name of the extracted attribute is defined by an additional XPath expression that corresponds to an HTML element containing the attribute name (e.g. the string Price). The example in Figure 2 uses the label-and-value extraction rule.

The complex rule is a composition (nesting) of other rules. This allows us to define an extractor for multiple values, usually corresponding to the attributes of a particular product. Whenever the complex rule contains an XPath expression (addressing a single element), all nested rules use this element as a context node. In other words, nested rules can specify their XPath expressions relative to this element. The example uses the extraction rule declared as complex. Usually, a complex rule is the top-level rule in an extractor.

The list rule is used to extract multiple values with a common ancestor addressed by an XPath expression. This expression then corresponds to multiple HTML subtrees. This rule must contain one or more nested extraction rules. A typical use is to extract cells in table rows (by nesting a rule for an extracted name). In the example, we use the extraction rule declared as a list.

An extractor defined by rule composition (i.e. with the complex rule) is specifically suited for data extraction not only from a particular web page (as implemented in the user interface, see Section 6), but also from any other product pages of a particular e-shop. In this case, no additional cooperation with the annotation tool is required.

The annotation of domain-independent values is usually realized with the value with fixed name rule, since the attribute name is not explicitly available within the HTML source of the web page.

Domain-dependent attributes (which are more frequent than the domain-independent ones) usually occur in a visually structured "tabular" form. The annotation automatization process described in the next section allows us to infer a list rule along with a nested value-with-extracted-name rule. This combination of rules is sufficient to extract product data from multiple product detail web pages. Furthermore, this particular combination supports attribute permutation or variation. Therefore, we can successfully identify and extract attributes that are swapped, or even omitted, on some web pages. This feature allows us to create wrappers that are suitable for all product domains of an e-shop.

Moreover, this set of extraction rules may be further expanded. We may specify additional rules that support regular expressions along with the XPath, or we may possibly support the extraction of attribute values from web page metadata, e.g. the product identifier specified within the URL of the web page.
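The URL-metadata idea could be sketched as one such additional rule type. The rule shape and its name below are hypothetical; the paper does not specify this format, and the URL is the sample address from Figure 2.

```python
import re

# Hypothetical "url-metadata" rule: a regular expression with one
# capture group applied to the page URL instead of the HTML tree.
rule = {"type": "url-metadata",
        "label": "ProductId",
        "regex": r"-(\d+)$"}       # trailing numeric identifier

url = "http://www.kaktusbike.sk/terrano-lady-12097"
match = re.search(rule["regex"], url)
print({rule["label"]: match.group(1)})   # {'ProductId': '12097'}
```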


5 Automatizing the Annotation Process

As we have mentioned in the previous section, we aim to make the annotation process easier and quicker. A product page often uses either tabular or list forms, which visually clarify complex information about many product proper-

Supporting shallow trees. The MDR algorithm has limited use for shallow tree data regions. (The original authors state a minimal limit of four layers.) However, attributes or user comments very often occur in such shallow trees. For example, a user comment occurring in element