V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 40–45, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 P. Gurský, V. Chabal’, R. Novotný, M. Vaško, M. Vereščák

Extracting Product Data from E-Shops

Peter Gurský, Vladimír Chabal’, Róbert Novotný, Michal Vaško, and Milan Vereščák

Institute of Computer Science, Univerzita Pavla Jozefa Šafárika, Jesenná 5, 040 01 Košice, Slovakia
{peter.gursky, robert.novotny, michal.vasko, milan.verescak}@upjs.sk, v.chabal@gmail.com

Abstract: We present a method for extracting product data from e-shops, based on an annotation tool embedded within a web browser. The tool simplifies the automatic detection of data presented in tabular and list form. The annotations serve as a basis for extraction rules for a particular web page, which are subsequently used in the product data extraction method.

1 Introduction and Motivation

Since their beginnings, web pages have served for the presentation of information to human readers. Unfortunately, not even the advent of the semantic web, which has been with us for more than ten years, was able to successfully solve the problem of extracting structured data from web pages. Currently, there are various approaches to the extraction of information that was not intended for machine processing.

The scope of the Kapsa.sk project is to retrieve the information contained within e-shop products by crawling and extracting data, and to present it in a unified form which simplifies the user's choice of preferred products.

The result of crawling is a set of web pages that contain product details. As a subproblem, the crawler identifies pages that actually contain product details and ignores other kinds of pages.

A typical e-shop contains various kinds of products. Our goal is to retrieve as much structured data about each product as possible. More specifically, this means retrieving their properties or attributes, including their values. We have observed that each kind of product, called a domain, has a different set of attributes. For example, the domain of television sets has attributes such as display size or refresh rate. On the other hand, these attributes will not appear in the domains of washing machines or bicycles. However, there are certain attributes which are common to all domains, such as product name, price, or quantity in stock. We call such attributes domain-independent. Often, the names of domain-independent attributes are implicit or omitted in the HTML code of a web page (price being the most notorious example).

Since the number of product domains can be fairly large (tens, even hundreds), we have developed an extraction system in which it is not necessary to annotate each product domain separately. In this paper, we present a method which extracts product data from a particular e-shop and requires the annotation of just a single page. Furthermore, many annotation aspects are automatized within this process. The whole annotation proceeds within a web browser.

2 State of the Art

The area of web extraction systems is well researched. There are many surveys and comparisons of the existing systems [1, 2, 3, 4]. The actual code that extracts relevant data from a web page and outputs it in a structured form is traditionally called a wrapper [1]. Wrappers can be classified according to the process of creation and method of use into the following categories:

• manually constructed web information extraction systems

• automatically constructed systems requiring user assistance

• automatically constructed systems with partial user assistance

• fully automatized systems without user assistance

2.1 Manually Constructed Web Information Extraction Systems

Manually constructed systems generally require the use of a programming language or define a domain-specific language (DSL). Wrapper construction is then equivalent to wrapper programming. The main advantage lies in the easy customization for different domains, while the obvious drawback is the required programming skill (which may be eased by the lesser complexity of a particular DSL). Well-known systems are MINERVA [5], TSIMMIS [6] and WebOQL [7]. The OXPath [20] language is a more recent extension of the XPath language specifically targeted at information extraction, crawling and web browser automation. It makes it possible to fill in forms, follow hyperlinks and create iterative rules.

2.2 Automatically Constructed Web Extraction Systems Requiring User Assistance

These systems are based on various methods for automatic wrapper generation (also known as wrapper induction), mostly using machine learning. This approach usually requires an input set of manually annotated examples (i.e., web pages), from which additional annotated pages are automatically induced. A wrapper is created according to the presented pages. Such approaches do not require any programming skills. Very often, the actual annotation is realized within a GUI. On the other hand, the annotation process can be heavily domain- and web-page-dependent, and may be very demanding. Tools in this category include WIEN [8], SoftMealy [9] and Stalker [9].

2.3 Automatically Constructed Web Extraction Systems with Partial User Assistance

These tools use automated wrapper generation methods. They tend to be more automated, and do not require users to fully annotate sample web pages. Instead, they work well with partial or incomplete pages. One approach is to induce wrappers from these samples. User assistance is required only during the actual extraction rule creation process. The most well-known tools are IEPAD [11], OLERA [12] and Thresher [13].

2.4 Fully Automatized Systems Without User Assistance

A typical tool in this group aims to fully automate the extraction process with no or minimal user assistance. It searches for repeating patterns and structures within a web page or data records. Such structures are then used as a basis for a wrapper. Usually, these tools are designed for web pages with a fixed template format. This means that the extracted information needs to be refined or further processed. Example tools in this category are RoadRunner [14], ExAlg [15] or the approach used by Maruščák et al. [16].

3 Web Extraction System within the Kapsa.sk Project

Our design focuses on an automatically constructed web information extraction system with partial user assistance. We have designed an annotation tool which is used to annotate the relevant product attributes occurring on a sample page from a single e-shop. Each annotated product attribute corresponds to an element within the HTML tree structure of the product page, and can be uniquely addressed by an XPath expression, optionally enriched with regular expressions.

We have observed that many e-shops generate product pages with a server-side template engine. This means that in many cases, the XPath expressions that address the relevant product attributes remain the same. Generally, this allows us to annotate the data only once, on a suitable web page (see Figure 1). To ease the annotation effort, we discover the repeating data regions with the modified MDR algorithm [18] described in section 5.1.

Figure 1: Extracting data from template-based pages

The result of the annotation process is an extractor (corresponding to the notion of a wrapper) represented as a set of extraction rules. In the implementation, we represent these rules in JSON, thus making them independent from the annotation tool (see section 4 for more information).

This way, we are able to enrich the manual annotation approach with a certain degree of automation. Further improvements on ideas from other solutions are based on addressing HTML elements containing product data not only with XPath (an approach used in OXPath [20]), but also with regular expressions. It is known that some product attributes may occur in a single HTML element in a semi-structured form (for example, as a comma-delimited list). Since XPath expressions are unable to address such non-atomic values, we use regular expressions to reach below this level of coarseness. A similar approach is used in W4F [21]; we have built upon these ideas and present them in our web-browser-based annotation tool. Furthermore, we use the modified MDR algorithm to detect the repeating regions.

4 Extractors – The Fundament of Annotation

4.1 Extraction Rules

In the first step of annotation, an extractor is constructed. It is composed of one or multiple extraction rules, each corresponding to an object attribute. All extraction rules have two common properties:

1. They address a single HTML element on a web page that contains the extracted value. The addressing is represented by an XPath expression.

2. The default representation of the extraction rule in both the annotation and extraction tools is JSON.
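To illustrate how such JSON-encoded rules might be evaluated, the following sketch applies a simplified list rule with a nested label-and-value rule to a page fragment. This is an illustrative sketch only, not the authors' implementation: the rule schema is a hypothetical simplification, the page fragment is invented, and Python's `xml.etree.ElementTree` with its limited XPath subset stands in for a full HTML parser and XPath 1.0 engine.

```python
import xml.etree.ElementTree as ET

# A well-formed (XHTML-like) product page fragment; a real e-shop page would
# be parsed with a proper HTML parser and a full XPath engine.
PAGE = """
<div>
  <h1>Terrano Lady</h1>
  <table><tbody>
    <tr><td>Frame</td><td>17"</td></tr>
    <tr><td>Wheels</td><td>26"</td></tr>
  </tbody></table>
</div>
"""

# Hypothetical simplified rule: a "list" rule addresses repeated subtrees,
# and a nested "label-and-value" rule extracts an attribute name and value
# relative to each matched subtree (which serves as the context node).
RULE = {
    "type": "list",
    "xpath": ".//tbody/tr",
    "items": [
        {"type": "label-and-value", "labelXPath": "td[1]", "xpath": "td[2]"},
    ],
}

def apply_list_rule(root, rule):
    """Evaluate a list rule: every matched subtree yields label/value pairs."""
    records = []
    for subtree in root.findall(rule["xpath"]):  # context node for nested rules
        for nested in rule["items"]:
            label = subtree.find(nested["labelXPath"]).text
            value = subtree.find(nested["xpath"]).text
            records.append((label, value))
    return records

print(apply_list_rule(ET.fromstring(PAGE), RULE))
# [('Frame', '17"'), ('Wheels', '26"')]
```

A full extractor would additionally need the regular-expression layer described in section 3 to split non-atomic values such as comma-delimited lists.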
{
  "site": "http://www.kaktusbike.sk/terrano-lady-12097",
  "extractor": {
    "type": "complex",
    "items": [
      { /* -- rule with a fixed attribute name -- */
        "type": "label-and-manual-value",
        "xpath": "//*[contains(@class,\"name\")]/h1",
        "label": "Name"
      },
      { /* -- list (table rows) -- */
        "type": "list",
        "xpath": "//*[contains(@class,\"columns\")]/table/tbody/tr",
        "label": "N/A",
        "items": [
          {
            "type": "label-and-value",
            "labelXPath": "td[1]",
            "xpath": "td[2]"
          }
        ]
      }
    ]
  }
}

Figure 2: Defining an extractor along with various extraction rules

4.2 Types of Extraction Rules

The value with fixed name rule is used to extract one (atomic) value along with a predefined attribute name. Usually, this attribute is specified via the graphical user interface of the annotation tool. Alternatively, it is specified as the name of a well-known domain-independent attribute. See Figure 2, rule label-and-manual-value, for an example.

The value with extracted name rule expands upon the previous rule. The name of the extracted attribute is defined by an additional XPath expression that corresponds to an HTML element containing the attribute name (e.g. the string Price). The example in Figure 2 uses the label-and-value extraction rule.

The complex rule is a composition (nesting) of other rules. This allows us to define an extractor for multiple values, usually corresponding to the attributes of a particular product. Whenever the complex rule contains an XPath expression (addressing a single element), all nested rules use this element as a context node. In other words, nested rules can specify their XPath expressions relative to this element. The example uses the extraction rule declared as complex. Usually, a complex rule is the top-level rule in an extractor.

The list rule is used to extract multiple values with a common ancestor addressed by an XPath expression. This expression then corresponds to multiple HTML subtrees. This rule must contain one or more nested extraction rules. A typical use is to extract cells in table rows (by nesting a rule for an extracted name). In the example, we use the extraction rule declared as a list.

An extractor defined by rule composition (i.e., with the complex rule) is specifically suited for data extraction not only from a particular web page (as implemented in the user interface, see section 6), but also for any other product pages of a particular e-shop. In this case, no additional cooperation with the annotation tool is required.

The annotation of domain-independent values is usually realized with the value with fixed name rule, since the attribute name is not explicitly available within the HTML source of the web page.

Domain-dependent attributes (which are more frequent than the domain-independent ones) usually occur in a visually structured "tabular" form. The annotation automatization process described in the next section allows us to infer a list rule along with a nested value-with-extracted-name rule. This combination of rules is sufficient to extract product data from multiple product detail pages. Furthermore, this particular combination supports attribute permutation or variation. Therefore, we can successfully identify and extract attributes that are swapped, or even omitted, on some web pages. This feature allows us to create wrappers that are suitable for all product domains of an e-shop.

Moreover, this set of extraction rules may be further expanded. We may specify additional rules that support regular expressions along with the XPath, or we may support the extraction of attribute values from web page metadata, e.g. the product identifier specified within the URL of the web page.

5 Automatizing Annotation Process

As we have mentioned in the previous section, we aim to make the annotation process easier and quicker. A product page often uses either tabular or list forms, which visually clarify complex information about many product properties.

Supporting shallow trees. The MDR algorithm has a limited use for shallow tree data regions. (The original authors state a minimal limit of four layers.) However, attributes or user comments very often occur in such shallow trees. For example, a user comment occurring in element
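The repeating data regions mentioned above can be illustrated in miniature: adjacent sibling subtrees with an identical tag structure are grouped into one region, which is then a candidate for a list rule. The sketch below is only a rough, hypothetical stand-in for the modified MDR algorithm — MDR proper compares subtrees by edit distance and handles generalized nodes spanning several siblings — and it is not the authors' implementation; it assumes Python's standard library and an invented table fragment.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

def signature(node):
    """A node's structural signature: its tag plus the signatures of its children."""
    return (node.tag, tuple(signature(child) for child in node))

def repeating_regions(parent, min_repeat=2):
    """Group maximal runs of adjacent sibling subtrees with identical structure.

    A crude stand-in for MDR's generalized-node comparison: exact signature
    equality replaces MDR's edit-distance similarity threshold.
    """
    regions = []
    for _, run in groupby(list(parent), key=signature):
        run = list(run)
        if len(run) >= min_repeat:
            regions.append(run)
    return regions

FRAGMENT = """
<tbody>
  <tr><td>Frame</td><td>17</td></tr>
  <tr><td>Wheels</td><td>26</td></tr>
  <tr><td>Gears</td><td>21</td></tr>
  <tr><th>note</th></tr>
</tbody>
"""

tbody = ET.fromstring(FRAGMENT)
regions = repeating_regions(tbody)
print([len(r) for r in regions])  # [3] — one region of three structurally identical rows
```

The `<th>` row is excluded from the region because its signature differs, which mirrors how a data region boundary is detected between record rows and surrounding layout rows.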