V. Kůrková et al. (Eds.): ITAT 2014 with selected papers from Znalosti 2014, CEUR Workshop Proceedings Vol. 1214, pp. 40–45, http://ceur-ws.org/Vol-1214, Series ISSN 1613-0073, © 2014 P. Gurský, V. Chabal’, R. Novotný, M. Vaško, M. Vereščák

Extracting Product Data from E-Shops

Peter Gurský, Vladimír Chabal’, Róbert Novotný, Michal Vaško, and Milan Vereščák

Institute of Computer Science, Univerzita Pavla Jozefa Šafárika, Jesenná 5, 040 01 Košice, Slovakia
{peter.gursky, robert.novotny, michal.vasko, milan.verescak}@upjs.sk, v.chabal@gmail.com

Abstract: We present a method for extracting product data from e-shops, based on an annotation tool embedded within a web browser. The tool simplifies the automatic detection of data presented in tabular and list form. The annotations serve as a basis for extraction rules for a particular web page, which are subsequently used in the product data extraction method.

1 Introduction and Motivation

Since their beginnings, web pages have served for the presentation of information to human readers. Unfortunately, not even the advent of the semantic web, which has been with us for more than ten years, was able to successfully solve the problem of extracting structured data from web pages. Currently, there are various approaches to the extraction of information that was not intended for machine processing.

The scope of the Kapsa.sk project is to retrieve the information contained within e-shop products by crawling and extracting data, and to present it in a unified form which simplifies the user's choice of preferred products.

The result of crawling is a set of web pages that contain product details. As a subproblem, the crawler identifies pages that actually contain product details and ignores other kinds of pages.

A typical e-shop contains various kinds of products. Our goal is to retrieve as much structured data about each product as possible. More specifically, this means retrieving their properties or attributes, including their values. We have observed that each kind of product, called a domain, has a different set of attributes. For example, the domain of television sets has attributes such as display size or refresh rate. On the other hand, these attributes will not appear in the domains of washing machines or bicycles. However, there are certain attributes which are common to all domains, such as product name, price, or quantity in stock. We call such attributes domain-independent. Often, the names of domain-independent attributes are implicit or omitted in the HTML code of a web page (price being the most notorious example).

Since the number of product domains can be fairly large (tens, even hundreds), we have developed an extraction system in which it is not necessary to annotate each product domain separately. In this paper, we present a method which extracts product data from a particular e-shop and requires the annotation of just a single page. Furthermore, many annotation aspects are automatized within this process. The whole annotation proceeds within a web browser.

2 State of the Art

The area of web extraction systems is well researched. There are many surveys and comparisons of the existing systems [1, 2, 3, 4]. The actual code that extracts relevant data from a web page and outputs it in a structured form is traditionally called a wrapper [1]. Wrappers can be classified according to the process of creation and method of use into the following categories:

• manually constructed web information extraction systems

• automatically constructed systems requiring user assistance

• automatically constructed systems with partial user assistance

• fully automatized systems without user assistance

2.1 Manually Constructed Web Information Extraction Systems

Manually constructed systems generally require the use of a programming language or define a domain-specific language (DSL). Wrapper construction is then equivalent to wrapper programming. The main advantage lies in the easy customization for different domains, while the obvious drawback is the required programming skill (which may be eased by the lesser complexity of a particular DSL). Well-known systems are MINERVA [5], TSIMMIS [6] and WebOQL [7]. The OXPath [20] language is a more recent extension of the XPath language specifically targeted at information extraction, crawling and web browser automation. It makes it possible to fill in forms, follow hyperlinks and create iterative rules.

2.2 Automatically Constructed Web Extraction Systems Requiring User Assistance

These systems are based on various methods for automatic wrapper generation (also known as wrapper induction), mostly using machine learning. This approach usually requires an input set of manually annotated examples (i.e., web pages), from which additional annotated pages are automatically induced. A wrapper is created according to the presented pages. Such approaches do not require any programming skills. Very often, the actual annotation is realized within a GUI. On the other hand, the annotation process can be heavily domain- and web-page-dependent, and may be very demanding. Tools in this category include WIEN [8], SoftMealy [9] and Stalker [9].

2.3 Automatically Constructed Web Extraction Systems with Partial User Assistance

These tools use automated wrapper generation methods. They tend to be more automated, and do not require users to fully annotate sample web pages. Instead, they work well with partial or incomplete pages. One approach is to induce wrappers from these samples. User assistance is required only during the actual extraction rule creation process. The most well-known tools are IEPAD [11], OLERA [12] and Thresher [13].

2.4 Fully Automatized Systems Without User Assistance

A typical tool in this group aims to fully automate the extraction process with no or minimal user assistance. It searches for repeating patterns and structures within a web page or data records. Such structures are then used as a basis for a wrapper. Usually, these tools are designed for web pages with a fixed template format. This means that the extracted information needs to be refined or further processed. Example tools in this category are RoadRunner [14], ExAlg [15] or the approach used by Maruščák et al. [16].

3 Web Extraction System within the Kapsa.sk Project

Our design focuses on an automatically constructed web information extraction system with partial user assistance. We have designed an annotation tool which is used to annotate the relevant product attributes occurring on a sample page from a single e-shop. Each annotated product attribute corresponds to an element within the HTML tree structure of the product page, and can be uniquely addressed by an XPath expression, optionally enriched with regular expressions.

We have observed that many e-shops generate product pages with a server-side template engine. This means that in many cases, the XPath expressions that address the relevant product attributes remain the same. Generally, this allows us to annotate the data only once, on a suitable web page (see Figure 1). To ease the annotation effort, we discover the repeating data regions with the modified MDR algorithm [18] described in section 5.1.

Figure 1: Extracting data from template-based pages

The result of the annotation process is an extractor (corresponding to the notion of a wrapper) represented as a set of extraction rules. In the implementation, we represent these rules in JSON, thus making them independent from the annotation tool (see section 4 for more information).

This way, we are able to enrich the manual annotation approach with a certain degree of automation. Further improvements on ideas from other solutions are based on addressing HTML elements containing product data not only with XPath (an approach used in OXPath [20]), but also with regular expressions. It is known that some product attributes may occur in a single HTML element in a semi-structured form (for example, as a comma-delimited list). Since XPath expressions are unable to address such non-atomic values, we use regular expressions to reach below this level of coarseness. A similar approach is used in W4F [21]; we have built upon these ideas and present them in our web-browser-based annotation tool. Furthermore, we use the modified MDR algorithm to detect the repeating regions.

4 Extractors – The Fundament of Annotation

4.1 Extraction Rules

In the first step of annotation, an extractor is constructed. It is composed of one or multiple extraction rules, each corresponding to an object attribute. All extraction rules have two common properties:

1. They address a single HTML element on a web page that contains the extracted value. The addressing is represented by an XPath expression.

2. The default representation of the extraction rule in both the annotation and extraction tools is JSON.
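To illustrate how such JSON-encoded rules might be evaluated, the following sketch applies a simplified list rule with a nested label-and-value rule to a page fragment. This is an illustrative sketch only, not the authors' implementation: the rule schema is a hypothetical simplification, the page fragment is invented, and Python's `xml.etree.ElementTree` with its limited XPath subset stands in for a full HTML parser and XPath 1.0 engine.

```python
import xml.etree.ElementTree as ET

# A well-formed (XHTML-like) product page fragment; a real e-shop page would
# be parsed with a proper HTML parser and a full XPath engine.
PAGE = """
<div>
  <h1>Terrano Lady</h1>
  <table><tbody>
    <tr><td>Frame</td><td>17"</td></tr>
    <tr><td>Wheels</td><td>26"</td></tr>
  </tbody></table>
</div>
"""

# Hypothetical simplified rule: a "list" rule addresses repeated subtrees,
# and a nested "label-and-value" rule extracts an attribute name and value
# relative to each matched subtree (which serves as the context node).
RULE = {
    "type": "list",
    "xpath": ".//tbody/tr",
    "items": [
        {"type": "label-and-value", "labelXPath": "td[1]", "xpath": "td[2]"},
    ],
}

def apply_list_rule(root, rule):
    """Evaluate a list rule: every matched subtree yields label/value pairs."""
    records = []
    for subtree in root.findall(rule["xpath"]):  # context node for nested rules
        for nested in rule["items"]:
            label = subtree.find(nested["labelXPath"]).text
            value = subtree.find(nested["xpath"]).text
            records.append((label, value))
    return records

print(apply_list_rule(ET.fromstring(PAGE), RULE))
# [('Frame', '17"'), ('Wheels', '26"')]
```

A full extractor would additionally need the regular-expression layer described in section 3 to split non-atomic values such as comma-delimited lists.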
{
  "site": "http://www.kaktusbike.sk/terrano-lady-12097",
  "extractor": {
    "type": "complex",
    "items": [
      { /* -- rule with a fixed attribute name -- */
        "type": "label-and-manual-value",
        "xpath": "//*[contains(@class,\"name\")]/h1",
        "label": "Name"
      },
      { /* -- list (table rows) -- */
        "type": "list",
        "xpath": "//*[contains(@class,\"columns\")]/table/tbody/tr",
        "label": "N/A",
        "items": [
          {
            "type": "label-and-value",
            "labelXPath": "td[1]",
            "xpath": "td[2]"
          }
        ]
      }
    ]
  }
}

Figure 2: Defining an extractor along with various extraction rules

4.2 Types of Extraction Rules

The value with fixed name rule is used to extract one (atomic) value along with a predefined attribute name. Usually, this attribute is specified via the graphical user interface of the annotation tool. Alternatively, it is specified as the name of a well-known domain-independent attribute. See Figure 2, rule label-and-manual-value, for an example.

The value with extracted name rule expands upon the previous rule. The name of the extracted attribute is defined by an additional XPath expression that corresponds to an HTML element containing the attribute name (e.g. the string Price). The example in Figure 2 uses the label-and-value extraction rule.

The complex rule is a composition (nesting) of other rules. This allows us to define an extractor for multiple values, usually corresponding to the attributes of a particular product. Whenever the complex rule contains an XPath expression (addressing a single element), all nested rules use this element as a context node. In other words, nested rules can specify their XPath expressions relative to this element. The example uses the extraction rule declared as complex. Usually, a complex rule is the top-level rule in an extractor.

The list rule is used to extract multiple values with a common ancestor addressed by an XPath expression. This expression then corresponds to multiple HTML subtrees. This rule must contain one or more nested extraction rules. A typical use is to extract cells in table rows (by nesting a rule for an extracted name). In the example, we use the extraction rule declared as a list.

An extractor defined by rule composition (i.e., with the complex rule) is specifically suited for data extraction not only from a particular web page (as implemented in the user interface, see section 6), but also for any other product pages of a particular e-shop. In this case, no additional cooperation with the annotation tool is required.

The annotation of domain-independent values is usually realized with the value with fixed name rule, since the attribute name is not explicitly available within the HTML source of the web page.

Domain-dependent attributes (which are more frequent than the domain-independent ones) usually occur in a visually structured "tabular" form. The annotation automatization process described in the next section allows us to infer a list rule along with a nested value-with-extracted-name rule. This combination of rules is sufficient to extract product data from multiple product detail pages. Furthermore, this particular combination supports attribute permutation or variation. Therefore, we can successfully identify and extract attributes that are swapped, or even omitted, on some web pages. This feature allows us to create wrappers that are suitable for all product domains of an e-shop.

Moreover, this set of extraction rules may be further expanded. We may specify additional rules that support regular expressions along with the XPath, or we may support the extraction of attribute values from web page metadata, e.g. the product identifier specified within the URL of the web page.

5 Automatizing Annotation Process

As we have mentioned in the previous section, we aim to make the annotation process easier and quicker. A product page often uses either tabular or list forms, which visually clarify complex information about many product properties.

Supporting shallow trees. The MDR algorithm has a limited use for shallow tree data regions. (The original authors state a minimal limit of four layers.) However, attributes or user comments very often occur in such shallow trees. For example, a user comment occurring in element
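The repeating data regions mentioned above can be illustrated in miniature: adjacent sibling subtrees with an identical tag structure are grouped into one region, which is then a candidate for a list rule. The sketch below is only a rough, hypothetical stand-in for the modified MDR algorithm — MDR proper compares subtrees by edit distance and handles generalized nodes spanning several siblings — and it is not the authors' implementation; it assumes Python's standard library and an invented table fragment.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

def signature(node):
    """A node's structural signature: its tag plus the signatures of its children."""
    return (node.tag, tuple(signature(child) for child in node))

def repeating_regions(parent, min_repeat=2):
    """Group maximal runs of adjacent sibling subtrees with identical structure.

    A crude stand-in for MDR's generalized-node comparison: exact signature
    equality replaces MDR's edit-distance similarity threshold.
    """
    regions = []
    for _, run in groupby(list(parent), key=signature):
        run = list(run)
        if len(run) >= min_repeat:
            regions.append(run)
    return regions

FRAGMENT = """
<tbody>
  <tr><td>Frame</td><td>17</td></tr>
  <tr><td>Wheels</td><td>26</td></tr>
  <tr><td>Gears</td><td>21</td></tr>
  <tr><th>note</th></tr>
</tbody>
"""

tbody = ET.fromstring(FRAGMENT)
regions = repeating_regions(tbody)
print([len(r) for r in regions])  # [3] — one region of three structurally identical rows
```

The `<th>` row is excluded from the region because its signature differs, which mirrors how a data region boundary is detected between record rows and surrounding layout rows.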