Product-Centric Web Page Segmentation and Localization

John Cuzzola, Ryerson University, 350 Victoria St, Toronto, ON M5B 2K3, Canada, jcuzzola@ryerson.ca
Dragan Gašević, Athabasca University, 1 University Drive, Athabasca, AB T9S 3A3, Canada, dgasevic@acm.org
Ebrahim Bagheri, Ryerson University, 350 Victoria St, Toronto, ON M5B 2K3, Canada, bagheri@ryerson.ca

ABSTRACT
The Internet is home to an ever-increasing array of goods and services available to the general consumer. These products are often discovered through search engines whose focus is on document retrieval rather than product procurement. The demand for details of specific products, as opposed to just documents containing such information, has resulted in an influx of product collection databases, deal aggregation services, mobile apps, Twitter feeds, and other just-in-time methods for rapidly finding, indexing, and notifying shoppers of sale events. This has led to our development of intelligent Web crawler technology aimed at this specific category of information retrieval. In this paper, we demonstrate our solution for Web page categorization, segmentation, and localization, which identifies Web pages containing shopping deals and automatically extracts product specifics from the identified pages. Our work is supported with empirical data on its effectiveness. A screencast demonstration is also available online at http://youtu.be/HHPme6AJuCk.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, retrieval models, search process, selection process. I.2.7 [Artificial Intelligence]: Natural Language Processing - text analysis.

General Terms
Algorithms, Experimentation.

Keywords
Natural language processing, search, classification, segmentation, localization, deals, products, web crawling.

1. INTRODUCTION
The World Wide Web has given rise to a digital marketplace where goods and services of all varieties are sold. Retailers, wholesalers, and private individuals use this communication medium to advertise their products directly to the consumer. Conversely, consumers look for these products and use the traditional search engine as the method of discovery. However, these engines are document-centric rather than product-centric; hence they are optimized for the former rather than the latter. A successful search engine relies on its web crawlers to intelligently process visited Web pages for useful information while discarding data that does not contribute to retrieval. Geared specifically to this domain of product search, we have created technology that can identify product Web pages, segment Web pages into logical regions, and discard those regions that do not contain information regarding a specific good or service. The remainder of this paper explains our Web page classification, segmentation, and deal localization technology.

2. BACKGROUND
Our work reported in this paper was inspired by the needs of our industrial partner, SideBuy Technologies, a daily deal aggregator: a service that collects for-purchase goods and services from various deal sites such as Groupon, PriceGrabber, and others. The process of collecting and aggregating this deal information is often performed manually, with large numbers of staff employed as deal seekers [5]. Deal aggregators commonly deploy web scraping tools targeted at deal sites to harvest these deals. However, the collection process is usually dependent on pre-programmed patterns specific to the site being scraped, e.g., a specific sequence of HTML tags. Consequently, even small modifications to such Websites require programming changes in the scraping tools to accommodate them. Furthermore, this targeted pattern-matching approach does not scale to the unstructured and ever-changing content of the Web, where many products are being sold but remain unnoticed and out of reach of the scrapers. Finally, the time-sensitive nature of these deals further fuels the desire for a more automated solution to the deal discovery problem. To this end, we have developed algorithms that allow Web crawlers to identify unstructured, previously unseen Web pages as containing information about relevant online deals.
Once a page is classified as containing relevant information, our algorithms can segment and localize the regions of the Web page that contain product information, while discarding those areas that are not of interest.

3. SYSTEM PIPELINE
Our process of information extraction from unstructured Web content is summarized in Figure 1. A Web crawler scrapes a given page for its HTML content (a). A binary classifier then determines whether the text of the page contains products for purchase (deal) or no such offerings exist on that page (no-deal) (b). Pages classified as not containing products (no-deal) are discarded (f), while pages categorized as deal undergo segmentation, resulting in several segments per page (c). Each of the extracted segments is in turn recursively classified as containing deal or no-deal information in its own right, in an effort to localize individual products (d). Further processing of the deal segments involves semantic annotation, pattern matching, and image recognition to extract property/value pairs, which are ultimately stored in a central repository (e).

Figure 1: Technology pipeline: (a) Web crawler (b) Deal classifier (c) Page segmentation (d) Localization (e) Storage.
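As an illustration of the property/value pairs produced at stage (e), the sketch below shows one plausible shape for a stored deal record. The DealRecord type, its field names, and the example values are assumptions made for this sketch only; they are not the actual repository schema used by the system.

```python
from dataclasses import dataclass
from typing import Optional

# One plausible shape for the property/value pairs stored at stage (e).
# Type and field names are assumptions for this sketch, not the real schema.
@dataclass
class DealRecord:
    source_url: str                    # page the segment was localized from
    name: str                          # product name found in the segment
    price: Optional[str] = None        # price text, if pattern matching finds one
    description: Optional[str] = None  # free-text description of the offering
    image_url: Optional[str] = None    # product image, if image recognition applies

record = DealRecord(source_url="http://example.com/deals",  # hypothetical page
                    name="X7 Smartphone",
                    price="$199.99",
                    description="features a/b/g/n WiFi")
```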
3.1 Binary Classifier
We have developed a binary classifier capable of classifying a text/html fragment as either containing relevant product (deal) information or being void of such information (no-deal). The classifier is a hybrid Naive Bayes/Expectation-Maximization model trained using the WEKA machine learning framework [4]. We use the OpenNLP toolkit to incorporate named entity recognition for dates, organizations, times, locations, percentages, money, and people. Part-of-speech tagging is combined with the WordNet lexical database to disambiguate word sense forms [3]. This information is used as features within our training dataset. The classifier is trained on information already manually extracted using SideBuy Technologies' deal scrapers. The details of our classifier are available in [1].

Figure 2: A segmented page. Product blocks are shown in dashed green; "noisy" blocks, including the header/footer, navigation bar, company logo, and customer tweets, are shown in solid yellow.
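The following is a minimal sketch of the kind of feature vector such a classifier can consume. The actual system derives its features from OpenNLP named entity recognition, part-of-speech tags, and WordNet word senses; the regular expressions and keyword list below are deliberately crude stand-ins used only to make the idea concrete.

```python
import re

# Crude illustration of features for a deal/no-deal classifier. The real
# system uses OpenNLP NER (money, dates, organizations, ...) and WordNet-based
# word sense disambiguation; these patterns are simplified stand-ins.
MONEY_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
PERCENT_RE = re.compile(r"\b\d{1,3}\s?%")
DEAL_WORDS = {"sale", "deal", "save", "off", "discount", "buy", "shipping"}

def deal_features(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_money_mentions": len(MONEY_RE.findall(text)),
        "n_percent_mentions": len(PERCENT_RE.findall(text)),
        "n_deal_words": sum(1 for t in tokens if t in DEAL_WORDS),
        "n_tokens": len(tokens),
    }

# Example:
# deal_features("Save 40% on the X7 Smartphone - now only $199.99!")
# yields money/percent/keyword counts that a Naive Bayes model can train on.
```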
3.2 Segmentation
Web page segmentation is the process of partitioning a Web page into logically grouped sections, either visually, structurally, or semantically, to form cohesive subsets of the Web page. As already reported by various researchers [6,7,8], e-commerce Websites often use a recurring pattern to represent product information. Therefore, each of the product information sets is represented under its own Web segment within the page. Besides the product segments on the page, there may be other segments, such as banners, Web page footers, and others, that are not relevant to product retrieval and search and can hence be discarded for our purpose (see Figure 2). We base our work on this observation and develop a Web page partitioning algorithm that processes a Web page's HTML content and extracts all possible Web segments from that page.

Our system segments web pages based on HTML structure and textual clues obtained from natural language processing. Segmentation of a Web page is accomplished by finding the Longest Frequent Pattern (LFP) [2] of HTML tags at the topmost (outermost) block level. The identified LFP becomes the boundary of division for each partition of the Web page. For example, consider the sequence of nested HTML tags and textual content in Listing 1, where the same tag pattern wraps each of the two product descriptions "The X7 Smartphone features a/b/g/n WiFi." and "The model S2 tablet comes with 4-GB RAM."

Listing 1: A sample recurring pattern in HTML.

The result of this segmentation process is the localization of individual product offerings within each page, in such a way that each individual segment will either contain individual product specifications, such as name, description, and price, or will represent non-product information, in which case the segment is of no interest to us.
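As a rough, self-contained illustration of LFP-based splitting, assuming a block has already been flattened into its top-level sequence of tag names, the sketch below finds the longest tag sub-sequence that repeats and uses its occurrences as partition boundaries. This is a simplification of the repetition-based segmentation described in [2], not the implementation used in our system.

```python
# Simplified LFP detection over a flat list of tag names (illustrative only).
def longest_frequent_pattern(tags, min_occurrences=2):
    """Longest contiguous tag sub-sequence occurring at least min_occurrences times."""
    n = len(tags)
    for length in range(n // min_occurrences, 0, -1):
        counts = {}
        for i in range(n - length + 1):
            key = tuple(tags[i:i + length])
            counts[key] = counts.get(key, 0) + 1
        frequent = [list(k) for k, c in counts.items() if c >= min_occurrences]
        if frequent:
            return frequent[0]  # any longest repeating pattern will do for the sketch
    return None

def segment_boundaries(tags, lfp):
    """Indices where each occurrence of the LFP starts; each index opens a new partition."""
    size = len(lfp)
    return [i for i in range(len(tags) - size + 1) if tags[i:i + size] == lfp]

# Two product blocks sharing the same div/h2/p/img tag pattern:
tags = ["div", "h2", "p", "img", "div", "h2", "p", "img"]
lfp = longest_frequent_pattern(tags)   # ['div', 'h2', 'p', 'img']
print(segment_boundaries(tags, lfp))   # [0, 4] -> split the block at these offsets
```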
3.3 Localization
Once Web segments have been extracted from a Web page, we perform localization on each of these segments. Localization is the process of determining which of the extracted segments contain useful and relevant product information, such as the dashed green boxes in Figure 2, and which segments contain non-relevant information and can be discarded, such as the solid yellow boxes in Figure 2. To perform localization efficiently, we employ the same classifier introduced in Section 3.1. The classifier is now used to determine whether each segment, on its own, would be classified as containing product-specific information. The difference between the first step and the localization step is that in the first step the classifier determines whether the whole page contains product information, while in the localization step an individual segment within an already positively classified page is tested for product-specific information. Here, rather than evaluating the text of the entire page, only the text within the candidate segment is considered. If the block is positively classified, it is split recursively into smaller segments using the segmentation approach of Section 3.2.

1. Let C be a set of candidate blocks of a web page.
   1.1 Initialize C with the outermost block (typically C ← …).
2. For each block in C, classify the block as either deal or no-deal using the binary classifier. Separate the blocks into a deal set (η) and a non-deal set.
   2.1 For each block ƒ ∈ η:
       2.1.1 Find the longest frequent HTML pattern (LFP) of sentence block ƒ.
       2.1.2 If an LFP exists:
             2.1.2.1 Split ƒ into blocks on the LFP → β
             2.1.2.2 Add the split blocks to C: C ← C + β
3. Go to Step 2 if C is non-empty.
Algorithm 1: The Segmentation-Localization algorithm.
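A compact sketch of Algorithm 1 could look as follows. The helpers classify_deal, find_lfp, and split_on_lfp are hypothetical callables standing in for the binary classifier of Section 3.1 and the LFP machinery of Section 3.2; they are injected by the caller and are not part of a published API.

```python
from collections import deque

# Sketch of Algorithm 1 with injected, hypothetical helpers.
def localize(page, classify_deal, find_lfp, split_on_lfp):
    candidates = deque([page])        # Step 1: C starts with the outermost block
    localized = []                    # leaves holding a single product offering
    while candidates:                 # Step 3: repeat while C is non-empty
        block = candidates.popleft()
        if not classify_deal(block):  # Step 2: no-deal blocks are set aside
            continue
        lfp = find_lfp(block)         # Step 2.1.1: longest frequent HTML pattern
        if lfp:                       # Step 2.1.2: split on the LFP and re-queue
            candidates.extend(split_on_lfp(block, lfp))
        else:
            localized.append(block)   # no further pattern: a localized segment
    return localized
```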
This process repeats iteratively for each newly segmented block until either the new block is negatively labeled or a frequent pattern of HTML tags cannot be found. The process is illustrated in Figure 3 and can be visually summarized in a segmentation parse tree, which is constructed by our implementation as shown in Figure 4. The leaves of the segmentation parse tree represent the final outcome: each leaf node is either a segment of non-interest (negatively classified) or a segment containing a single product offering (a positively classified, localized segment). The localization algorithm is formally defined in Algorithm 1.

Figure 3: Segmentation/localization illustrated. (a) Segmentation is performed on the entire web page using the Longest Frequent Pattern (LFP). (b) The binary classifier labels each segment as either relevant (dashed green) or non-relevant (solid yellow). (c) Relevant segments are further partitioned using the LFP. (d) The classifier labels the resulting partitions.

Figure 4: Tree representation of Figure 3. Segments 6, 8, and 9 contain individual product offerings (relevant).

4. EVALUATION
Initial testing of our segmentation and deal localization algorithm involved 42 individual Web pages, each from a different Web site. This set gave us a total of 1,402 individual products. The criteria used to determine whether the final outcome was successful were as follows.

Criterion 1: A block is correctly classified if and only if it makes reference to exactly one product offering. If the block contains information for more than a single product, it was under-partitioned and should have undergone further segmentation in order to split its contents into individual products.

Criterion 2: Because the descriptiveness of a product varies significantly between websites, the minimum amount of information necessary is the name of the product and its price. Blocks that do not meet this minimum were considered over-partitioned.

Criterion 3: A leaf node that satisfies Criteria 1 and 2 but makes reference to the same product as another leaf only gets credit for correctly classifying the product once.

With the above criteria in place, our system performed favorably, with an average F-score of 0.903. The algorithm correctly identified 1,282 products with 154 misclassifications (false positives). A summary of the results is given in Table 1, sorted by best F-score. The relatively poor F-scores of the bottom five web pages appeared to be related to either the structure of the web page, in which frequent patterns were difficult to find, or the content of the page itself, where the classifier mislabeled the segmented region as a non-deal area.

Table 1: Segmentation/localization evaluation results.
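As a sanity check on these figures, if one assumes the 1,282 correctly identified products are true positives, the 154 misclassifications are false positives, and the remaining 120 of the 1,402 products are false negatives, the micro-averaged precision, recall, and F-score work out to roughly 0.893, 0.914, and 0.903, which is consistent with the reported average:

```python
# Micro-averaged check, assuming 1,282 true positives, 154 false positives,
# and 1,402 - 1,282 = 120 false negatives (missed products).
tp, fp, total = 1282, 154, 1402
precision = tp / (tp + fp)                      # ~0.893
recall = tp / total                             # ~0.914
f_score = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f_score, 3))  # 0.893 0.914 0.903
```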
5. DEMONSTRATION
Our segmentation/localization system was tested on a Web page from a deal aggregator's website, pushadeal.com. The output of the analysis is shown in Figure 5. Our intelligent crawler correctly identified the HTML pattern that encompasses the individual products on this Web page.

Figure 5: A segmented and deal-localized Web page.

The leaves of the generated segmentation parse tree reveal two potential product offerings that were classified as non-relevant. A closer look at the content of the page shows that this was correct, since one product offer had "expired" while the other was "coming soon" and therefore not yet available. A further illustration of our system is available as a screencast at http://youtu.be/HHPme6AJuCk. Also, visit the inextweb showcase section at http://inextweb.com, which demonstrates how a database of localized segments is used to provide an object-centered search engine alongside the familiar document-centric engines of Google, Bing, and others.

6. CONCLUSION
This paper demonstrates our approach to Web page classification, segmentation, and localization specific to the domain of goods and services procurement. We describe an intelligent Web crawler implementation that recognizes Web pages containing product information. Our technology can be used to build a collection of properly annotated product objects, which can be leveraged for smarter search in the domain of e-commerce. In our demonstration we will showcase the described technology as follows: 1) we will demonstrate how our machine learning and page segmentation techniques were trained and built; 2) we will introduce and provide open access to the wrapper API of our technology, which is able to extract product information segments from Web pages; and 3) we will show how to use our API to quickly write an application that crawls a given website and extracts product segments. An online demo is available at: http://ls3.rnet.ryerson.ca:8086/DealExtractorSampleJavaClient/sampleform.html

7. ACKNOWLEDGMENTS
The authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and SideBuy Technologies Inc. for their funding support.

8. REFERENCES
[1] Cuzzola, J., Gašević, D., Bagheri, E. What's the Deal? Identifying Online Bargains. In Proceedings of the 2013 Australasian Web Conference (AWC 2013), Adelaide, Australia, 2013.
[2] Kang, J., Yang, J., Choi, J. Repetition-based Web Page Segmentation by Detecting Tag Patterns for Small-Screen Devices. IEEE Transactions on Consumer Electronics, 56(2): 980-986, 2010.
[3] Miller, G. WordNet: A Lexical Database for English. Communications of the ACM, 38(11): 39-41, 1995.
[4] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 2009.
[5] Ghigliotty, D. Do You Really Want a Job at Groupon? Retrieved from http://salesjobs.fins.com/Articles/SBB0001424052970204528204577012073472414832/Do-You-ReallyWant-a-Job-at-Groupon, 2011.
[6] Chakrabarti, D., Kumar, R., Punera, K. Page-level template detection via isotonic smoothing. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), ACM, New York, NY, USA, 61-70, 2007.
[7] Kao, H., Ho, J., Chen, M. WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model. IEEE TKDE, 17(5): 614-627, 2005.
[8] Chakrabarti, D., Kumar, R., Punera, K. A graph-theoretic approach to webpage segmentation. In Proceedings of the International Conference on World Wide Web, pp. 377-386, 2008.