Product-Centric Web Page Segmentation and Localization

John Cuzzola, Ryerson University, 350 Victoria St, Toronto, ON M5B 2K3, Canada, jcuzzola@ryerson.ca
Dragan Gašević, Athabasca University, 1 University Drive, Athabasca, AB T9S 3A3, Canada, dgasevic@acm.org
Ebrahim Bagheri, Ryerson University, 350 Victoria St, Toronto, ON M5B 2K3, Canada, bagheri@ryerson.ca

ABSTRACT
The Internet is home to an ever-increasing array of goods and services available to the general consumer. These products are often discovered through search engines whose focus is on document retrieval rather than product procurement. The demand for details of specific products, as opposed to just documents containing such information, has resulted in an influx of product collection databases, deal aggregation services, mobile apps, Twitter feeds, and other just-in-time methods for rapidly finding, indexing, and notifying shoppers of sale events. This has led to our development of intelligent Web crawler technology aimed at this specific category of information retrieval. In this paper, we demonstrate our solution for Web page categorization, segmentation, and localization, which identifies Web pages containing shopping deals and automatically extracts product specifics from the identified pages. Our work is supported with empirical data on its effectiveness. A screencast demonstration is also available online at http://youtu.be/HHPme6AJuCk.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, retrieval models, search process, selection process. I.2.7 [Artificial Intelligence]: Natural Language Processing - text analysis.

General Terms
Algorithms, Experimentation.

Keywords
Natural language processing, search, classification, segmentation, localization, deals, products, web crawling.

1. INTRODUCTION
The World Wide Web has given rise to a digital marketplace where goods and services of all varieties are sold. Retailers, wholesalers, and private individuals use this communication medium to advertise their products directly to the consumer. Conversely, consumers look for these products and use the traditional search engine as the method of discovery. However, these engines are document-centric rather than product-centric; hence they are optimized for the former rather than the latter. A successful search engine relies on its web crawlers to intelligently process visited Web pages for useful information while discarding data that does not contribute to retrieval. Geared specifically to this domain of product search, we have created technology that can identify product Web pages, segment Web pages into logical regions, and discard those regions that do not contain information regarding a specific good or service. The remainder of this paper explains our Web page classification, segmentation, and deal localization technology.

2. BACKGROUND
Our work reported in this paper was inspired by the needs of our industrial partner, SideBuy Technologies, a daily deal aggregator: a service that collects for-purchase goods and services from various deal sites such as Groupon, PriceGrabber, and others. The process of collecting and aggregating this deal information is often performed manually, with large numbers of staff employed as deal seekers [5]. Deal aggregators commonly deploy web scraping tools targeted at deal sites to harvest these deals. However, the collection process is usually dependent on pre-programmed patterns specific to the site being scraped, e.g., a specific sequence of HTML tags. Consequently, even small modifications to such Websites require programming changes in the scraping tools to accommodate them. Furthermore, this targeted pattern-matching approach does not scale to the unstructured and ever-changing content of the Web, where many products are being sold but remain unnoticed and out of reach of the scrapers. Finally, the time-sensitive nature of these deals further fuels the desire for a more automated solution to the deal discovery problem. To this end, we have developed algorithms that allow Web crawlers to identify unstructured, previously unseen Web pages as containing information about relevant online deals.
Once a page is classified as containing relevant information, our algorithms can segment and localize the regions of the Web page that contain product information, while discarding those areas that are not of interest.

3. SYSTEM PIPELINE
Our process of information extraction from unstructured Web content is summarized in Figure 1. A Web crawler scrapes a given page for its HTML content (a). A binary classifier then determines whether the text of the page contains products for purchase (deal) or no such offerings exist on that page (no-deal) (b). Pages classified as not containing products (no-deal) are discarded (f), while pages categorized as deal undergo segmentation, resulting in several segments per page (c). Each of the extracted segments is in turn recursively classified as containing deal or no-deal information in its own right, in an effort to localize individual products (d). Further processing of the deal segments involves semantic annotation, pattern matching, and image recognition to extract property/value pairs, which are ultimately stored in a central repository (e).

Figure 1: Technology pipeline: (a) Web crawler (b) Deal classifier (c) Page segmentation (d) Localization (e) Storage.
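As an illustration of the property/value pairs produced at stage (e), the sketch below shows one plausible shape for a stored deal record. The DealRecord type, its field names, and the example values are assumptions made for this sketch only; they are not the actual repository schema used by the system.

```python
from dataclasses import dataclass
from typing import Optional

# One plausible shape for the property/value pairs stored at stage (e).
# Type and field names are assumptions for this sketch, not the real schema.
@dataclass
class DealRecord:
    source_url: str                    # page the segment was localized from
    name: str                          # product name found in the segment
    price: Optional[str] = None        # price text, if pattern matching finds one
    description: Optional[str] = None  # free-text description of the offering
    image_url: Optional[str] = None    # product image, if image recognition applies

record = DealRecord(source_url="http://example.com/deals",  # hypothetical page
                    name="X7 Smartphone",
                    price="$199.99",
                    description="features a/b/g/n WiFi")
```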
3.1 Binary Classifier
We have developed a binary classifier capable of classifying a text/html fragment as either containing relevant product (deal) information or being void of such information (no-deal). The classifier is a hybrid Naive Bayes/Expectation-Maximization model trained using the WEKA machine learning framework [4]. We use the OpenNLP toolkit to incorporate named entity recognition for dates, organizations, times, locations, percentages, money, and people. Part-of-speech tagging is combined with the WordNet lexical database to disambiguate word sense forms [3]. This information is used as features within our training dataset. The classifier is trained on information already manually extracted using SideBuy Technologies' deal scrapers. The details of our classifier are available in [1].

Figure 2: A segmented page. Product blocks are shown in dashed green; "noisy" blocks, including the header/footer, navigation bar, company logo, and customer tweets, are shown in solid yellow.
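The following is a minimal sketch of the kind of feature vector such a classifier can consume. The actual system derives its features from OpenNLP named entity recognition, part-of-speech tags, and WordNet word senses; the regular expressions and keyword list below are deliberately crude stand-ins used only to make the idea concrete.

```python
import re

# Crude illustration of features for a deal/no-deal classifier. The real
# system uses OpenNLP NER (money, dates, organizations, ...) and WordNet-based
# word sense disambiguation; these patterns are simplified stand-ins.
MONEY_RE = re.compile(r"[$€£]\s?\d+(?:[.,]\d{2})?")
PERCENT_RE = re.compile(r"\b\d{1,3}\s?%")
DEAL_WORDS = {"sale", "deal", "save", "off", "discount", "buy", "shipping"}

def deal_features(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "n_money_mentions": len(MONEY_RE.findall(text)),
        "n_percent_mentions": len(PERCENT_RE.findall(text)),
        "n_deal_words": sum(1 for t in tokens if t in DEAL_WORDS),
        "n_tokens": len(tokens),
    }

# Example:
# deal_features("Save 40% on the X7 Smartphone - now only $199.99!")
# yields money/percent/keyword counts that a Naive Bayes model can train on.
```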
3.2 Segmentation
Web page segmentation is the process of partitioning a Web page into logically grouped sections, either visually, structurally, or semantically, to form cohesive subsets of the Web page. As already reported by various researchers [6,7,8], e-commerce Websites often use a recurring pattern to represent product information. Therefore, each of the product information sets is represented under its own Web segment within the page. Besides the product segments on the page, there may be other segments, such as banners, Web page footers, and others, that are not relevant to product retrieval and search and can hence be discarded for our purpose (see Figure 2). We base our work on this observation and develop a Web page partitioning algorithm that processes a Web page's HTML content and extracts all possible Web segments from that page.

Our system segments web pages based on HTML structure and textual clues obtained from natural language processing. Segmentation of a Web page is accomplished by finding the Longest Frequent Pattern (LFP) [2] of HTML tags at the topmost (outermost) block level. The identified LFP becomes the boundary of division for each partition of the Web page. For example, consider the sequence of nested HTML tags and textual content in Listing 1, where the same tag pattern wraps each of the two product descriptions "The X7 Smartphone features a/b/g/n WiFi." and "The model S2 tablet comes with 4-GB RAM."

Listing 1: A sample recurring pattern in HTML.

The result of this segmentation process is the localization of individual product offerings within each page, in such a way that each individual segment will either contain individual product specifications, such as name, description, and price, or will represent non-product information, in which case the segment is of no interest to us.
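As a rough, self-contained illustration of LFP-based splitting, assuming a block has already been flattened into its top-level sequence of tag names, the sketch below finds the longest tag sub-sequence that repeats and uses its occurrences as partition boundaries. This is a simplification of the repetition-based segmentation described in [2], not the implementation used in our system.

```python
# Simplified LFP detection over a flat list of tag names (illustrative only).
def longest_frequent_pattern(tags, min_occurrences=2):
    """Longest contiguous tag sub-sequence occurring at least min_occurrences times."""
    n = len(tags)
    for length in range(n // min_occurrences, 0, -1):
        counts = {}
        for i in range(n - length + 1):
            key = tuple(tags[i:i + length])
            counts[key] = counts.get(key, 0) + 1
        frequent = [list(k) for k, c in counts.items() if c >= min_occurrences]
        if frequent:
            return frequent[0]  # any longest repeating pattern will do for the sketch
    return None

def segment_boundaries(tags, lfp):
    """Indices where each occurrence of the LFP starts; each index opens a new partition."""
    size = len(lfp)
    return [i for i in range(len(tags) - size + 1) if tags[i:i + size] == lfp]

# Two product blocks sharing the same div/h2/p/img tag pattern:
tags = ["div", "h2", "p", "img", "div", "h2", "p", "img"]
lfp = longest_frequent_pattern(tags)   # ['div', 'h2', 'p', 'img']
print(segment_boundaries(tags, lfp))   # [0, 4] -> split the block at these offsets
```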
3.3 Localization
Once Web segments have been extracted from a Web page, we perform localization on each of these segments. Localization is the process of determining which of the extracted segments contain useful and relevant product information, such as the dashed green boxes in Figure 2, and which segments contain non-relevant information and can be discarded, such as the solid yellow boxes in Figure 2. To perform localization efficiently, we employ the same classifier introduced in Section 3.1. The classifier is now used to determine whether each segment, on its own, would be classified as containing product-specific information. The difference between the first step and the localization step is that in the first step the classifier determines whether the whole page contains product information, while in the localization step an individual segment within an already positively classified page is tested for product-specific information. Here, rather than evaluating the text of the entire page, only the text within the candidate segment is considered. If the block is positively classified, it is split recursively into smaller segments using the segmentation approach of Section 3.2.

1. Let C be a set of candidate blocks of a web page.
   1.1 Initialize C with the outermost block (typically C ← …).
2. For each block in C, classify the block as either deal or no-deal using the binary classifier. Separate the blocks into a deal set (η) and a non-deal set.
   2.1 For each block ƒ ∈ η:
       2.1.1 Find the longest frequent HTML pattern (LFP) of sentence block ƒ.
       2.1.2 If an LFP exists:
             2.1.2.1 Split ƒ into blocks on the LFP → β
             2.1.2.2 Add the split blocks to C: C ← C + β
3. Go to Step 2 if C is non-empty.
Algorithm 1: The Segmentation-Localization algorithm.
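A compact sketch of Algorithm 1 could look as follows. The helpers classify_deal, find_lfp, and split_on_lfp are hypothetical callables standing in for the binary classifier of Section 3.1 and the LFP machinery of Section 3.2; they are injected by the caller and are not part of a published API.

```python
from collections import deque

# Sketch of Algorithm 1 with injected, hypothetical helpers.
def localize(page, classify_deal, find_lfp, split_on_lfp):
    candidates = deque([page])        # Step 1: C starts with the outermost block
    localized = []                    # leaves holding a single product offering
    while candidates:                 # Step 3: repeat while C is non-empty
        block = candidates.popleft()
        if not classify_deal(block):  # Step 2: no-deal blocks are set aside
            continue
        lfp = find_lfp(block)         # Step 2.1.1: longest frequent HTML pattern
        if lfp:                       # Step 2.1.2: split on the LFP and re-queue
            candidates.extend(split_on_lfp(block, lfp))
        else:
            localized.append(block)   # no further pattern: a localized segment
    return localized
```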
This process repeats iteratively for each newly segmented block until either the new block is negatively labeled or a frequent pattern of HTML tags cannot be found. The process is illustrated in Figure 3 and can be visually summarized in a segmentation parse tree, which is constructed by our implementation as shown in Figure 4. The leaves of the segmentation parse tree represent the final outcome: each leaf node is either a segment of non-interest (negatively classified) or a segment containing a single product offering (a positively classified, localized segment). The localization algorithm is formally defined in Algorithm 1.

Figure 3: Segmentation/localization illustrated. (a) Segmentation is performed on the entire web page using the Longest Frequent Pattern (LFP). (b) The binary classifier labels each segment as either relevant (dashed green) or non-relevant (solid yellow). (c) Relevant segments are further partitioned using the LFP. (d) The classifier labels the resulting partitions.

Figure 4: Tree representation of Figure 3. Segments 6, 8, and 9 contain individual product offerings (relevant).

4. EVALUATION
Initial testing of our segmentation and deal localization algorithm involved 42 individual Web pages, each from a different Web site. This set gave us a total of 1,402 individual products. The criteria used to determine whether the final outcome was successful were as follows.

Criterion 1: A block is correctly classified if and only if it makes reference to exactly one product offering. If the block contains information for more than a single product, it was under-partitioned and should have undergone further segmentation in order to split its contents into individual products.

Criterion 2: Because the descriptiveness of a product varies significantly between websites, the minimum amount of information necessary is the name of the product and its price. Blocks that do not meet this minimum were considered over-partitioned.

Criterion 3: A leaf node that satisfies Criteria 1 and 2 but makes reference to the same product as another leaf only gets credit for correctly classifying the product once.

With the above criteria in place, our system performed favorably, with an average F-score of 0.903. The algorithm correctly identified 1,282 products with 154 misclassifications (false positives). A summary of the results is given in Table 1, sorted by best F-score. The relatively poor F-scores of the bottom five web pages appeared to be related to either the structure of the web page, in which frequent patterns were difficult to find, or the content of the page itself, where the classifier mislabeled the segmented region as a non-deal area.

Table 1: Segmentation/localization evaluation results.
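As a sanity check on these figures, if one assumes the 1,282 correctly identified products are true positives, the 154 misclassifications are false positives, and the remaining 120 of the 1,402 products are false negatives, the micro-averaged precision, recall, and F-score work out to roughly 0.893, 0.914, and 0.903, which is consistent with the reported average:

```python
# Micro-averaged check, assuming 1,282 true positives, 154 false positives,
# and 1,402 - 1,282 = 120 false negatives (missed products).
tp, fp, total = 1282, 154, 1402
precision = tp / (tp + fp)                      # ~0.893
recall = tp / total                             # ~0.914
f_score = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(f_score, 3))  # 0.893 0.914 0.903
```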
5. DEMONSTRATION
Our segmentation/localization system was tested on a Web page from a deal aggregator's website, pushadeal.com. The output of the analysis is shown in Figure 5. Our intelligent crawler correctly identified the HTML pattern that encompasses the individual products on this Web page.

Figure 5: A segmented and deal-localized Web page.

The leaves of the generated segmentation parse tree reveal two potential product offerings that were classified as non-relevant. A closer look at the content of the page shows that this was correct, since one product offer had "expired" while the other was "coming soon" and therefore not yet available. A further illustration of our system is available as a screencast at http://youtu.be/HHPme6AJuCk. Also, visit the inextweb showcase section at http://inextweb.com, which demonstrates how a database of localized segments is used to provide an object-centered search engine alongside the familiar document-centric engines of Google, Bing, and others.

6. CONCLUSION
This paper demonstrates our approach to Web page classification, segmentation, and localization specific to the domain of goods and services procurement. We describe an intelligent Web crawler implementation that recognizes Web pages containing product information. Our technology can be used to build a collection of properly annotated product objects, which can be leveraged for smarter search in the domain of e-commerce. In our demonstration we will showcase the described technology as follows: 1) we will demonstrate how our machine learning and page segmentation techniques were trained and built; 2) we will introduce and provide open access to the wrapper API of our technology, which is able to extract product information segments from Web pages; and 3) we will show how to use our API to quickly write an application that crawls a given website and extracts product segments. An online demo is available at: http://ls3.rnet.ryerson.ca:8086/DealExtractorSampleJavaClient/sampleform.html

7. ACKNOWLEDGMENTS
The authors would like to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) and SideBuy Technologies Inc. for their funding support.

8. REFERENCES
[1] Cuzzola, J., Gašević, D., Bagheri, E. What's the Deal? Identifying Online Bargains. In Proceedings of the 2013 Australasian Web Conference (AWC 2013), Adelaide, Australia, 2013.
[2] Kang, J., Yang, J., Choi, J. Repetition-based Web Page Segmentation by Detecting Tag Patterns for Small-Screen Devices. IEEE Transactions on Consumer Electronics, 56(2): 980-986, 2010.
[3] Miller, G. WordNet: A Lexical Database for English. Communications of the ACM, 38(11): 39-41, 1995.
[4] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I. The WEKA Data Mining Software: An Update. SIGKDD Explorations, 11(1), 2009.
[5] Ghigliotty, D. Do You Really Want a Job at Groupon? Retrieved from http://salesjobs.fins.com/Articles/SBB0001424052970204528204577012073472414832/Do-You-ReallyWant-a-Job-at-Groupon, 2011.
[6] Chakrabarti, D., Kumar, R., Punera, K. Page-level template detection via isotonic smoothing. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), ACM, New York, NY, USA, 61-70, 2007.
[7] Kao, H., Ho, J., Chen, M. WISDOM: Web Intrapage Informative Structure Mining Based on Document Object Model. IEEE TKDE, 17(5): 614-627, 2005.
[8] Chakrabarti, D., Kumar, R., Punera, K. A graph-theoretic approach to webpage segmentation. In Proceedings of the International Conference on World Wide Web, pp. 377-386, 2008.