<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Classification of e-commerce websites by product categories</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">George</forename><surname>Moiseev</surname></persName>
							<email>gvmoiseev@edu.hse.ru</email>
							<affiliation key="aff0">
								<orgName type="department">Higher School of Economics</orgName>
								<address>
									<settlement>Moscow</settlement>
									<country key="RU">Russia</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Classification of e-commerce websites by product categories</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">97D4DF269A6827914548DF5CDAE2B712</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T10:00+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>e-commerce website classification</term>
					<term>product classification</term>
					<term>webmining</term>
					<term>web page classification</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The number of e-commerce websites is growing steadily, which makes it hard to collect and analyze them manually. Meanwhile, many market researchers and aggregation services need to collect e-commerce websites, for instance to find all of them in a predefined domain zone or to find only the sites that belong to a certain product category. This paper proposes several methods for improving the preprocessing and feature extraction stages of website classification. They are applied to the task of automatically classifying e-commerce websites by the type of product sold. Experimental results show that the proposed methods improve classification accuracy.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction and Related Works</head><p>One of the most common problems today is the large amount of mostly unorganized information on the web. With the exponential growth of web data, organizing this information becomes an important task for helping users and companies store and retrieve it. One example of such a task is the automated categorization of e-commerce websites, which also includes the problem of retrieving such sites from the web. The problem stems from the strong demand for clustered or categorized e-commerce websites among market researchers who need to take into account different kinds of statistics in the e-commerce sphere, such as <ref type="bibr" target="#b0">[1]</ref>, comparison shopping engines like Google Shopping or Yandex Market <ref type="bibr" target="#b1">[2]</ref>, information retrieval systems <ref type="bibr" target="#b2">[3]</ref> and other types of services. Categorization by the type of product sold usually attracts the most interest.</p><p>To complete the problem statement, we specify that by «e-commerce website» we mean only business-to-customer (selling consumer goods and/or services to customers for profit) or business-to-business (one business making a commercial transaction with another) online shopping. We do not include customer-to-customer shopping, where customers interact with each other and the business only provides the environment. This restriction makes the e-shop retrieval process more complex.</p><p>At first sight, the problem is part of the wider text categorization or web page classification task and can be solved by directly borrowing existing algorithms from the machine learning literature dedicated to these issues <ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref>. 
Nevertheless, the solution is far from straightforward. Web pages are highly structured and filled with noisy content such as JavaScript code, advertisements and copyright notices. If these factors are not taken into account, they degrade the performance of a pure text classification algorithm. It has been shown that exploiting the structure of a web page (HTML tags, hyperlinks) improves the quality of classification <ref type="bibr" target="#b11">[12]</ref>.</p><p>Although most web page classification algorithms <ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref> apply noise reduction techniques and use the structure of web pages to improve classification, there are still issues to discuss and ways to improve.</p><p>Firstly, classifying a website raises several open questions. Which pages of the website should be downloaded and processed? Should hyperlinks from the main page be used, and how? Is the content of the main page more valuable than the content of other pages? Should content from all pages of the website be considered in the feature selection process?</p><p>The second non-trivial question concerns the use of web page structure in the classification process. Should the location of words on the web page be taken into account, and how? Should the display properties of the content be considered?</p><p>Another significant point is the language-based approach. Because of the particularly high interest in research on the Russian online market, we focus mostly on classifying Russian websites. Most studies on this topic consider only web pages in English or try to develop a language-independent method <ref type="bibr" target="#b6">[7]</ref>. Concentrating on a single language area of the web allows us to use language-specific features that enhance the quality of classification. 
An interesting example of such a feature is the use of transliterated words, which frequently occur in markup tags or hyperlinks on a typical e-shop web page.</p><p>Similarly, narrowing the sample of categorized websites to online stores lets us exploit domain knowledge: we can predefine handcrafted features, use the typical e-shop page structure, check for the presence of special HTML tags and look for specific words in hyperlinks. Exploiting most of these features is impossible without restricting the category of websites.</p><p>In this paper, we propose an approach to downloading and preprocessing a website, together with feature engineering techniques that address the issues mentioned above.</p><p>The rest of the paper is organized as follows: the downloading and preprocessing stage is described in Section 2, feature engineering methods are discussed in Section 3, and Section 4 describes the experimental environment, the dataset and the experimental results. Finally, we conclude our work and point out related future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Downloading and preprocessing</head><p>Most e-commerce websites consist not only of a home (main) page but also of several additional pages such as a «shopping cart» page, a «catalogue» page or a «help» page. However, most website catalogues and databases do not store links to all pages of a website. Usually, only the home page link is saved.</p><p>Therefore, the expected input of the system is a list of home page links. A classification process may certainly be based only on features extracted from the main page. But there are cases when the main page contains many images or Flash objects and little textual content. Some studies on this topic show that exploiting information from hyperlinks improves classification accuracy <ref type="bibr" target="#b9">[10]</ref><ref type="bibr" target="#b10">[11]</ref>. Furthermore, it is common practice for e-commerce websites to place information about the products sold on individual pages or on a catalogue page. Since this information is essential for our categorization, we need to find a way to gather and exploit the data from some pages of a given website.</p><p>The primary way to obtain other web pages when only the home page is available is to follow hyperlinks located on the home page. The issue here is that we have to retrieve only useful hyperlinks, because most hyperlinks lead to data that is useless or even harmful for classification. In the course of the study we derived an empirical rule: ignore links that satisfy one or more of the conditions listed below:</p><p>1. the link refers to a different website; 2. the hyperlink anchor text contains terms frequently used in anchor texts of web pages from different categories. The list of these terms is the union of the sets of the most frequent anchor texts from each category. 
Some examples of such terms are: «доставка» (delivery), «контакты» (contacts), «помощь» (help) and «корзина» (shopping cart).</p><p>Another issue is how to exploit the content of the selected links. Several studies on this topic show that this should be done very carefully. The obvious method is to concatenate the text of the main page with the texts of other pages and use the concatenated text as input for the classifier. However, Chakrabarti shows in his research that this method performs dismally and that classification without concatenation is more effective <ref type="bibr" target="#b8">[9]</ref>. In the course of the study we carried out a similar experiment and observed the same results in the binary classification task. Our idea is to extract only the meta tags and the title (which usually gives a good summary of the page) from other pages of the website and combine them with information from the main page. This approach eliminates most of the possible noise from other pages by extracting only summarized information from the meta tags and the title. But when the possibility of topic drift and noisy information is quite low, using entire pages may be advantageous because it brings in more information about the website. Thus both variants are tested and compared with classification by the main page only. Experimental results can be found in Section 4.</p><p>The next preprocessing steps are fairly standard: removing noisy content such as copyright notices, advertisements or JavaScript code, removing stop words, extracting plain text from the page, tokenization, lowercase conversion and stemming. Stemming uses the Snowball algorithm. 
To detect advertisements we use a separate classifier whose features include occurrences of links or words referring to widely used online advertising systems («ad.yandex.ru», «adsense», «adserver», «adsystem», «adsale», «openx»), the number of links in the tag, the number of words, the proportion of capital letters, the proportion of full stops, link and tag density, font size and margins. To remove copyright notices we check for occurrences of the word «copyright» or the «©» symbol together with a date. The original web page with its markup tags is nevertheless saved for the feature extraction process.</p></div>
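<div xmlns="http://www.tei-c.org/ns/1.0"><p>The empirical link filtering rule described above can be illustrated with a minimal Python sketch. It is not the implementation used in the study: the function name is_useful_link, the example URLs and the short stop-term set are assumptions made for illustration, and the full list of frequent anchor terms is not given in the paper.</p>

```python
from urllib.parse import urljoin, urlparse

# Hypothetical stop-term set. In the paper it is the union of the most
# frequent anchor texts of each category; only four examples are given there.
COMMON_ANCHOR_TERMS = {"доставка", "контакты", "помощь", "корзина"}


def is_useful_link(home_url, href, anchor_text):
    """Return True if the link violates neither condition of the rule."""
    home = urlparse(home_url)
    target = urlparse(urljoin(home_url, href))
    # Condition 1: ignore links that refer to a different website.
    if target.netloc.lower() != home.netloc.lower():
        return False
    # Condition 2: ignore links whose anchor text contains a common term.
    words = anchor_text.lower().split()
    return not any(word in COMMON_ANCHOR_TERMS for word in words)


# Example links as (href, anchor text) pairs taken from a home page.
links = [
    ("/catalog/phones", "Телефоны"),
    ("/delivery", "Доставка"),
    ("http://other.example/page", "Партнёр"),
]
kept = [
    href
    for href, text in links
    if is_useful_link("http://shop.example/", href, text)
]
```

<p>In this example only the catalogue link is kept: the delivery link is dropped by the anchor text condition and the partner link by the different-website condition. Comparing host names exactly is a simplification; a real crawler would probably also need to treat «www» prefixes and subdomains.</p></div>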
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Feature Engineering</head><p>As mentioned before, HTML tags provide significant information about the content of a web page. For instance, words nested in the &lt;title&gt; tag or the &lt;description&gt; meta tag are usually more important for classification than words from &lt;body&gt;, as they are meant to summarize the page. Authors of a web page also use header tags, color or font to emphasize some information, and there are special tags for important text such as &lt;strong&gt; and &lt;em&gt;. But the style of web pages can be very different: some authors use special tags to emphasize a few important words on a whole page, while others may mark every second sentence.</p><p>Our idea is to weight each term according to the nearest tag it is nested in and to make the weight of a tag inversely proportional to its frequency (i.e. the more frequent the tag, the less valuable the enclosed terms).</p><p>The term weighting formula for the i-th term in the k-th website is derived from TF-IDF <ref type="bibr" target="#b14">[15]</ref> as follows:</p><formula xml:id="formula_0">W_{ik} = \frac{tf_{ik} \log \frac{N}{n_i}}{\sqrt{\sum_{j=1}^{N} \left( tf_{ij} \log \frac{N}{n_j} \right)^2}} \quad (1)</formula><p>where n_i is the number of websites in which the i-th term appears, N is the total number of websites in the sample and tf_{ik} is computed as:</p><formula xml:id="formula_1">tf_{ik} = \sum_{t \in T} w(t) \, f(i, k, t) \quad (<label>2</label>)</formula><p>where T is the set of all tags of the k-th website, f(i, k, t) is the frequency of the i-th term in tag t of website k and w(t) is calculated as follows:</p><formula xml:id="formula_3">w(t) = \frac{1}{\sum_{x \in T} [x = t]} \quad (3)</formula><p>Here [x = t] equals 1 when x = t and 0 otherwise, so the denominator counts how often tag t occurs.</p><p>Besides these features, the binary «e-commerce or not» classification process considers some handcrafted empirical binary features, which were discovered through careful analysis of several dozen typical e-commerce websites. 
These binary features are checked before the stemming stage because for some of them the form of the word matters. They are presented in Table <ref type="table" target="#tab_0">1</ref> in common regular expression notation.</p></div>
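<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make equations (2) and (3) concrete, the following Python sketch computes the tag-weighted term frequency for one website. It is an illustration under stated assumptions, not the code used in the experiments: the page is assumed to be already parsed into (tag, tokens) pairs, where each pair is one occurrence of a tag together with the tokens nested directly in it, and the function names tag_weights and weighted_tf are hypothetical.</p>

```python
from collections import Counter


def tag_weights(tagged_blocks):
    """Equation (3): w(t) = 1 / (number of occurrences of tag t)."""
    counts = Counter(tag for tag, _ in tagged_blocks)
    return {tag: 1.0 / n for tag, n in counts.items()}


def weighted_tf(tagged_blocks):
    """Equation (2): tf_ik = sum over tags t of w(t) * f(i, k, t)."""
    weights = tag_weights(tagged_blocks)
    tf = Counter()
    for tag, tokens in tagged_blocks:
        for token in tokens:
            # Each occurrence of a term contributes the weight of its tag.
            tf[token] += weights[tag]
    return dict(tf)


# One unique title tag and four paragraph tags (assumed example data).
blocks = [
    ("title", ["магазин", "обуви"]),
    ("p", ["обувь", "магазин"]),
    ("p", ["доставка"]),
    ("p", ["оплата"]),
    ("p", ["контакты"]),
]
tf = weighted_tf(blocks)
```

<p>In this example the unique title tag gets weight 1 and each of the four paragraph tags gets weight 0.25, so «магазин» receives 1.25 while «доставка» receives only 0.25. The resulting values would then replace the raw term frequency in equation (1).</p></div>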
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>To test the effectiveness of using tags in feature engineering and of considering meta tags and titles from pages obtained via hyperlinks, several experiments were conducted. Since the focus of this paper is on the preprocessing and feature engineering stages, we chose one of the most popular classifiers, the Support Vector Machine (SVM). This learning algorithm was proposed by V. Vapnik <ref type="bibr" target="#b12">[13]</ref> and has proved to be one of the most effective algorithms for text categorization <ref type="bibr" target="#b13">[14]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Dataset</head><p>The dataset was obtained from datainsight.ru and consists entirely of websites in Russian. It comprises two subsets: one for the binary «e-commerce or not» classification task and one for categorization. The subsets were gathered separately, so not all websites from the second subset are present in the first and vice versa.</p><p>The dataset for binary classification contains 1312 e-commerce and 1077 non-e-commerce websites. Some of the non-e-commerce websites are specially chosen C2C sites, while the others were chosen randomly from the .ru and .рф domains. The dataset for categorization contains 1448 websites in total. The available categories and the number of websites in each are listed in Table <ref type="table" target="#tab_1">2</ref>. General department stores like amazon.com or aliexpress.com belong to the «General stores» category. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation</head><p>We use common measures to evaluate the performance of the classifier: precision, recall and F-score <ref type="bibr" target="#b15">[16]</ref>. To evaluate average performance across multiple categories, the macro-averaged F-score is used <ref type="bibr" target="#b16">[17]</ref>.</p><p>7-fold cross-validation was employed for testing: the F-score is computed for each fold, and the average of these F-scores is reported.</p></div>
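<div xmlns="http://www.tei-c.org/ns/1.0"><p>The evaluation procedure can be sketched in a few lines of Python. The sketch assumes that «F-score» means the balanced F1 measure and that each fold is given as a pair of true and predicted label lists; the function names and the toy labels are illustrative only and do not come from the experiments.</p>

```python
def f_score(y_true, y_pred, label):
    """F1 of one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f_score(y_true, y_pred):
    """Unweighted mean of the per-class F1 over all true classes."""
    labels = sorted(set(y_true))
    return sum(f_score(y_true, y_pred, c) for c in labels) / len(labels)


def mean_over_folds(folds):
    """Average the macro F1 computed separately on each fold."""
    scores = [macro_f_score(y_true, y_pred) for y_true, y_pred in folds]
    return sum(scores) / len(scores)


# Two toy folds as (true labels, predicted labels) pairs.
folds = [
    ([0, 0, 1, 1], [0, 1, 1, 1]),
    ([0, 1], [0, 1]),
]
average = mean_over_folds(folds)
```

<p>Macro-averaging gives every category the same influence regardless of its size, which matters for a dataset whose categories differ considerably in the number of websites.</p></div>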
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Experimental results</head><p>There are two main subjects for experiments. The first one is related to the use of information from web pages found via hyperlinks. Here we compare three approaches: using only the main page for classification, using the main page plus title and meta tags from other selected pages, and using the concatenation of the main page with other selected pages.</p><p>The second subject concerns our approach of using and weighting markup tags in feature extraction. To build the baseline for this method, we remove all markup tags after preprocessing (keeping their content) and apply the plain TF-IDF algorithm <ref type="bibr" target="#b14">[15]</ref> before the classification step.</p><p>Both subjects are tested in the binary classification task and in the task of categorization by product type.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Binary «e-commerce or not» classification</head><p>In binary classification into «e-commerce» and «non-e-commerce» classes, retrieving e-commerce websites is more valuable for us than filtering out non-e-commerce ones. Thus we evaluate the binary classifier by the F-score of the «e-commerce» class. The F-score results are listed in Table <ref type="table" target="#tab_3">4</ref>.</p><p>While analyzing the results, we found that a considerable part of the mistakes (approximately 43% on average) is caused by customer-to-customer websites, which are not counted as e-commerce websites in our classification but whose feature values are quite similar to those of B2B and B2C websites.</p><p>The observed results show that the most effective way of using information from other pages is to extract only the meta tags and the title. This approach ignores possible noise in other parts of the additional pages and takes only their summary, which is useful for detecting e-shops, while using the whole content of the pages leads to some mistakes. For example, other pages may contain a description of some e-shop or other information that increases the chance of a false positive.</p><p>As can be seen from the results, the proposed Tag Weighting method is more effective than pure TF-IDF: it outperforms pure TF-IDF for every way of extracting data. Most e-commerce websites announce that they are e-shops in &lt;title&gt; and in meta information, and the Tag Weighting method gives these tags the maximum weight because they are unique. Tag Weighting is also better at handling additional information from other pages, as its F-score on «main page + whole other pages» is higher than on «only main page» data. This is because Tag Weighting assigns a small weight to some kinds of noisy information, which is usually located in frequently repeated tags.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E-commerce categorization</head><p>Table <ref type="table" target="#tab_4">5</ref> lists the results of categorization expressed as macro-averaged F-score. Again, the Tag Weighting method performs better than pure TF-IDF. In this case, however, the most effective approach was extracting the whole pages from hyperlinks found on main pages. This is because the dataset for categorization contains only e-commerce websites, which reduces the amount of noisy information on the pages of a website and thus decreases the chance of topic drift across pages. Most highly specialized e-shops place hyperlinks to a catalogue page, to product category pages or to individual product pages where the products are described thoroughly. This information is useful for classification by product type in most cases. The exceptions are universal e-shops, for which detailed descriptions of some goods may lead to misclassification. This can be seen in Table <ref type="table" target="#tab_5">6</ref>, which presents the average F-score for each category when Tag Weighting is used and the main page is concatenated with whole other pages. A significant number of misclassifications involve the «Souvenirs and presents» and «Jewelry and clocks» categories, because many websites from these categories are very similar to each other. For instance, some websites from the «presents» category sell clocks as an «expensive present». There were also many mistakes in the «Sport equipment and hobbies» and «Clothing and footwear» categories, because clothes and footwear are part of the assortment of «Sport equipment and hobbies» shops. It is therefore no great surprise that the best classified categories are those least similar to the others: «Auto products», «Medical goods», «Household goods» and «Furniture».</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>The main goal of the study was to understand which information is useful for classifying websites and how it can be used.</p><p>To check the hypothesis that information from hyperlinks improves the quality of classification, we suggested a method for retrieving useful hyperlinks. We also compared several ways of using information from the web pages found through these links. The experiments show that using the data from hyperlinks retrieved with our method increases classification accuracy. They also revealed that it is preferable to exploit only meta tags and titles from retrieved pages when the diversity of the data is high. Conversely, when the type of data being classified is more or less limited, exploiting entire pages may improve performance.</p><p>Another important idea was to exploit the structure of a web page to enhance classification. This paper introduces the approach of using weighted markup tags in the feature extraction process and a way to weight them. As the experiments illustrate, this approach is more effective than feature extraction that ignores the structure.</p><p>The proposed methods and approaches (except the handcrafted e-commerce features) can be used not only in e-commerce classification but in any web classification task.</p><p>Also, some features specific to Russian e-commerce were listed and explained. 
The problem statement together with the dataset opens a wide field of possible improvements and research: (a) balance the number of websites across categories and filter noisy websites out of the datasets; (b) test different machine learning algorithms and their ensembles on the current dataset; (c) try using hyperlinks that lead to other websites, with filtering of noisy hyperlinks, and compare with the use of local hyperlinks only; (d) develop variations of the Tag Weighting method.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Handcrafted features for e-commerce web sites classification</figDesc><table><row><cell>Feature</cell><cell>Remarks</cell></row><row><cell>корзин[а-я]</cell><cell>shopping cart</cell></row><row><cell>[a-z]*cart</cell><cell>often occurs in shopping cart hyperlink anchor text</cell></row><row><cell>[a-z]*basket</cell><cell>often occurs in shopping cart hyperlink anchor text</cell></row><row><cell>достав[илк][а-я]*</cell><cell>delivery</cell></row><row><cell>самовывоз[а-я]*</cell><cell>pickup</cell></row><row><cell>ассортимент[а-я]</cell><cell>assortment</cell></row><row><cell>([0-9]*|)руб</cell><cell>indicates price</cell></row><row><cell>сумм[ауые]</cell><cell>total cost</cell></row><row><cell>товар[а-я]*</cell><cell>goods</cell></row><row><cell>оплат[а-я]*</cell><cell>payment</cell></row><row><cell>заказ[а-я]*</cell><cell>order</cell></row><row><cell>купить</cell><cell>to buy</cell></row><row><cell>покуп[а-я]*</cell><cell>purchase</cell></row><row><cell>pay(ment|)</cell><cell>often occurs in payment hyperlink anchor text</cell></row><row><cell>pric(i|e)[a-z]*</cell><cell>often occurs in price list hyperlink anchor text</cell></row><row><cell>наличи[а-я]</cell><cell>availability</cell></row><row><cell>(розниц[а-я]|розничн[а-я]*)</cell><cell>retail</cell></row><row><cell>скидк[а-я]</cell><cell>discount</cell></row><row><cell>цен([аеоуы].{0,2}|ник)</cell><cell>indicates price</cell></row><row><cell>аксессуар[а-я]*</cell><cell>accessories</cell></row><row><cell>(рас|)продаж[а-я]{0,2}</cell><cell>sale</cell></row><row><cell>products?</cell><cell>often occurs in catalogue hyperlink anchor text</cell></row><row><cell>интернет.{0,5}магазин.{0,5}</cell><cell>online shop</cell></row><row><cell>delivery[a-z]*</cell><cell>often occurs in delivery hyperlink anchor text</cell></row><row><cell>sales?</cell><cell>often occurs in catalogue hyperlink anchor text</cell></row><row><cell>оптом</cell><cell>wholesale</cell></row><row><cell>oplat(a|y)</cell><cell>often occurs in payment hyperlink anchor text</cell></row><row><cell>bitrix</cell><cell>often occurs on web sites about creating e-shop sites</cell></row></table></figure>
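<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how such handcrafted features could be turned into a binary feature vector, the sketch below applies a small subset of the Table 1 patterns to the raw (unstemmed) text of a page. The feature names, the sample text and the choice of case-insensitive matching are assumptions of this sketch, not details reported in the paper.</p>

```python
import re

# A subset of the Table 1 patterns; the dictionary keys are hypothetical names.
PATTERNS = {
    "cart_ru": r"корзин[а-я]",
    "delivery_ru": r"достав[илк][а-я]*",
    "price_rub": r"([0-9]*|)руб",
    "buy_ru": r"купить",
    "cart_en": r"[a-z]*cart",
}
# Case-insensitive matching is an assumption made for this sketch.
COMPILED = {
    name: re.compile(pattern, re.IGNORECASE)
    for name, pattern in PATTERNS.items()
}


def binary_features(raw_text):
    """Map each pattern to 1 if it occurs anywhere in the text, else 0."""
    return {
        name: int(regex.search(raw_text) is not None)
        for name, regex in COMPILED.items()
    }


sample = "Купить телефон за 9990 руб, доставка по Москве, /shopcart"
features = binary_features(sample)
```

<p>For the sample text every feature except the Russian shopping cart pattern fires. Matching on the unstemmed text follows the remark above that these features are checked before the stemming stage.</p></div>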
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>E-commerce product categories dataset</figDesc><table><row><cell>category id</cell><cell>category name</cell><cell>number of web sites</cell></row><row><cell>0</cell><cell>Auto products</cell><cell>138</cell></row><row><cell>1</cell><cell>Medical goods</cell><cell>289</cell></row><row><cell>2</cell><cell>Health and beauty products</cell><cell>114</cell></row><row><cell>3</cell><cell>Appliances and electronics</cell><cell>168</cell></row><row><cell>4</cell><cell>Household goods</cell><cell>171</cell></row><row><cell>5</cell><cell>Furniture</cell><cell>79</cell></row><row><cell>6</cell><cell>Souvenirs, presents</cell><cell>36</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Example of how the data set is stored</figDesc><table><row><cell>domain name</cell><cell>category id</cell></row><row><cell>seving.ru</cell><cell>9</cell></row><row><cell>evalar.ru</cell><cell>1</cell></row><row><cell>mojon.ru</cell><cell>13</cell></row><row><cell>hunt.ru</cell><cell>12</cell></row><row><cell>…</cell><cell>…</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>F-score of «e-commerce» class</figDesc><table><row><cell>Used web site information</cell><cell>pure TF-IDF</cell><cell>TF-IDF with Tag Weighting</cell></row><row><cell>only main page</cell><cell>0.85</cell><cell>0.89</cell></row><row><cell>main page + meta and title from other pages</cell><cell>0.89</cell><cell>0.94</cell></row><row><cell>main page + whole other pages</cell><cell>0.86</cell><cell>0.92</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 .</head><label>5</label><figDesc>Macro-averaged F-score of e-commerce categorization by sold product type</figDesc><table><row><cell>Used web site information</cell><cell>pure TF-IDF</cell><cell>TF-IDF with Tag Weighting</cell></row><row><cell>only main page</cell><cell>0.67</cell><cell>0.72</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6 .</head><label>6</label><figDesc>F-score of e-commerce categorization by sold product type for each category</figDesc><table><row><cell>category id</cell><cell>category name</cell><cell>average F-score</cell></row><row><cell>0</cell><cell>Auto products</cell><cell>0.89</cell></row><row><cell>1</cell><cell>Medical goods</cell><cell>0.98</cell></row><row><cell>2</cell><cell>Health and beauty products</cell><cell>0.82</cell></row><row><cell>3</cell><cell>Appliances and electronics</cell><cell>0.79</cell></row><row><cell>4</cell><cell>Household goods</cell><cell>0.94</cell></row><row><cell>5</cell><cell>Furniture</cell><cell>0.92</cell></row><row><cell>6</cell><cell>Souvenirs and presents</cell><cell>0.69</cell></row><row><cell>7</cell><cell>Media (books, disks and concert tickets)</cell><cell>0.76</cell></row><row><cell>8</cell><cell>Jewelry and clocks</cell><cell>0.73</cell></row><row><cell>9</cell><cell>Technical and industrial equipment</cell><cell>0.79</cell></row><row><cell>10</cell><cell>Food and kindred products</cell><cell>0.79</cell></row><row><cell>11</cell><cell>Pet supplies</cell><cell>0.85</cell></row><row><cell>12</cell><cell>Sport equipment and hobbies</cell><cell>0.73</cell></row><row><cell>13</cell><cell>Clothing and footwear</cell><cell>0.78</cell></row><row><cell>14</cell><cell>General stores</cell><cell>0.63</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Electronic commerce adoption: an empirical study of small and medium US businesses</title>
		<author>
			<persName><forename type="first">E</forename><surname>Grandon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pearson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information &amp; Management</title>
		<imprint>
			<biblScope unit="volume">42</biblScope>
			<biblScope unit="page" from="197" to="216" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">A personalized and integrative comparison-shopping engine and its applications</title>
		<author>
			<persName><forename type="first">S</forename><surname>Yuan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Decision Support Systems</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="139" to="156" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Automatic text categorization and its application to text retrieval</title>
		<author>
			<persName><forename type="first">Wai</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ruiz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Srinivasan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Knowl. Data Eng</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="865" to="879" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Machine learning in automated text categorization</title>
		<author>
			<persName><forename type="first">F</forename><surname>Sebastiani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">CSUR</title>
		<imprint>
			<biblScope unit="volume">34</biblScope>
			<biblScope unit="page" from="1" to="47" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Automated learning of decision rules for text categorization</title>
		<author>
			<persName><forename type="first">C</forename><surname>Apte</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Damerau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Weiss</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Information Systems</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="233" to="251" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Classifier and feature set ensembles for web page classification</title>
		<author>
			<persName><forename type="first">A</forename><surname>Onan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Information Science</title>
		<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Web page classification</title>
		<author>
			<persName><forename type="first">X</forename><surname>Qi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Davison</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys</title>
		<imprint>
			<biblScope unit="volume">41</biblScope>
			<biblScope unit="page" from="1" to="31" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Research on web page classification-based core characteristics and web structure</title>
		<author>
			<persName><forename type="first">G</forename><surname>Zengmin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jianxia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Wireless and Mobile Computing</title>
		<imprint>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page">253</biblScope>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Enhanced hypertext categorization using hyperlinks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Chakrabarti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Dom</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Indyk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM SIGMOD Record</title>
		<imprint>
			<biblScope unit="volume">27</biblScope>
			<biblScope unit="page" from="307" to="318" />
			<date type="published" when="1998">1998</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Hypertext categorization using hyperlink patterns and meta data</title>
		<author>
			<persName><forename type="first">R</forename><surname>Ghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Slattery</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Eighteenth International Conference on Machine Learning</title>
		<meeting>the Eighteenth International Conference on Machine Learning</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="178" to="185" />
		</imprint>
	</monogr>
	<note>ICML 01</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">A practical hypertext categorization method using links and incrementally available class information</title>
		<author>
			<persName><forename type="first">H</forename><surname>Oh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Myaeng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
		<meeting>the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="264" to="271" />
		</imprint>
	</monogr>
	<note>SIGIR 00</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Webpage Classification based on Compound of Using HTML Features &amp; URL Features and Features of Sibling Pages</title>
	</analytic>
	<monogr>
		<title level="j">International Journal of Advancements in Computing Technology</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="36" to="46" />
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Support vector networks</title>
		<author>
			<persName><forename type="first">V</forename><surname>Vapnik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cortes</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
		<imprint>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Text categorization with support vector machines: learning with many relevant features</title>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">10th European Conference on Machine Learning</title>
				<meeting><address><addrLine>Heidelberg</addrLine></address></meeting>
		<imprint>
			<publisher>Springer Verlag</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="137" to="142" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">An information-theoretic perspective of tf-idf measures</title>
		<author>
			<persName><forename type="first">A</forename><surname>Aizawa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">39</biblScope>
			<biblScope unit="page" from="45" to="65" />
			<date type="published" when="2003">2003</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Information retrieval</title>
		<author>
			<persName><forename type="first">C</forename><surname>Van Rijsbergen</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1979">1979</date>
			<publisher>Butterworths</publisher>
			<pubPlace>London</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Natural language processing for online applications</title>
		<author>
			<persName><forename type="first">P</forename><surname>Jackson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Moulinier</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<publisher>John Benjamins Pub</publisher>
			<pubPlace>Amsterdam</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
