<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Social Tag Prediction Base on Supervised Ranking Model</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Hao</forename><surname>Cao</surname></persName>
							<email>caohao@mail.nankai.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">College of Software</orgName>
								<orgName type="institution">Nankai University</orgName>
								<address>
									<settlement>Tianjin</settlement>
									<country key="CN">P.R.China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Maoqiang</forename><surname>Xie</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">College of Software</orgName>
								<orgName type="institution">Nankai University</orgName>
								<address>
									<settlement>Tianjin</settlement>
									<country key="CN">P.R.China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lian</forename><surname>Xue</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">College of Software</orgName>
								<orgName type="institution">Nankai University</orgName>
								<address>
									<settlement>Tianjin</settlement>
									<country key="CN">P.R.China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Chunhua</forename><surname>Liu</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">College of Software</orgName>
								<orgName type="institution">Nankai University</orgName>
								<address>
									<settlement>Tianjin</settlement>
									<country key="CN">P.R.China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Fei</forename><surname>Teng</surname></persName>
							<email>nktengfei@mail.nankai.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">College of Software</orgName>
								<orgName type="institution">Nankai University</orgName>
								<address>
									<settlement>Tianjin</settlement>
									<country key="CN">P.R.China</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Yalou</forename><surname>Huang</surname></persName>
							<email>huangyl@nankai.edu.cn</email>
							<affiliation key="aff0">
								<orgName type="department">College of Software</orgName>
								<orgName type="institution">Nankai University</orgName>
								<address>
									<settlement>Tianjin</settlement>
									<country key="CN">P.R.China</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Social Tag Prediction Base on Supervised Ranking Model</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">D2B46C3498EB80B2D72ABF3051A33D08</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T06:52+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Recently, social tag recommendation has gained increasing attention in web research, and many approaches have been proposed, which can be classified into two types: rule-based and classification-based. However, rule-based approaches require considerable expert experience and manual work, and their generalization is limited. Classification-based approaches face essential barriers, since tag recommendation is transformed into a multi-class classification problem even though the tag collection is not fixed. In contrast, a ranking model is more suitable, and supervised learning can be used to build it. In addition, the whole tag recommendation task can be divided into four subtasks according to the existence of users and resources; different features are constructed for the different subtasks, so that the available information can be used sufficiently. The experimental results show that the proposed supervised ranking model performs well on the training and test datasets of RSDC 2008 recovered by ourselves.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Tags are a new way to index web resources, helping users to categorize, share, and later search for them. The tags assigned by a user also reveal that user's interests; therefore, from the tags a user has already applied, one can find other users with similar interests, as well as similar resources of interest. Tagging is thus widely used in social networks such as Bibsonomy, Del.icio.us, Last.fm, etc. A tag recommendation system can suggest a few tags for a specified web resource, saving the user time and effort when marking up resources. Further, the recommended tags and existing tags can be used to predict the profile of the user and the interest in the web resource, for example, to predict what they like and dislike. Research on tag recommendation is also suggestive for other applications, such as online advertisement, where we can predict which advertisements a visitor might be interested in from the surrounding text and the visitor's browsing history.</p><p>Recently, social tag recommendation has gained increasing attention in web research and has become a hot issue for both industry and academia. For example, tag recommendation was one of the tasks in ECML RSDC '08, and in ECML PKDD '09 it has become the exclusive task. However, the performance of tag recommendation is not yet good enough for wide use; more research and progress are essential before tag recommendation can be deployed in commercial systems. In this paper, a supervised ranking model is applied to the tag recommendation problem, and good results are achieved on the test data.</p><p>The rest of the paper is organized as follows: Section 2 reviews previous work on tag recommendation. Section 3 describes the supervised ranking model. Section 4 presents our experiment settings, experiment procedure, and analysis of the results on the recovered '08 dataset. The model's performance on the '09 dataset is presented in Section 5. Section 6 summarizes our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Previous Work</head><p>Much research has been done on tag recommendation, most of which can be categorized into two types: rule-based and classification-based.</p><p>Rule-based approaches are used by many researchers. Lipczak <ref type="bibr" target="#b0">[1]</ref> proposed a three-step tag recommendation system: basic tags are extracted from the resource title; next, the set of potential recommendations is extended by related tags proposed by a lexicon based on co-occurrences of tags within a resource's posts; finally, tags are filtered by the user's personomy, the set of tags previously used by the user. Tatu et al. <ref type="bibr" target="#b1">[2]</ref> used document and user models derived from the textual content associated with URLs and publications by social bookmarking tool users; the textual information includes a URL's title, a user's description of a document, or a BibTeX field associated with a scientific publication. They applied natural language understanding techniques to produce tag recommendations, such as concept extraction and the extraction of conflated tags, which groups tags into semantically related sets. However, rule-based approaches require considerable expert experience and manual work, and their generalization is limited.</p><p>Classification-based approaches are also used for the tag recommendation task. Katakis et al. <ref type="bibr" target="#b2">[3]</ref> modeled automated tag suggestion as a multi-label text classification task. Heymann et al. <ref type="bibr" target="#b3">[4]</ref> predicted tags based on page text, anchor text, surrounding hosts, and other tags applied to the URL. They found an entropy-based metric which captures the generality of a particular tag and informs an analysis of how well that tag can be predicted. They also found that tag-based association rules can produce very high-precision predictions, as well as giving deeper insight into the relationships between tags. Their results have implications both for the study of tagging systems as potential information retrieval tools and for the design of such systems. However, classification does not offer a good solution to the tag prediction problem: first, the tag space is fixed, so resources can only be assigned existing tags; second, the number of tags can be very large, which makes traditional classification models rather inefficient.</p><p>Collaborative filtering is a commonly used technique for user-oriented tasks, and many researchers have tried it for tag recommendation. Mishne <ref type="bibr" target="#b4">[5]</ref> used a collaborative approach to automated tag assignment for weblog posts. Jaschke et al. <ref type="bibr" target="#b5">[6]</ref> evaluated and compared user-based collaborative filtering and a graph-based recommender; the results show that both methods provide better results than a non-personalized baseline, and the graph-based recommender in particular outperforms existing methods considerably.</p><p>Budura et al. <ref type="bibr" target="#b6">[7]</ref> used neighborhood-based tag recommendation, which makes use of content similarity; a principled but simple scoring approach is used to select the candidate tags. In our paper, by contrast, a machine learning method is used: a ranking model is learned automatically, the candidate tags are ranked, and the top-ranked tags are suggested as recommendations.</p><p>3 Supervised Ranking Model for Tag Recommendation</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Problem Statement</head><p>The tag recommendation problem can be described as follows: for a given post P whose user is U and whose resource is R, a set of tags is suggested for the post. Here we denote the post as P, a tag as T, the resource as R, and the user as U.</p><p>A possible and most natural way to solve the tag recommendation problem is as follows: first, a set of candidate tags is selected for the post, and then the tags that are most likely to belong to the post are selected as recommendations. The commonly used approaches to choosing the tags are rule-based and classification-based methods, but both have defects: rule-based approaches rely on expert experience and manual effort to set up the rules and tune the parameters; classification-based approaches are restricted to a fixed tag space and are inefficient when the task is treated as a multi-label problem. In this paper, tag recommendation is converted into a problem of ranking candidate tags. A ranking model is constructed to ensure that tags which are most likely to be the post's tags rank higher than tags which are not. Supervised learning is used to construct a ranking model satisfying this restriction. Ranking SVM is the most frequently used supervised ranking model and has proven successful, so it is used as our supervised ranking model in the experiments. All the candidate tags for one post are grouped as a ranking group, and the top-ranked candidate tags are selected as recommendation tags.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Introduction to Ranking SVM</head><p>Here we briefly describe the Ranking Support Vector Machine (Ranking SVM) model for tag recommendation.</p><p>Assume that X ∈ ℜ m is the input feature space representing the features of a candidate tag given a user and a resource, where m denotes the number of features. Y = {0, 1} is the output rank space, represented by the labels: 1 means the tag was assigned by the user, and 0 means it was not. (x, y) ∈ X × Y denotes a feature vector and label forming a training instance.</p><p>Given a training set with tags T = {t 1 , t 2 , ..., t n }, each tag t i has an associated pair {x, y}, so the whole training set can be formulated as S = {x i , y i } N i=1 , where N is the total number of tags. In Ranking SVM <ref type="bibr" target="#b7">[8]</ref>, the ranking model f is a linear function f(x) = ⟨w, x⟩, where w is the weight vector and ⟨•, •⟩ denotes the inner product. In Ranking SVM we construct a new training set S ′ from the original training set S = {x i , y i } N i=1 : for every pair with y i ≠ y j in S, we construct (x i − x j , z ij ) and add it to S ′ , where z ij = +1 if y i ≻ y j , and −1 otherwise. Here ≻ denotes the preference relationship; for example, y = 1 is preferred to y = 0. For notational consistency, we write S ′ as {(x</p><formula xml:id="formula_0">1 i − x 2 i , z i )} D i=1 .</formula><p>The final model is formalized as the following quadratic programming problem:</p><formula xml:id="formula_1">min w,ξi 1 2C ∥w∥ 2 + Σ D i=1 ξ i s.t. ξ i ≥ 0, z i ⟨w, x 1 i − x 2 i ⟩ ≥ 1 − ξ i<label>(1)</label></formula><p>Problem (<ref type="formula" target="#formula_1">1</ref>) can be solved using existing quadratic programming methods. Figure <ref type="figure" target="#fig_0">1</ref> shows an example of the ranking SVM model. The ranking SVM model converts the ranking problem into a binary classification problem: each object to be ranked is compared with all other objects in the same ranking group. For n objects, the model makes C(n, 2) = n(n−1)/2 comparisons and then outputs the ranking result. This is its advantage over a classification model: a classification model ignores the other candidate tags, whereas the ranking model takes their existence into consideration.</p></div>
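The construction of the pairwise training set S′ described above can be sketched in a few lines (a pure-Python illustration only; the paper's experiments use the SVM-light implementation of Ranking SVM):

```python
def pairwise_transform(instances):
    """Build the pairwise set S' for Ranking SVM from one ranking group.

    `instances` is a list of (x, y) pairs, where x is a feature vector
    and y is in {0, 1}. For every pair with y_i != y_j we emit
    (x_i - x_j, z) with z = +1 if y_i is preferred (y_i > y_j), else -1.
    """
    s_prime = []
    for i, (xi, yi) in enumerate(instances):
        for xj, yj in instances[i + 1:]:
            if yi == yj:
                continue  # only pairs with different labels are informative
            diff = [a - b for a, b in zip(xi, xj)]
            s_prime.append((diff, 1 if yi > yj else -1))
    return s_prime

# Three candidate tags for one post; only the first was chosen by the user,
# so two difference vectors (chosen minus not-chosen) are produced.
group = [([1.0, 0.0], 1), ([0.2, 0.5], 0), ([0.1, 0.9], 0)]
pairs = pairwise_transform(group)
```

A binary SVM trained on these difference vectors then yields the weight vector w of the linear ranking function.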
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Ranking Process</head><p>For any post P ij in the test dataset, we denote the collection of all candidate tags for post P ij as CT {P ij } = {CT 1 , CT 2 , ..., CT n }, with CT k (k = 1, 2, ..., n) the k-th candidate tag. The ranking model orders the candidate tags as {CT 1 ′ , CT 2 ′ , ..., CT n ′ } from top to bottom, and the top-k tags are selected as the predicted tags of post P ij . Table <ref type="table" target="#tab_0">1</ref> shows the steps to rank the candidate tags. The number of recommended tags also affects the performance of the system. For example, if the actual number of tags for the post with content id=123456 is 3, precision is lost when 4 tags are recommended to the user, so a proper number of tags to recommend must be chosen. The number used in our experiment is half the number of candidate tags, capped at 5; that is, we recommend 5 tags at most.</p></div>
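The top-k selection rule of Section 3.3 (half the candidates, at most 5) is simple enough to state directly; how "half" rounds is our assumption, since the paper does not specify:

```python
def select_top_tags(ranked_tags):
    """Pick recommendation tags from candidates already ranked by the model.

    Per Section 3.3: recommend half the number of candidate tags
    (rounding down here -- an assumption), but never more than 5.
    """
    k = min(len(ranked_tags) // 2, 5)
    return ranked_tags[:k]
```

For example, four ranked candidates yield two recommendations, while a dozen candidates are capped at five.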
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Training Process</head><p>For each post P ij in the training dataset, candidate tags CT {P ij } are extracted. They are then grouped by post, and features are extracted for each candidate from the post content. Those CT k ∈ T {P ij } are labeled '1', and the rest '0'. We then use the SVM-light tool to train a Ranking SVM model. When predicting the tags of posts in the test dataset, the model learned on the training dataset is applied to rank the candidate tags, and the top-ranked tags are selected as recommendations.</p><p>4 Experiments on 08's recovered dataset 4.1 Experiment settings 2008's dataset recovery In order to compare our performance with that of the '08 teams, we recovered the '08 dataset (both training and test data) and tested our model on it. Although the '08 test data can be downloaded from the web, we found that the user IDs had been changed between the datasets. However, the content id field in the '08 test data is consistent with the '09 data, so we recovered the '08 dataset from the '09 dataset using the content id and date fields. The real '08 training and test data are subsets of the '09 data, so this recovery is possible. After inspecting the real '08 test data, we found that all its posts fall between Mar. 31, 2008 and May 15, 2009, so we use the posts from this period in the '09 training data as the recovered '08 test data, and the posts before Mar. 31, 2008 as the recovered '08 training data. There are still slight differences between our recovered data and the real '08 data; we assume these differences do not seriously affect performance, so our results are comparable with the '08 results.</p><p>Some statistics of our recovered '08 dataset are given below. Table <ref type="table" target="#tab_1">2</ref> shows the statistics of posts on this recovered dataset. 
Table <ref type="table" target="#tab_2">3</ref> shows the statistics of posts according to the existence of their user and resource in the recovered training data. In the remainder of Section 4, "training data" refers to the recovered training data and "test data" to the recovered test data. Data preprocessing First, the terms are converted to lowercase. Then stop words such as "a", "the", "is", and "an" are removed, since these terms are unlikely to be tags of the post. Finally, punctuation marks such as ':' and ',' are removed, and LaTeX symbols such as '{' and '}' are removed using regular expressions.</p><p>Table <ref type="table" target="#tab_4">5</ref> shows example results of the data preprocessing.</p></div>
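The preprocessing steps can be sketched as below. The stop-word list here is a tiny illustrative stand-in (the paper does not publish its full list), and the exact regular expressions are our assumptions:

```python
import re

# Tiny illustrative stop-word list; the paper's full list is not published.
STOP_WORDS = {"a", "an", "the", "is", "of", "to"}

def preprocess(text):
    """Lowercase, strip LaTeX braces and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"[{}]", "", text)       # LaTeX grouping symbols
    text = re.sub(r"[^\w\s.]", " ", text)  # punctuation such as ':' and ','
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

# Reproduces the second row of Table 5:
# 'xquery 1.0 xml query language w3c working draft'
cleaned = preprocess("{XQ}uery 1.0: An {XML} Query Language, {W3C} Working Draft")
```

The period is deliberately kept so that version strings like "1.0" survive, matching the examples in Table 5.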
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Post Division</head><p>It can be observed from the data distribution that some users of posts in the test data exist in the training data (54%) and some do not (46%). Based on this, we divide the posts in the test dataset into two categories according to the existence of their user in the training data: existed-user posts and non-existed-user posts. Likewise, the posts can be divided into two categories according to the existence of their resource in the training data: existed-resource posts and non-existed-resource posts.</p><p>Combining the two, the posts fall into four categories according to their user and resource status in the training data: existed-user existed-resource posts, existed-user non-existed-resource posts, non-existed-user existed-resource posts, and non-existed-user non-existed-resource posts.</p><p>We introduce the symbols shown in Table <ref type="table" target="#tab_5">6</ref> to simplify the language. Table <ref type="table" target="#tab_6">7</ref> and Table <ref type="table" target="#tab_7">8</ref> show statistics after post division on the recovered '08 data. It can be observed that the categories are far from evenly distributed: EUNR posts make up about 82.80% of all BOOKMARK posts, and NUNR posts make up about 93.43% of all BIBTEX posts. To improve performance on the test dataset, we should therefore focus on the categories with the highest proportions: EUNR posts in BOOKMARK and NUNR posts in BIBTEX.</p><p>After data division, the following steps are carried out for the tag recommendation task.</p><p>1. Extract candidate tags by different methods according to the category of the post.</p><p>2. Rank the candidate tags, and select the top-ranked tags as recommendations.</p></div>
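The four-way post division reduces to two set-membership tests; a minimal sketch (function and variable names are ours):

```python
def categorize_post(user_id, resource_id, train_users, train_resources):
    """Assign a post to one of the paper's four categories.

    E = existed in the training data, N = non-existed;
    U = user, R = resource (so e.g. EUNR = existed user, non-existed resource).
    """
    u = "EU" if user_id in train_users else "NU"
    r = "ER" if resource_id in train_resources else "NR"
    return u + r

# Toy training-data index sets for illustration.
train_users = {1, 2}
train_resources = {"r1"}
```

In practice `train_users` and `train_resources` would be built once from the TAS table of the training dump.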
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Candidate tags extraction</head><p>According to statistics on the sources of tags in the dataset, tags are mainly retrieved from three sources: 1. The content information of the post, such as the 'description' field in BOOKMARK and the 'title' field in BIBTEX. 2. T {R j }: The tags previously assigned to the same resource.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.T {U i }:</head><p>The tags previously assigned by the same user. Statistics of tags from different sources for BOOKMARK and BIBTEX posts are listed in Table <ref type="table" target="#tab_8">9</ref> and Table <ref type="table" target="#tab_9">10</ref>. The four categories of test posts have different characteristics; for example, for EUER posts we can exploit the tags previously assigned by the user and the tags previously assigned to the resource, but for NUNR posts this information is missing. So we construct different features for the four categories of posts individually, so that the available information can be used sufficiently. In the following, when using the supervised ranking model, we train four models to handle these four categories of posts separately.</p><p>The candidate tag extraction strategies for the different categories of posts are: For EUER and NUER posts, CT {P ij } = { terms in post (P ij )} ∪ T {R j }.</p><p>For EUNR and NUNR posts, CT {P ij } = { terms in post (P ij )}. We denote the candidate tags for the post whose user id is i and resource is j as CT {P ij }; { terms in post (P ij )} denotes the set of words remaining after trimming and stop-word removal in the text information of post P ij .</p><p>Note that we do not take T {U i } (the user's previous tags) as candidate tags, because this tag set is too large: when these tags are added, the precision of the system drops and the F1 value on the whole dataset declines dramatically. However, in the ranking procedure we use membership in T {U i } as one of the features of the SVM model for ranking the candidate tags.</p></div>
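The category-dependent candidate extraction strategy can be sketched as follows (a simplified stand-in for the paper's pipeline; `resource_tags` corresponds to T{R_j}):

```python
def candidate_tags(post_terms, category, resource_tags):
    """Candidate tag set per post category (Section 4.3).

    For EUER/NUER posts the resource's previous tags T{R_j} are added;
    for EUNR/NUNR posts only the post's own terms are used. The user's
    previous tags T{U_i} are deliberately NOT added (they only enter
    as a ranking feature).
    """
    cands = set(post_terms)
    if category in ("EUER", "NUER"):
        cands |= set(resource_tags)
    return cands
```

For example, an EUER post with terms {web, ajax} and resource tags {web, js} gets the candidate set {web, ajax, js}.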
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">SVM Features construction</head><p>When using the SVM, we select features that discriminate well between high-ranked and low-ranked tags, adding features based on our experience. One example is the term frequency in the post content: words with high term frequency in the post content tend to rank higher than words with low term frequency. Whether a candidate word has been used as a tag for other posts in the training data is also an excellent feature.</p><p>Table <ref type="table" target="#tab_10">11</ref> gives a brief description of the features of the ranking SVM model for BOOKMARK posts. The features for BIBTEX posts are almost the same, except for the different data fields.</p></div>
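A hypothetical feature vector for one candidate tag, covering only the three signals named in the prose (term frequency, prior use as a tag anywhere, membership in the user's previous tags T{U_i}); the full feature list of Table 11 is not reproduced here:

```python
def tag_features(tag, post_terms, user_tags, global_tags):
    """Illustrative feature vector for one candidate tag.

    These three features paraphrase Section 4.4; they are an assumed
    subset, not the paper's exact Table 11 feature set.
    """
    # Relative term frequency of the candidate within the post content.
    tf = post_terms.count(tag) / max(len(post_terms), 1)
    # Has this word ever been used as a tag in the training data?
    used_before = 1.0 if tag in global_tags else 0.0
    # Is it among the user's previously assigned tags T{U_i}?
    in_user_tags = 1.0 if tag in user_tags else 0.0
    return [tf, used_before, in_user_tags]
```

Each candidate tag of a post becomes one feature vector, and all candidates of the same post form one ranking group for the Ranking SVM.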
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Analysis of Model</head><p>Table <ref type="table" target="#tab_11">12</ref> and Table <ref type="table" target="#tab_12">13</ref> show the results of our supervised Ranking SVM model on the recovered '08 data.</p><p>Combining the different types and categories of data, we obtain the overall performance on the recovered '08 test data, shown in Table <ref type="table" target="#tab_13">14</ref>. The F1 value is 0.167, less than the F1 value of 0.193 achieved by the first-ranked team in the '08 competition.</p><p>It can be observed from the results that the model performs poorly on EUNR posts, which make up most of the BOOKMARK posts, while it performs well on EUER posts. Comparing the two, the only difference is that the candidate tags of EUER posts come not only from the post content but also from the tags of the same resource in the training data, whereas the candidate tags of EUNR posts come from the post content only. To overcome this lack of candidate tags, we relax the definition of "the same resource": for posts whose resources do not appear in the training data, the role of the same post is taken by similar posts. This method is based on the assumption that users tend to tag similar posts with the same tags.</p><p>We use post content similarity to measure the similarity of posts. For EUNR posts, which have no identical resources in the training data, we add the tags of those training posts whose content similarity with the current post exceeds a certain threshold to the post's candidate tag set.</p></div>
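The similarity-based expansion of the candidate set just described can be sketched as follows. Plain term-frequency cosine similarity is used to keep the sketch self-contained; the paper's experiments delegate TF-IDF scoring to Lucene:

```python
import math
from collections import Counter

def cosine_sim(text_a, text_b):
    """Cosine similarity of two posts in a bag-of-words vector space.
    (The paper weights terms by TF*IDF via Lucene; plain TF is used here.)"""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def expand_candidates(post_terms, post_text, train_posts, t=0.5):
    """Add the tags of training posts whose content similarity with the
    current EUNR post exceeds the threshold t (t=0.5 per Section 4.6)."""
    cands = set(post_terms)
    for text, tags in train_posts:
        if cosine_sim(post_text, text) > t:
            cands |= set(tags)
    return cands
```

With this expansion, an EUNR post inherits the tags of sufficiently similar training posts in addition to its own terms.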
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Post content similarity based KNN model</head><p>For an EUNR post, the candidate tags come from the text of the post content only, that is, CT {P ij } = { terms in post (P ij )}. We attribute the poor performance of the model on this kind of data to the sparsity of candidate tags, so we use content similarity to expand the candidate tag set. For any EUNR post P ij , we set a similarity threshold t and find in the training dataset the posts P mn with sim(text(P ij ), text(P mn )) &gt; t. The tags of each such post P mn are then added to the candidate tags of P ij : CT {P ij } = { terms in post (P ij )} ∪ T {P mn }.</p><p>The post contents of P ij and P mn are mapped into a vector space: text(P ij ) = {W 1 , W 2 , ..., W n } and text(P mn ) = {W 1 ′ , W 2 ′ , ..., W n ′ }. We then use the vector space model to calculate the similarity between the two posts P ij and P mn :</p><p>sim(text(P ij ), text(P mn )) = text(P ij ) * text(P mn )</p><formula xml:id="formula_2">|text(P ij )| * |text(P mn )|<label>(2)</label></formula><p>W i is the weight of word i in the content. The simplest way to define W i is: W i = 1 if word i appears in the post content, and W i = 0 otherwise. In our experiment, we define W i as TF (term frequency) multiplied by IDF (inverse document frequency): W i = T F i * IDF i . We applied the open-source software Lucene to calculate the similarity of two contents; the scoring function of Lucene is a derivation of the vector space model formula using the TF/IDF weighting scheme.</p><p>The performance on EUNR content in BOOKMARK for various values of the threshold T is shown in Figure <ref type="figure" target="#fig_1">2</ref>.</p><p>It can be observed that recall, precision, and F1 all reach their highest values when the threshold T=0.5. So, in the further experiment settings, we set the threshold value T to 0.5. 
However, we find that the content similarity based KNN model works for BOOKMARK posts but not for BIBTEX posts. After investigation, we attribute this to the uneven distribution between the training and test datasets. The training dataset contains 184,655 BOOKMARK posts and 49,479 BIBTEX posts, while the test dataset contains 20,647 BOOKMARK posts and 42,545 BIBTEX posts (see Table <ref type="table" target="#tab_1">2</ref>). It is easy for 20,647 BOOKMARK test posts to find similar posts among 184,655 training posts, but difficult for 42,545 BIBTEX test posts to do so among only 49,479. So this method is especially useful for BOOKMARK posts but not for BIBTEX posts.</p><p>After applying the content similarity based KNN model to BOOKMARK EUNR posts, the performance on the overall test dataset is listed in Table <ref type="table" target="#tab_14">15</ref>. The F1 value is 0.238, higher than the F1 value of 0.193 achieved by the first-ranked team in the '08 competition.</p><p>5 Experiment on 09's dataset</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Statistics of 09's dataset</head><p>Table <ref type="table" target="#tab_15">16</ref> and Table <ref type="table" target="#tab_16">17</ref> show the distribution of the different categories of posts in the '09 dataset after data division according to the existence of their user and resource in the training data. In our experiments on the '09 test data, the clean-dump dataset is used as training data for Task 1, and the post-core dataset is used as training data for Task 2. The statistics show that the distribution of categories in the '09 test data for Task 1 agrees with the recovered '08 dataset: EUER posts make up most of the BOOKMARK posts and NUNR posts make up a large proportion of the BIBTEX posts, so we can expect our model to perform well on such data. All posts in the '09 test dataset for Task 2 are EUER posts; given the good performance of our model on EUER posts, we can also expect a good result on Task 2.</p><p>Eight different models are trained on the '09 clean-dump training data and applied to the '09 test data for Task 1. For Task 2, we apply the BOOKMARK EUER model and the BIBTEX EUER model trained on the '09 post-core dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Experiment results on 09's test dataset</head><p>The performance on the whole '09 test data for both Task 1 and Task 2 is shown in Table <ref type="table" target="#tab_17">18</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>In this paper, we describe an approach that utilizes a supervised ranking model for tag recommendation. Our tag prediction consists of three steps. First, posts are divided into four categories according to the existence of the user and the resource in the training data, and candidate tags are extracted for the different categories with different strategies. Second, features are constructed according to the categories. Third, we rank the candidate tags using the supervised ranking model and pick the top-ranked tags as recommendations.</p><p>For existed-user non-existed-resource posts, we use a post content similarity based KNN model to expand the candidate tag set; on the '08 dataset, the performance of the corresponding module improves after adding this model. Our tag recommendation system combines these two models and is applied to the '09 tag recommendation Task 1 and Task 2.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1 .</head><label>1</label><figDesc>Fig. 1. Example of ranking SVM model</figDesc><graphic coords="4,222.60,369.58,170.00,113.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 2 .</head><label>2</label><figDesc>Fig. 2. KNN performance on various threshold t on BOOKMARK EUNR posts, k=5</figDesc><graphic coords="12,222.60,115.90,170.00,113.30" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Algorithm for ranking the candidate tags</figDesc><table><row><cell>Input: candidate tags {CT1, CT2, ..., CTn}</cell></row><row><cell>Output: top-k tags {CT ′ 1 , CT ′ 2 , ..., CT ′ k }</cell></row><row><cell>1. Extract features x = {xi}(i = 1, 2, ..., n) for the sequence</cell></row><row><cell>of candidate tags CT {Pij } = {CT1, CT2, ..., CTn}.</cell></row><row><cell>2. Rank the candidate tags using the learned ranking model as</cell></row><row><cell>{CT ′ 1 , CT ′ 2 , ..., CT ′ n }.</cell></row><row><cell>3. Select top-k tags {CT ′ 1 , CT ′ 2 , ..., CT ′</cell></row></table><note>k } as recommending tags.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Statistics of posts on recovered 08's dataset</figDesc><table><row><cell>Post in recovered training data</cell><cell>234,134</cell><cell>BOOKMARK 184,655 BIBTEX 49,479</cell></row><row><cell>Post in recovered test data</cell><cell>63,192</cell><cell>BOOKMARK 20,647 BIBTEX 42,545</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Statistics of posts according to their user and resource status</figDesc><table><row><cell>Users in recovered test data appear in recovered training data</cell><cell>265</cell></row><row><cell>Users in recovered test data do not appear in recovered training data</cell><cell>225</cell></row><row><cell>Resources in recovered test data appear in recovered training data</cell><cell>1230</cell></row><row><cell>Resources in recovered test data do not appear in recovered training data</cell><cell>61970</cell></row></table><note>Data format description: The dataset used in the experiments is released by ECML. The data consists of three tables: the TAS table, the BOOKMARK table and the BIBTEX table. Table 4 describes the fields of the three tables; only the fields used in the experiments are listed.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>Data fields of TAS, BOOKMARK and BIBTEX</figDesc><table><row><cell>Table name</cell><cell>Fields name</cell></row><row><cell>TAS</cell><cell>user, tag, content id, content type, date</cell></row><row><cell>BOOKMARK</cell><cell>content id (matches tas.content id), url, description, extended description, date, bibtex</cell></row><row><cell>BIBTEX</cell><cell>content id (matches tas.content id), simhash1 (hash for duplicate detection among users), title</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 5 .</head><label>5</label><figDesc>Example results of data preprocess</figDesc><table><row><cell>Before data preprocess</cell><cell>After data preprocess</cell></row><row><cell>Ben Mezrich: the telling of a true story</cell><cell>ben mezrich telling true story</cell></row><row><cell>{XQ}uery 1.0: An {XML} Query Language, {W3C} Working Draft</cell><cell>xquery 1.0 xml query language w3c working draft</cell></row></table><note>Some resources of posts exist in the training data (2%) and others do not exist in the training data (98%).</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 6 .</head><label>6</label><figDesc>Simplified symbols</figDesc><table><row><cell>EUER post</cell><cell>Existed user existed resource post</cell></row><row><cell>EUNR post</cell><cell>Existed user non-existed resource post</cell></row><row><cell>NUER post</cell><cell>Non-existed user existed resource post</cell></row><row><cell>NUNR post</cell><cell>Non-existed user non-existed resource post</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table 7 .</head><label>7</label><figDesc>Distribution of different categories of BOOKMARK posts in test dataset</figDesc><table><row><cell>Category</cell><cell>Posts number</cell><cell>ratio</cell></row><row><cell>EUER post</cell><cell>621</cell><cell>3.01%</cell></row><row><cell>EUNR post</cell><cell>17099</cell><cell>82.80%</cell></row><row><cell>NUER post</cell><cell>346</cell><cell>1.68%</cell></row><row><cell>NUNR post</cell><cell>2585</cell><cell>12.52%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_7"><head>Table 8 .</head><label>8</label><figDesc>Distribution of different categories of BIBTEX posts in test dataset</figDesc><table><row><cell>Category</cell><cell>Posts number</cell><cell>ratio</cell></row><row><cell>EUER post</cell><cell>164</cell><cell>0.39%</cell></row><row><cell>EUNR post</cell><cell>2532</cell><cell>5.95%</cell></row><row><cell>NUER post</cell><cell>99</cell><cell>0.23%</cell></row><row><cell>NUNR post</cell><cell>39754</cell><cell>93.43%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_8"><head>Table 9 .</head><label>9</label><figDesc>Statistics of the tags from 3 sources of BOOKMARK posts</figDesc><table><row><cell>Total tags</cell><cell>56267</cell></row><row><cell>Tags from terms of description</cell><cell>5253</cell></row><row><cell>Tags from terms of URL</cell><cell>1353</cell></row><row><cell>Tags from user's previous tags</cell><cell>29672</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_9"><head>Table 10 .</head><label>10</label><figDesc>Statistics of the tags from 3 sources of BIBTEX posts</figDesc><table><row><cell>Total tags</cell><cell>95782</cell></row><row><cell>Tags from terms of title</cell><cell>41801</cell></row><row><cell>Tags from terms of URL</cell><cell>547</cell></row><row><cell>Tags from user's previous tags</cell><cell>5377</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_10"><head>Table 11 .</head><label>11</label><figDesc>Some of the features for ranking SVM model for BOOKMARK</figDesc><table><row><cell>Feature1</cell><cell>Candidate tag's TF (term frequency) in post's description terms.</cell></row><row><cell>Feature2</cell><cell>Candidate tag's TF in post's URL terms.</cell></row><row><cell>Feature3</cell><cell>Candidate tag's TF in post's extended description terms.</cell></row><row><cell>Feature4</cell><cell>Candidate tag's TF in T {Rj } (tags assigned to the post of the same URL in the training data).</cell></row><row><cell>Feature5</cell><cell>Candidate tag's TF in T {Ui} (tags assigned previously by the user in the training data).</cell></row><row><cell>Feature6</cell><cell>Times the candidate tag was assigned as a tag in the training data.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_11"><head>Table 12 .</head><label>12</label><figDesc>Individual and overall performance on BOOKMARK posts</figDesc><table><row><cell>Post category</cell><cell>Recall</cell><cell>Precision</cell><cell>F1-value</cell><cell>ratio</cell></row><row><cell>EUER Post</cell><cell>0.369699</cell><cell>0.394973</cell><cell>0.381918</cell><cell>3.01%</cell></row><row><cell>EUNR Post</cell><cell>0.046591</cell><cell>0.053739</cell><cell>0.04991</cell><cell>82.80%</cell></row><row><cell>NUER Post</cell><cell>0.160883</cell><cell>0.255652</cell><cell>0.197487</cell><cell>1.68%</cell></row><row><cell>NUNR Post</cell><cell>0.069158</cell><cell>0.106366</cell><cell>0.083819</cell><cell>12.52%</cell></row><row><cell>Overall performance on BOOKMARK</cell><cell>0.061067</cell><cell>0.073997</cell><cell>0.066633</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_12"><head>Table 13 .</head><label>13</label><figDesc>Individual and overall performance on BIBTEX posts</figDesc><table><row><cell>Post category</cell><cell>Recall</cell><cell>Precision</cell><cell>F1-value</cell><cell>ratio</cell></row><row><cell>EUER Post</cell><cell>0.4219356</cell><cell>0.3472393</cell><cell>0.3809605</cell><cell>0.39%</cell></row><row><cell>EUNR Post</cell><cell>0.2250226</cell><cell>0.1628605</cell><cell>0.1889605</cell><cell>5.95%</cell></row><row><cell>NUER Post</cell><cell>0.5667162</cell><cell>0.3715986</cell><cell>0.4488706</cell><cell>0.23%</cell></row><row><cell>NUNR Post</cell><cell>0.3561221</cell><cell>0.1603686</cell><cell>0.2211494</cell><cell>93.43%</cell></row><row><cell>Overall performance on BIBTEX</cell><cell>0.349063</cell><cell>0.161732</cell><cell>0.220381</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_13"><head>Table 14 .</head><label>14</label><figDesc>Overall performance on test dataset using ranking SVM model</figDesc><table><row><cell>Recall</cell><cell>Precision</cell><cell>F1-value</cell></row><row><cell>0.153</cell><cell>0.185</cell><cell>0.167</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_14"><head>Table 15 .</head><label>15</label><figDesc>Overall performance on test dataset after adding the content similarity based KNN model</figDesc><table><row><cell>Recall</cell><cell>Precision</cell><cell>F1-value</cell></row><row><cell>0.323828</cell><cell>0.200926</cell><cell>0.238803</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_15"><head>Table 16 .</head><label>16</label><figDesc>Different categories of BOOKMARK posts in 09's test dataset for Task 1</figDesc><table><row><cell>Category</cell><cell>Posts number</cell><cell>ratio</cell></row><row><cell>EUER Post</cell><cell>821</cell><cell>4.86%</cell></row><row><cell>EUNR Post</cell><cell>10622</cell><cell>62.86%</cell></row><row><cell>NUER Post</cell><cell>872</cell><cell>5.16%</cell></row><row><cell>NUNR Post</cell><cell>4583</cell><cell>27.12%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_16"><head>Table 17 .</head><label>17</label><figDesc>Different categories of BIBTEX posts in 09's test dataset for Task 1</figDesc><table><row><cell>Category</cell><cell>Posts number</cell><cell>ratio</cell></row><row><cell>EUER Post</cell><cell>365</cell><cell>1.40%</cell></row><row><cell>EUNR Post</cell><cell>9287</cell><cell>35.71%</cell></row><row><cell>NUER Post</cell><cell>591</cell><cell>2.27%</cell></row><row><cell>NUNR Post</cell><cell>15761</cell><cell>60.61%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_17"><head>Table 18 .</head><label>18</label><figDesc>Performance on 09's dataset @5</figDesc><table><row><cell>Task No.</cell><cell>Submission ID</cell><cell>Precision</cell><cell>Recall</cell><cell>F1-value</cell></row><row><cell>1</cell><cell>67797</cell><cell>0.162478</cell><cell>0.146582</cell><cell>0.154121</cell></row><row><cell>2</cell><cell>13651</cell><cell>0.31622</cell><cell>0.222065</cell><cell>0.260908</cell></row></table></figure>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgement</head><p>Thanks to Zhen Liao for his helpful discussions and suggestions for this paper. This paper is supported by the National Natural Science Foundation of China under the grant 60673009 and China National Hanban under the grant 2007-433.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Tag Recommendation for Folksonomies Oriented towards Individual Users</title>
		<author>
			<persName><forename type="first">Marek</forename><surname>Lipczak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ECML</title>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">RSDC&apos;08: Tag Recommendations using Bookmark Content</title>
		<author>
			<persName><forename type="first">Marta</forename><surname>Tatu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Munirathnam</forename><surname>Srikanth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>D&apos;Silva</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ECML PKDD Discovery Challenge 2008</title>
				<meeting>ECML PKDD Discovery Challenge 2008<address><addrLine>RSDC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Multilabel Text Classification for Automated Tag Suggestion</title>
		<author>
			<persName><forename type="first">Ioannis</forename><surname>Katakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Grigorios</forename><surname>Tsoumakas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ioannis</forename><surname>Vlahavas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ECML PKDD Discovery Challenge</title>
				<meeting>ECML PKDD Discovery Challenge<address><addrLine>RSDC</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">2008. 2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Paul</forename><surname>Heymann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><surname>Ramage</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hector</forename><surname>Garcia-Molina</surname></persName>
		</author>
		<title level="m">Social Tag Prediction, SIGIR&apos;08</title>
				<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2008">July 20-24, 2008</date>
			<biblScope unit="page" from="531" to="538" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">AutoTag: A Collaborative Approach to Automated Tag Assignment for Weblog Posts</title>
		<author>
			<persName><forename type="first">Gilad</forename><surname>Mishne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">WWW</title>
		<imprint>
			<biblScope unit="page" from="953" to="954" />
			<date type="published" when="2006-05-22">2006. May 22-26, 2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Tag Recommendations in Folksonomies</title>
		<author>
			<persName><forename type="first">Robert</forename><surname>Jäschke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Leandro</forename><surname>Marinho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Andreas</forename><surname>Hotho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lars</forename><surname>Schmidt-Thieme</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Gerd</forename><surname>Stumme</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">PKDD 2007</title>
				<editor>
			<persName><forename type="first">J</forename><forename type="middle">N</forename><surname>Kok</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="volume">4702</biblScope>
			<biblScope unit="page" from="506" to="514" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">Adriana</forename><surname>Budura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sebastian</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Philippe</forename><surname>Cudré-Mauroux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Karl</forename><surname>Aberer</surname></persName>
		</author>
		<title level="m">Neighborhood-based Tag Prediction, 6th Annual European Semantic Web Conference (ESWC2009)</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Large margin rank boundaries for ordinal regression</title>
		<author>
			<persName><forename type="first">Ralf</forename><surname>Herbrich</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thore</forename><surname>Graepel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Klaus</forename><surname>Obermayer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the eighth ACM SIGKDD international conference on Knowledge Discovery and Data Mining</title>
				<meeting>the eighth ACM SIGKDD international conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<biblScope unit="volume">02</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Adapting Ranking SVM to Document Retrieval</title>
		<author>
			<persName><forename type="first">Yunbo</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jun</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Tie-Yan</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hang</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yalou</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hsiao-Wuen</forename><surname>Hon</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">SIGIR&apos;06</title>
				<meeting><address><addrLine>Seattle, Washington, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2006">August 6-11, 2006</date>
			<biblScope unit="page" from="186" to="193" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
