=Paper=
{{Paper
|id=Vol-1176/CLEF2010wn-WePS-YoshidaEt2010
|storemode=property
|title=ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management
|pdfUrl=https://ceur-ws.org/Vol-1176/CLEF2010wn-WePS-YoshidaEt2010.pdf
|volume=Vol-1176
}}
==ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management==
ITC-UT: Tweet Categorization by Query Categorization for On-line Reputation Management

Minoru Yoshida, Shin Matsushima, Shingo Ono, Issei Sato, and Hiroshi Nakagawa
University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo 113-0033
{mino,masin,ono,sato,nakagawa}@r.dl.itc.u-tokyo.ac.jp

Abstract. This paper describes our system, called ITC-UT, for task 2 (the on-line reputation management task) of WePS-3. Our idea is to categorize each query into 3 or 4 classes according to how often the tweets retrieved by the query contain "true" entity names that refer to the target entity, and then to categorize each tweet by rules defined for each class of queries. We show the evaluation results for our system along with the details of the query categorization results.

Keywords: Organization Name Disambiguation, Two-Stage Algorithm, Naive Bayes, Twitter

1 Introduction

This paper reports the algorithms and results of the ITC-UT (Information Technology Center, the University of Tokyo) team for the WePS-3 task 2 (the on-line reputation management task). The task supposes a situation in which you search Twitter for the reputation of some organization. Assuming that tweets are retrieved by an organization-name query, the problem is to decide whether each occurrence of the organization name in a tweet refers to the target organization or not (e.g., "Apple PC" for the former and "Apple Pie" for the latter, given the query "Apple"). This is one type of the name disambiguation problems that have been extensively studied in previous WePS workshops [1, 2]. However, the current task setting is challenging because each tweet is generally short and provides little context for disambiguation.

Our algorithm is based on the intuition that organization names can be classified into "organization-like names" and "general-word-like names", such as "McDonald's" for the former and "Pioneer" for the latter. This intuition is supported by the fact that the ratio of TRUE(1) (or FALSE) tweets in the training data varies widely from entity to entity. For example, over 98% of tweets were labeled TRUE for the entity "nikon", while the ratio for the entity "renaissance technologies" (for which the query term was "Renaissance") was under 1%. Our strategy is to make aggressive use of this imbalance by predicting whether each query in the test set is biased towards TRUE or FALSE, as described in detail in Section 3.1. Heuristic rules suited to the bias of the query are then applied to categorize the tweets. For instance, if a query is highly likely to be an organization name, each tweet is labeled TRUE unless some strong evidence indicates the opposite. The details are described in Section 3.2.

(1) TRUE indicates that the tweet mentions the target organization (as defined in the next section). FALSE indicates the opposite.

2 Task Definitions

In this section, we briefly give the definition of the task as needed for the description of our algorithm. Both the training and the test data contain the entity name (e.g., "marriott international"), the query term used to retrieve tweets (e.g., "Marriott"), the URL of the entity's home page, and 700 tweets (per entity name) retrieved by the query term. The training data also contain a label, "TRUE" or "FALSE", for each tweet that indicates whether the tweet mentions the entity or not. The task is to predict whether each tweet in the test data (provided with no label) is TRUE (i.e., mentions the entity) or FALSE (i.e., does not mention the entity).
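The data just described can be held as a simple per-entity record. The following minimal Python sketch shows one possible in-memory representation; the class and field names are our own illustrative choices, not part of the official data format.

<pre>
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TaskItem:
    """One entity in the WePS-3 task-2 data, as described above.

    Field names are illustrative; the official distribution defines
    its own file format.
    """
    entity_name: str                     # e.g., "marriott international"
    query: str                           # e.g., "Marriott"
    homepage_url: str                    # URL of the entity's home page
    tweets: List[str]                    # 700 tweets retrieved by the query
    labels: Optional[List[bool]] = None  # TRUE/FALSE per tweet (training data only)
</pre>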
3 Algorithm

As mentioned above, our algorithm is divided into two stages: the query categorization stage (stage 1) and the tweet categorization stage (stage 2). In this section, we describe each stage in more detail.

3.1 Stage 1: Query Categorization

The first stage categorizes each query into 3 or 4 classes according to the confidence of "how strongly the query indicates the given organization if no context is given". For the training data, the class of each query was determined by the ratio (represented by t) of the number of TRUE tweets to the number of all tweets for the query. We used two configurations for the number of classes: 3 and 4. In the 3-class setting, each query is categorized into:

class 1: TRUE-biased queries: if t > θ1,
class 2: FALSE-biased queries: if t < θ2,
class 3: neutral queries: otherwise.

In the 4-class setting, each query is categorized into:

class 1: TRUE-biased queries: if t > θ1,
class 2: FALSE-biased queries: if t < θ2,
class 3: neutral queries: if θ3 < t ≤ θ1,
class 4: weakly FALSE-biased queries: otherwise.

The threshold values θi were determined manually by looking at the training data. The values were θ1 = 0.66... and θ2 = 0.33... for 3-class labeling, and θ1 = 0.9, θ2 = 0.1, and θ3 = 0.5 for 4-class labeling.

For categorization, we did not use linguistic features (e.g., frequent words in tweets) other than very simple ones obtained by pattern matching (such as the "Is the query an acronym?" feature described below), because useful linguistic features for classification seem to differ from entity to entity, and it is difficult to find features common to the training and test data. Instead, we made extensive use of metadata such as URLs. The categorization was performed by a simple Naive Bayes classifier (in the Weka(2) toolkit) with the following 6 binary features; a code sketch of the stage-1 procedure appears after this list.

Is the query identical to the entity name? This feature value is true for the query "Apple" for the entity "Apple" and false for the query "Amazon" for the entity "Amazon.com", for example. It is introduced based on the intuition that a difference between the query and the entity name suggests that the entity requires the full name to be specified, such as "Delta Holding", which may tend to be confused with other organizations, including "Delta Air Lines", when the query "Delta" is used.

Does the domain name in the URL include the query or entity name? This feature value is true if, for example, the URL can be described by the regular expression http://(www.)?apple.[a-z]/ for the query "Apple". This feature being true may indicate that the organization has its own domain and is therefore not so minor an organization.

Does Wikipedia have a "disambiguation page" for the query? This feature is introduced based on the intuition that highly ambiguous names, for which the disambiguation task is difficult, might have a disambiguation page in Wikipedia (www.wikipedia.org).

Is the query an acronym? This feature is based on the observation that acronyms tend to be highly ambiguous because they typically have only 2 or 3 characters, and therefore many different concepts are expressed by the same acronym.

Does the given URL indicate the top page of the Web search results? If the given entity is a major concept represented by the query word, the URL for the entity will come first in the result list when we enter the query into an internet search engine, in which case the feature value is set to "true".

Is the query an entry of a dictionary? This feature is introduced to detect whether the query word is a general word or not. If it is, there is a risk of the query being used not as the specific organization name but as a general word.

(2) http://www.cs.waikato.ac.nz/ml/weka/
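As a concrete illustration of stage 1, here is a minimal Python sketch of the threshold labeling of training queries and of the six binary features, with scikit-learn's BernoulliNB standing in for the Weka Naive Bayes classifier used in the paper. The lookup sets wiki_disambig, top_search_hit, and dictionary_words are assumed precomputed stand-ins for the Wikipedia, search-engine, and dictionary checks, and the acronym test is a simple stand-in heuristic; all of these are assumptions of this sketch.

<pre>
import re
from sklearn.naive_bayes import BernoulliNB

def label_training_query(t, n_classes=3):
    """Assign a class to a training query from its ratio t of TRUE tweets."""
    if n_classes == 3:                 # theta1 = 0.66..., theta2 = 0.33... (read as 2/3, 1/3)
        if t > 2.0 / 3.0: return 1     # TRUE-biased
        if t < 1.0 / 3.0: return 2     # FALSE-biased
        return 3                       # neutral
    if t > 0.9: return 1               # theta1 = 0.9
    if t < 0.1: return 2               # theta2 = 0.1
    if t > 0.5: return 3               # neutral: theta3 < t <= theta1
    return 4                           # weakly FALSE-biased

def query_features(query, entity_name, url,
                   wiki_disambig, top_search_hit, dictionary_words):
    """The six binary features; the three lookup sets are illustrative stand-ins."""
    q = query.lower()
    return [
        int(q == entity_name.lower()),                    # query identical to entity name?
        int(re.search(r'https?://(www\.)?[^/]*' + re.escape(q),
                      url.lower()) is not None),          # domain contains query?
        int(q in wiki_disambig),                          # Wikipedia disambiguation page?
        int(query.isupper() and 2 <= len(query) <= 4),    # acronym? (stand-in heuristic)
        int(q in top_search_hit),                         # entity URL tops search results?
        int(q in dictionary_words),                       # dictionary entry?
    ]

# Training and prediction, given feature vectors X and classes y
# produced by label_training_query for the training queries:
#   clf = BernoulliNB().fit(X, y)
#   predicted_classes = clf.predict(X_test)
</pre>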
3.2 Stage 2: Tweet Categorization

Stage 2 categorizes each tweet as "mentioning the organization" (TRUE) or not (FALSE). The categorization is decided by simple heuristic rules defined for each class of queries. The system obtains the Part of Speech (POS) tags and Named Entity (NE) labels of the queries in each tweet using the Stanford POS tagger(3) and NE Recognizer(4). Each tweet is categorized by rules that use these extracted POS and NE labels. The rules are defined for each class of queries as follows; a code sketch of all four rule sets follows the class descriptions.

Class 1: TRUE-Biased Queries. Each tweet for this class is categorized as TRUE unless the following rules strongly suggest that the query represents something other than an organization.

1. If the NE tag of the query is "PERSON" or "LOCATION", label FALSE.
2. Otherwise, label TRUE.

Class 2: FALSE-Biased Queries. Contrary to the class 1 rules, a tweet for this class of queries is categorized as FALSE unless the following rules strongly suggest that the query does represent the organization.

1. If the entity name consists of two or more words (such as "Cisco Systems") and it is contained in the tweet, label TRUE.
2. If the tweet contains the URL for the entity, label TRUE.
3. Otherwise, label FALSE.

Class 3: Neutral Queries. The rules for tweets for class 3 queries are the same as the rules for class 1, except that we add another rule (the second one) to detect FALSE tweets, because the ratio of FALSE tweets may be larger than for class 1. The rules for class 3 are therefore defined as follows.

1. If the NE tag of the query is "PERSON" or "LOCATION", label FALSE.
2. If the POS tag of the query is not a proper noun, label FALSE.
3. Otherwise, label TRUE.

We have another version of the rules that replaces the second rule with the one below. This difference between the versions adjusts the filtering power of the additional rule: the rule above is stronger (filtering out, i.e., labeling FALSE, more tweets) and the one below is weaker (filtering out fewer tweets).(5) We call the original version of rule 2 the strong filter and the alternative one the weak filter.

2'. If the POS tag of the query is not a noun, label FALSE.

Class 4: Weakly FALSE-Biased Queries. This class is optional, and the following rules are used. They are the same as the rules for class 2, except that we add another rule (the third one) to find more TRUE tweets, because more TRUE tweets are expected for this class than for class 2.

1. If the entity name consists of two or more words and it is contained in the tweet, label TRUE.
2. If the tweet contains the URL for the entity, label TRUE.
3. If the NE tag of the query is "ORGANIZATION", label TRUE.
4. Otherwise, label FALSE.

(3) http://nlp.stanford.edu/software/tagger.shtml
(4) http://nlp.stanford.edu/software/CRF-NER.shtml
(5) Note that proper nouns are also nouns.
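The four rule sets above can be written compactly. A minimal Python sketch, assuming the Stanford NER label (PERSON, LOCATION, ORGANIZATION) and the Penn Treebank POS tag of the query occurrence have already been extracted for each tweet:

<pre>
PROPER_NOUN_TAGS = {"NNP", "NNPS"}
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # proper nouns are also nouns (footnote 5)

def classify_tweet(tweet, query_class, entity_name, entity_url,
                   ne_tag, pos_tag, weak_filter=False):
    """Return True (tweet mentions the entity) or False, per the class rules above."""
    multiword_match = (" " in entity_name
                       and entity_name.lower() in tweet.lower())
    url_match = entity_url in tweet

    if query_class == 1:                    # TRUE-biased: TRUE unless PERSON/LOCATION
        return ne_tag not in {"PERSON", "LOCATION"}

    if query_class == 2:                    # FALSE-biased: FALSE unless strong evidence
        return multiword_match or url_match

    if query_class == 3:                    # neutral: class 1 rules plus a POS filter
        if ne_tag in {"PERSON", "LOCATION"}:
            return False
        allowed = NOUN_TAGS if weak_filter else PROPER_NOUN_TAGS
        return pos_tag in allowed

    # class 4, weakly FALSE-biased: class 2 rules plus an ORGANIZATION check
    return multiword_match or url_match or ne_tag == "ORGANIZATION"
</pre>

Note that NOUN_TAGS is a superset of PROPER_NOUN_TAGS, which is exactly why the weak filter filters out fewer tweets than the strong filter.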
System Parameters

We used four different configurations for submission, resulting in four runs and outputs. The four configurations are listed below.

ITC-UT 1: 3 classes and the strong filter (proper noun) for the class 3 rules.
ITC-UT 2: 3 classes and the weak filter (noun) for the class 3 rules.
ITC-UT 3: 4 classes and the strong filter (proper noun) for the class 3 rules.
ITC-UT 4: 4 classes and the weak filter (noun) for the class 3 rules.

4 Experimental Results

We participated in the WePS-3 evaluation campaign with the four systems mentioned above. In this section, we report the performance of our methods. As described above, the systems differ in their rules for tweet categorization and in the number of classes for query categorization. These specifications are shown again in Table 1.

Table 1. Specification of each method

Method   | Rules | Number of categories
ITC-UT 1 | NE    | 3
ITC-UT 2 | Noun  | 3
ITC-UT 3 | NE    | 4
ITC-UT 4 | Noun  | 4

The accuracy, precision, recall, and F-measure of each method were calculated for both positive and negative examples (a minimal sketch of these computations appears at the end of this section). We show these values for our algorithms and for the top system (indicated by "LSIR,EPFL 1") in Table 2.

Table 2. Performances of methods

Method      | Accuracy | Precision (pos.) | Recall (pos.) | F-measure (pos.) | Precision (neg.) | Recall (neg.) | F-measure (neg.)
LSIR,EPFL 1 | 0.83     | 0.71             | 0.74          | 0.63             | 0.84             | 0.52          | 0.56
ITC-UT 1    | 0.75     | 0.75             | 0.54          | 0.49             | 0.74             | 0.60          | 0.57
ITC-UT 2    | 0.73     | 0.74             | 0.62          | 0.51             | 0.74             | 0.49          | 0.47
ITC-UT 3    | 0.67     | 0.70             | 0.47          | 0.41             | 0.71             | 0.65          | 0.56
ITC-UT 4    | 0.64     | 0.69             | 0.55          | 0.43             | 0.70             | 0.55          | 0.46

Among our methods, ITC-UT 1 achieved the best accuracy, taking second position in the evaluation campaign. When we introduced the "weakly FALSE-biased" class, performance degraded on most of the measures, while only the recall for negative examples increased, in both cases. It is natural that recall for negative examples increased when we introduced the "weakly FALSE-biased" class, because tweets in this class are more likely to be classified as FALSE than those in the neutral class. The performance drop on the other measures suggests that the number of queries categorized into the "weakly FALSE-biased" class was unnecessarily large, which may be because the conditions specifying this class for the training data were too loose.

As shown in the table, when rule 2 for class 3 was changed from the strong filter (proper noun) to the weak filter (noun), most values degraded, while only the recall for positive examples increased. The weak filter saves (i.e., labels TRUE) more TRUE tweets (i.e., true positives), but it also saves more FALSE tweets (i.e., false positives). The results show that the gain in true positives was surpassed by the increase in false positives.

We also compared our methods with the top system in the campaign (LSIR,EPFL 1). Our algorithms tend to show higher precision for positive examples and higher recall for negative examples, which implies that our methods are biased towards labeling FALSE. We think that our tweet classification rules, especially for class 3 (the "neutral" class), leave much room for improvement.

In Table 3 we show the classification results of the first stage. Roughly speaking, the results indicate that our algorithms could capture the bias of each query. However, it is not fully obvious whether each query was successfully labeled. Note that the labeling of the training queries differed between the 3-class and 4-class settings because the threshold values differ between them. We show the detailed results of labeling the training queries in Table 4. We did not perform any adjustment to tune the threshold values so that the labeling of the training queries better fits the stage-2 rules for each class of queries. We think these threshold values for labeling the training queries can be improved by, for example, cross validation on the training data, or simply by maximizing accuracy on the training data (a grid-search sketch also appears below).
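As promised above, a minimal sketch of the per-class measures reported in Table 2, computed over a single list of gold and predicted labels. The official scorer may aggregate differently (e.g., averaging per query); that is not shown here.

<pre>
def prf(gold, pred, positive=True):
    """Precision, recall, and F-measure for one target class (True or False)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

def accuracy(gold, pred):
    """Fraction of tweets whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
</pre>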
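And a minimal sketch of the simpler tuning option just mentioned: a grid search over the two 3-class thresholds that maximizes training accuracy. Here run_system is a hypothetical helper that relabels the training queries with the given thresholds, applies the stage-2 rules, and returns training accuracy; θ3 for the 4-class setting could be added analogously.

<pre>
def tune_thresholds(run_system, grid=None):
    """Grid-search theta1 > theta2 to maximize accuracy on the training data."""
    grid = grid or [i / 20.0 for i in range(1, 20)]   # 0.05, 0.10, ..., 0.95
    best_acc, best_pair = -1.0, None
    for theta1 in grid:
        for theta2 in grid:
            if theta2 >= theta1:
                continue                              # require theta2 < theta1
            acc = run_system(theta1, theta2)          # hypothetical: stage 1 + stage 2
            if acc > best_acc:
                best_acc, best_pair = acc, (theta1, theta2)
    return best_pair, best_acc
</pre>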
5 Conclusions

This paper reported the ITC-UT system for tweet categorization for the on-line reputation management task. The system uses a two-stage algorithm that categorizes each query in the first stage and categorizes each tweet in the second stage, using rules customized for each class of queries. Our categorization rules are rather simple and therefore still leave room for improvement. For example, we can adjust the threshold values used in stage 1 to label the queries more appropriately for the stage-2 rules. We think we can also improve the results by using more sophisticated tweet categorization rules for each class of queries.

References

1. Artiles, J., Gonzalo, J., Sekine, S.: The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 64–69 (2007)
2. Artiles, J., Sekine, S., Gonzalo, J.: Web people search: Results of the first evaluation and the plan for the second. In: Proceedings of the 17th International Conference on World Wide Web (WWW '08), pp. 1071–1072 (2008)

Table 3. Results of stage 1 (query categorization); the last two columns give the labeled class in the 3-class and 4-class settings

query       | entity                                | 3-class | 4-class
Gibson      | Gibson                                | 1 | 1
Lexus       | Lexus                                 | 1 | 1
McDonald's  | McDonald's                            | 1 | 1
sony        | sony                                  | 1 | 1
Starbucks   | Starbucks                             | 1 | 1
apache      | apache                                | 1 | 1
oracle      | Oracle                                | 1 | 1
friday's    | friday's                              | 1 | 3
Amazon      | Amazon.com                            | 3 | 3
Blizzard    | Blizzard Entertainment                | 3 | 3
fox         | fox channel                           | 3 | 3
jaguar      | Jaguar Cars Ltd.                      | 3 | 3
muse        | muse band                             | 3 | 3
sharp       | Sharp Corporation                     | 3 | 3
Apple       | Apple                                 | 3 | 3
seat        | seat                                  | 3 | 3
subway      | subway                                | 3 | 3
Cisco       | Cisco Systems                         | 3 | 4
ford        | Ford Motor Company                    | 3 | 4
McLaren     | McLaren Group                         | 3 | 4
stanford    | Stanford Junior University            | 3 | 4
Yale        | Yale University                       | 3 | 4
canon       | Canon inc.                            | 3 | 4
CVS         | CVS/pharmacy                          | 3 | 4
emory       | Emory University                      | 3 | 4
GM          | General Motors                        | 3 | 4
MTV         | MTV                                   | 3 | 4
Orange      | Orange                                | 2 | 3
scorpions   | scorpions                             | 2 | 3
sonic       | sonic.net                             | 2 | 3
tesla       | Tesla Motors                          | 2 | 3
johnnie     | Johnnie Walker                        | 2 | 3
Liverpool   | Liverpool FC                          | 2 | 3
mac         | macintosh                             | 2 | 4
camel       | camel                                 | 2 | 2
Denver      | Denver Nuggets                        | 2 | 2
Deutsche    | Deutsche Bank                         | 2 | 2
kiss        | kiss band                             | 2 | 2
jfk         | John F. Kennedy International Airport | 2 | 2
Lloyd       | Lloyds Banking Group                  | 2 | 2
Metro       | Metro supermarket                     | 2 | 2
Milan       | A.C. Milan                            | 2 | 2
Paramount   | Paramount Group                       | 2 | 2
Roma        | A.S. Roma                             | 2 | 2
US          | US Airways                            | 2 | 2
Virgin      | Virgin Media                          | 2 | 2
zoo         | Zoo Entertainment                     | 2 | 2
Table 4. Results of labeling the queries in the training set; the last two columns give the labeled class in the 3-class and 4-class settings

query       | entity                              | 3-class | 4-class
nikon       | nikon                               | 1 | 1
linux       | linux                               | 1 | 1
Lufthansa   | lufthansa                           | 1 | 1
Foxtel      | foxtel                              | 1 | 1
alcatel     | alcatel                             | 1 | 1
Renault     | renault                             | 1 | 1
lamborghini | lamborghini                         | 1 | 1
Yamaha      | yamaha                              | 1 | 1
Fujitsu     | fujitsu                             | 1 | 1
Marriott    | marriott international              | 1 | 1
Marvel      | marvel comics                       | 1 | 3
philips     | philips                             | 1 | 3
Mercedes    | mercedes-benz                       | 1 | 3
Mandalay    | mandalay bay resort and casino      | 1 | 3
armani      | armani                              | 1 | 3
barclays    | barclays                            | 1 | 3
Blockbuster | blockbuster inc.                    | 1 | 3
bayer       | bayer                               | 3 | 3
fender      | fender                              | 3 | 3
cadillac    | cadillac                            | 3 | 3
Rover       | land rover                          | 3 | 3
BART        | bart                                | 3 | 4
Luxor       | luxor hotel and casino              | 3 | 4
Boingo      | boingo (wifi for travelers)         | 3 | 4
MGM         | mgm grand hotel and casino          | 3 | 4
Harpers     | harpers                             | 3 | 4
Edmunds     | edmunds.com                         | 3 | 4
MTA         | mta bike plus (nyc)                 | 3 | 4
Southwest   | southwest airlines                  | 2 | 4
dunlop      | dunlop                              | 2 | 4
Amadeus     | amadeus it group                    | 2 | 4
pioneer     | pioneer company                     | 2 | 2
Magnum      | magnum research                     | 2 | 2
mdm         | mdm (event agency)                  | 2 | 2
MEP         | mep                                 | 2 | 2
Mercer      | mercer consulting                   | 2 | 2
Impulse     | impulse (records)                   | 2 | 2
elf         | elf corporation                     | 2 | 2
Apollo      | apollo hospitals                    | 2 | 2
Craft       | craft magazine                      | 2 | 2
nordic      | nordic airways                      | 2 | 2
Emperor     | emperor entertainment group         | 2 | 2
folio       | folio corporation                   | 2 | 2
Smarter     | smarter travel                      | 2 | 2
Liquid      | liquid entertainment                | 2 | 2
Lynx        | lynx express                        | 2 | 2
bulldog     | bulldog solutions                   | 2 | 2
shin        | shin corporation                    | 2 | 2
pierce      | pierce manufacturing                | 2 | 2
Renaissance | renaissance technologies            | 2 | 2
Mack        | mack group                          | 2 | 2
Delta       | delta holding                       | 2 | 2