Extracting Unambiguous Keywords from Microposts Using Web and Query Logs Data Davi de Castro Reis Felipe Goldstein Frederico Quintao Google Engineering Google Engineering Google Engineering Belo Horizonte, Brazil Paris, France Belo Horizonte, Brazil davi@google.com felipeg@google.com quintao@google.com If a lion could talk, we could not understand him. Keywords (Ludwig Wittgenstein) unambiguity, word sense disambiguation, query logs, web, advertising ABSTRACT In the recent years, a new form of content type has become 1. INTRODUCTION ubiquitous in the web. These are small and noisy text snip- The availability of huge Web based corpora has spawned pets, created by users of social networks such as Twitter and a series of important advancements in the Natural Lan- Facebook. The full interpretation of those microposts by guage Processing (NLP) field recently [2, 10]. Moreover, machines impose tremendous challenges, since they strongly the increase of computational power has made it possible rely on context. In this paper we propose a task which is to run simpler algorithms over much more data [7]. How- much simpler than full interpretation of microposts: we aim ever, computer interpretation of natural language is still an to build classification systems to detect keywords that un- unsolved challenge. On the other hand, a very successful ambiguously refer to a single dominant concept, even when set of free text controlled applications have blossomed in taken out of context. For example, in the context of this the last decade: the Web search engines. The natural lan- task, apple would be classified as ambiguous whereas mi- guage processing techniques employed by these systems are crosoft would not. The contribution of this work is twofold. not enough to give users a natural speaking experience, but First, we formalize this novel classification task that can be regardless of that, users type queries in sites like Yahoo!, directly applied for extracting information from microposts. Bing and Google on a daily basis. Instead of teaching the Second, we show how high precision classifiers for this prob- machine how to speak like us, we ended up learning a simple lem can be built out of Web data and search engine logs, yet effective way of expressing our information needs. combining traditional information retrieval metrics, such as In this work we start from the premise that full under- inverted document frequency, and new ones derived from standing of natural language cannot be achieved with the search query logs. Finally, we have proposed and evaluated current technology. In fact, not even a human being is able relevant applications for these classifiers, which were able to fully understand all communication due to either lack of to meet precision ≥ 72% and recall ≥ 56% on unambigu- cultural context or to inherent ambiguity in the language. ous keyword extraction from microposts. We also compare This is especially true in the context of microposts, where those results with closely related systems, none of which both the media culture and technical constraints impose a could outperform those numbers. limit on the size of the messages. Given these limitations, we have aimed to detect parts of natural language that can Categories and Subject Descriptors be unambiguously understood in the lack of any context. The key observation that allowed us to identify those parts H.3.1 [Information Storage and Retrieval]: Content of natural language is that they tend to be used as search Analysis and Indexing—Linguistic Processing; I.2.7 [Artificial engine queries on the Web [17]. Intelligence]: Natural Language Processing—Text Analy- For example, when using a search engine, if the user wants sis to know about Michael Jackson, the American singer, dancer, and entertainer, she expects to type these two words in the- General Terms search box and get relevant results. On the other side, if Algorithms, Experimentation she is looking for Michael Jackson, the social anthropologist from New Zealand, she knows that she will need to further qualify the query. In this work we show that search engine query logs are a Permission to make digital or hard copies of all or part of this work for valuable source of data to build classifiers that can identify personal or classroom use is granted without fee provided that copies are unambiguous concepts in a language. In our work, the fea- not made or distributed for profit or commercial advantage and that copies tures for such classifiers are calculated over a crawl of the bear this notice Copyright c and theheld 2012 full citation on the first page. To copy otherwise, to by author(s)/owner(s). Web and all queries issued in evenly spread days of a large republish, Published to as postpart on servers of theor to redistributeWorkshop #MSM2012 to lists, requires prior specific proceedings, commercial search engine. Classifiers like these seem espe- permission and/or aasfee. available online CEUR Vol-838, at: http://ceur-ws.org/Vol-838 MSM ’12 Lyon,April #MSM2012, France16, 2012, Lyon, France. cially suited to process short, noisy, conversational texts, Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00. which since recently have become widely available on the 18 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · Web. We show experimental results from shares and micro- rize a given text, even if each of the individually extracted posts from both Facebook1 and Twitter2 . We also propose keywords might be ambiguous outside that set. Similarly, two different applications to the classifiers built in this pa- Named Entity Recognition systems look for entities that per. may or may not be unambiguous outside their context, such The rest of this paper is organized as follows. Section 2 as Ford However, in our problem definition, only the key- discusses existing work that is relevant to our system. In words Ford Motors, Henry Ford or Gerard Ford should be Section 3 we discuss in detail how the classifiers proposed extracted. Finally, we have no knowledge of the microposts in this paper are built and in Section 4 we present num- being analyzed, preventing us from using domain specific bers assessing their effectiveness. A discussion of potential features. applications for the system is shown in Section 5. Finally, Section 6 contains our conclusions about this work. 3. DETECTING UNAMBIGUITY The ultimate goal of this work is to develop classifiers 2. RELATED WORK that detect unambiguous keywords in a language. As far Krovetz and Croft observed that 75% of early information as the knowledge of the authors goes, this is the first work retrieval systems queries are unambiguous [8]. This obser- proposing such classifiers. In order to formally present the vation has been later corroborated by a survey from Sander- problem, we will first introduce a few relevant concepts. son [17], where the impact of word sense disambiguation in A common step previous to the processing of any corpus information retrieval systems is studied. Although we do is the segmentation of the text in the documents. This is not rely only on that observation, one of the core hypothesis a complex problem which is beyond the scope of this paper of this paper is that, to a lesser extent, this continues to be and we assume that there is a state-of-the-art segmenter true for modern search engine queries, as long as the query available3 . The output of the segmentation process is a set does not show often in the query logs with further qualifi- of keywords. One keyword can be composed by one word or cation. For example, the query Washington often needs to by a sequence of words – in the latter case we also refer to be refined as George Washington or Washington (state) or it as an n-gram or compound. even Washington, D.C. while the query Canada often does The Merriam-Webster dictionary4 defines ambiguity as not. This hypothesis is the central idea behind the metric (1) doubtful or uncertain especially from obscurity or indis- described in Session 3.4.2, which has shown to be one of the tinctness or (2) capable of being understood in two or more strongest signals described in this paper. possible senses or ways. However, there is a shortcoming in The usage of the Web as an implicit training set for Natu- this definition, since it relies on human interpretation. One ral Language Processing problems and ambiguity resolution person can say that a given word is ambiguous while an- in particular is presented in [15], where the author shows other could disagree. It turns that both could be right since that the usage of simple algorithms and features extracted the interpretation of the senses of a word can be done at from large amounts of data yield competitive results with different granularities. sophisticated unsupervised approaches and close results to Lapata and Keller [11] define ambiguity, or its comple- that of supervised state of the art approaches. ment, unambiguity, as function of a frequency of the senses Wacholder et al. studied the problem of disambiguation that a given word or compound shows in a large corpus. We of proper names in [20]. Nadeu and Sekine conducted a will instead use the terminology of the semiotics model by survey [14] on the related field of Named Entity Recogni- Ferdinand de Saussure [18], which yields a more intuitive tion and Classification (NERC). Finally, the broader area of definition for the scope of our work. Word Sense Disambiguation is discussed in depth by Nav- igli [16]. Definition 1. Let a pair (f, c) be a sign, being f the form The problem of extracting information from microposts of the sign and c the concept it represents.5 Let L be a has gained significant attention recently. In [4], Choudhury language and S the set of all signs used in that language. Let and Breslin have presented a classifier for Twitter posts able the document frequency of a sign df (f, c) in S be the number to detect players and associated micro-events in a sports of documents the sign appear in a large corpus representative match, achieving a f-measure of 87%. Using knowledge of of the language P L. We say that f is unambiguous if and only the domain, such as sports jargon and names of players, they if df (f, c)/ df (f, c0 ) > α. are able to disambiguate the Tweets. Li et al. proposed us- In other words, we say that f is unambiguous if one of ing a keyword extraction system for targeting ads to Face- the concepts it may represent is α times more frequent in book updates in [12], one of the applications we discuss in documents of the corpus than all the others combined. For Section 4. Signals based on capitalization and document fre- our purposes, f is always a word or a compound word, and quency are present in their work, but they did not explore given that restriction, we will use Definition 1 as the basis for any of the query log derived metrics. the problem being discussed through the rest of the paper. Although the problems presented by the works discussed above share similarities with ours, none of their techniques 3.1 Keyword Evaluation Methodology can be directly applied. Word Sense Disambiguation is fo- Given a keyword q, an human evaluator can use Defini- cused on finding senses of ambiguous terms in the local con- tion 1 to rate it as ambiguous or unambiguous. From this text, and does not discuss the properties of a given keyword 3 outside its context. Also, traditional keyword extraction sys- In this paper we use a segmenter developed internally at tems extract a set of keywords that characterize or summa- Google, Inc. 4 www.merriam-webster.com 1 5 www.facebook.com In the original, form and concept are called signifier and 2 www.twitter.com significant, respectively. 19 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · definition, the evaluator should look at all the web docu- more popular a keyword is, the better the signal-to-noise ments containing q to understand the sense of the keyword ratio we have on it for all metrics. Figure 1 shows the IDF in each of them. In practice, we do not need to go through distribution for unigrams and keywords found on the Web all the documents in the corpus to find whether a keyword collection. is unambiguous. Instead, we select, from the web, 16 ran- dom documents that contain q and manually check if all occurrences refer to the same concept, i.e., all positive oc- currences. If yes, we say that the keyword q is unambiguous. The choice of 16 random documents is legitimate. Since the evaluation of each random document that contains q is an experiment that has only two possible answers, we can assume it is a random sampling over a binomial distribution. Then, by using the Wilson method [21], we can calculate the binomial proportion confidence interval for a sample size of 16 with all of them positive occurrences, i.e. p̂ = 1.0, which result is [0.8, 1.0] with center at 0.9. This interval gives us the α of Definition 1, which in this case will have a lower bound of 0.8 and an expected value of 0.9 with a Figure 1: IDF distribution for the web collection. 95% confidence. In other words: Given a keyword that was human-evaluated as unambiguous, we have 95% of chance that this keyword will refer to the same dominant concept in 80% of the corpus, but more likely it will be 90% of the corpus. This is the threshold we decided to use to assume a keyword is unambiguous in our human evaluations. Using this methodology we have built a reference-set with 2634 ambiguous and 426 unambiguous keywords to be used in the analaysis of the metrics presented in the next sections and as training-set input to the Machine Learning approach at Section 3.6. 3.2 Model Generation There are two main source of signals for the unambiguous keywords classifiers presented here. The first is a sample Figure 2: IDF distribution for the reference-set. with billions of documents of Google’s search engine web collection. The second is as sample with billions of query The histograms for IDF distribution among the reference- entries from Google’s query log corpus collected in evenly set is plotted in Figure 2. The top chart shows the histogram spread days. for ambiguous keywords while the bottom shows the unam- The model generation is composed by a chain of off-line biguous. This histogram shows that ambiguous keywords calculations of statistical properties of all signs in these two tends to have lower IDF values. The lowest ones are lan- corpora and takes a few thousands of cpu-hours to complete. guage constructions such as just and this. The misspellings These properties will serve as the basis for the classification and uncommon person names lies in the very high IDF algorithms. This is an one-off work that only needs to be range. While unambiguous keywords tends to have higher redone whenever there is an opportunity and/or the need of IDF values, there is a big overlap with lots of unambiguous improving the performance of the system. ones in the mid-lower IDF range, such as White House and Disney. This overlap makes it hard for the IDF metric to 3.3 Web Collection Metrics be used to separate both sets. However, we can apply it for For every keyword resulting from the segmentation, we filtering language constructions and misspellings. compute several properties using a large MapReduce-like system [5] visiting all the documents of the Web corpus. 3.3.2 Caps First Ratio In the next sections we explain each property and give a Caps First Ratio (CFR) is the ratio that a given keyword histogram of its distribution among the keywords. Addi- shows up on the Web collection with the first letter capital- tionally to the plain histograms, we present two additional ized and we interpret it as strong indicator of names. We complementary histograms, one for the probability density implemented the techniques from [13] to detect capitalized of the metric among the ambiguous and another one for the keywords. unambiguous keywords of the reference-set defined in Sec- The CFR metric has the obvious property of detecting tion 3.1. nouns, but it has another subtle interesting characteristic. Several noun compounds include, as an extra qualifier, sub- 3.3.1 Inverse Document Frequency compounds or unigrams that are unambiguous by them- The first metric, the Inverse Document Frequency (IDF), selves, for example Justin Bieber and Bieber are both unam- is the same found in the traditional information retrieval biguous. In this case, we consider the occurrence of every literature. It is computed over the Web document collection capitalized word not only in the CFR calculation of the com- and is a proxy of the popularity of the keyword. It also serves pound it belongs to – Justin Bieber – , but also in the CFR as a good confidence indicator for all remaining metrics. The score of the sub-compounds and unigrams of that compound 20 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · – Bieber. This helps increasing the CFR score of nouns that much less often in the query stream. Because of that we can act as unambiguous qualifiers. For example, for the Bieber say it is safe to discard anything that is not popular in the unigram, using only the initials for legibility, we calculate: query stream. count(JB) + count(B) CF R(B) = count(jb) + count(b) + count(JB) + count(B) Figure 5: SIDF distribution for an infinite stream of web text. Figure 3: CFR distribution for the web collection. Figure 6: SIDF distribution for the reference-set. Figure 4: CFR distribution for the reference-set. 3.4.2 Sessions Exact Ratio A key metric that comes from the query stream analysis Figure 3 shows, the CFR distribution seen on Web docu- is the Sessions Exact Ratio (SER). It tells how often a given ments. The reference-set histograms at Figure 4, are more keyword shows up by itself in the search box. This is the heterogeneous than the IDF counterpart. The mid-low range strongest indicator that this keyword is unambiguous when of CFR values includes mostly only ambiguous keywords, taken out of context. Figures 7 and 8 shows the histogram while the unambiguous histogram has a sharp growth in the for this metric on the Web collection and the reference-set re- high values. spectively. As can be seen, the ambiguous and unambiguous 3.4 Query Log Metrics reference-set is mostly separable. Some examples of unam- biguous keywords in the very high range of the histogram Query logs have proved to be a valuable source of infor- are: Tom Hicks, Madison Square Garden and Groupon. mation for several fields of computer science [19]. In our work we collected data from three evenly spread days worth of queries in the logs of a large search engine. As with the Web corpus, we compute the following metrics for each key- word generated by the segmentation of the query log corpus. 3.4.1 Sessions Inverse Document Frequency The Sessions Inverse Document Frequency (SIDF) is anal- ogous to the Web metric of the same name, but it is calcu- lated over the search engine query stream. Each session [19] is considered as a document. Figures 5 and 6 presents the distribution of this metric for the query stream and for the reference-set respectively. This signal has similar properties to its Web counterpart, but with a bias towards concepts and against intrinsic language characteristics. By compar- Figure 7: SER distribution for an infinite stream of ing Figures 1 and 5, one can draw an interesting conclu- web text. sion: stopwords and auxiliary language constructions appear 21 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · The histograms shown here are not just a tool to help visualize how each metric may be used to dissociate both sets, but more than that, it is an evidence that the metrics used here can succeed in building an effective classifier. 3.5 A hand-crafted classifier In this section we present a simple hand-crafted algorithm. It was developed upon the discussions and histogram obser- vations of above metrics, regarding the reference-set separa- tion. We use this algorithm to expose the ideas without adding the complexity that inherently comes with tradi- tional machine learning techniques, as well as to avoid hid- Figure 8: SER distribution for the reference-set. ing the interesting properties of the data under analysis. Later in this paper we present a Support Vector Machines (SVM) approach for the same classification task. Refer to 3.4.3 Search Bias Section 3.6 for more details. The last metric, Search Bias (SB), is not directly derived from the query stream, but rather obtained through a com- Algorithm 1: The IsUnambiguous Algorithm. bination of sessions and Web signals. Search Bias can be 1 begin thought as the ratio of appearance of a keyword in the query 2 if sidf > 15 then return false; logs corpus divided by the ratio of appearance of the same 3 if uni ∧ idf > 12 ∧ sidf > 12 then return false; keyword on the Web corpus. 4 if cf r < 0.4 then return false; 5 if ser < 0.35 then return false; 6 if sb < 0.01 then return false; 7 if cf r + ser + sb < 1 then return false; 8 if charcount < 3 then return false; 9 if blacklisted then return false; 10 end Algorithm 1 presents our hand-crafted approach for the unambiguity classification problem. Each line is a filter of ambiguous keywords. In Figure 11 one can see how many keywords are being discarded as the classifier is applied on top of the Web corpus. In the end, only 3.8% of all keywords Figure 9: SB distribution for an infinite stream of occurrences are considered unambiguous. web text. The Sessions IDF funnel component corresponds to Line 2 of the algorithm. Its goal is to remove everything that is too rare, such as misspells. Usually, plain IDF is used for this type of cutting, but bigrams and larger keywords have a very high IDF on the Web corpus. In the query logs corpus, how- ever, large keywords appear much more often and anything that is not too rare will not be filtered by this rule. Figure 10: SB distribution for the reference-set. The naive calculation of this number leads to a value with bad properties due to very common words on the Web cor- pus and the high frequency of compounds in the query log corpus. To avoid those issues, search bias is calculated tak- ing into account only “noble” occurrences of a keyword in Web and query logs corpora. For the Web, we consider that Figure 11: The percentage of the input text key- only capitalized occurrences are noble, while from query logs words each line of the algorithm filters. The most we only consider those occurrences where the keyword ap- important filters are highlighted in gray. pear by itself in the query. The distribution of this metric can be seen on Figures 9 and 10. In Line 3, the Rare Unigrams filter unigrams typos that 22 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · are not rare enough to be discarded by Sessions IDF and and false-negative errors. After doing a simple grid-search also come with all types of capitalization. Since unigrams of both parameters using cross-validation, we picked a value are more frequent than compounds, we can apply a more around 0.05 for γ and 0.1 for C. restrictive threshold. The Low Caps filter comes from Line 4 of the algorithm. Not only it is responsible for restricting the classifier to 4. EXPERIMENTAL RESULTS nouns, it also rejects all the general nouns. For example, In this section we present experimental results obtained the noun ball has a very low caps first ratio, but Wilson with the implementation of the classifier described in Sec- NCAA Reaction Basketball is almost always typed all caps. tion 3.5 and the SVM described in Section 3.6. The most powerful feature for our classifier, the Sessions Exact Ratio (SER) filter, is used in Line 5. It reflects the 4.1 Test set definition for manual evaluation key intuition that our work builds upon: users know that The input of each classifier is a chunk of free form text, search engines have little context to understand their com- from now on referred to as a micropost, and the output is munication and because of that they formulate unambiguous a list of keywords assumed by the classifier to represent un- queries. ambiguous concepts from the input. We decided to use a The derived metric Search Bias is used in Line 6. Some micropost as a unit of evaluation, in opposition to a single keywords, like Unfortunately, tend to have both high CFR keyword, because we believe the results from this analysis – because they are used to start phrases – and high SER represent better the real world applications. In order to – because they have low query volume. This filter detects not favor our classifier, the rater is instructed to mark a those language constructions that are way more common in micropost as False Positive if she finds one ambiguous key- the Web corpus than in the search corpus and discards them. word classified as unambiguous, even if the classifier also The combined Caps+Exact+Bias filter in Line 7 is the correctly extracted another unambiguous keyword from the most complex part of the algorithm. Its goal is allow us to same micropost. reduce the thresholds of the individual filters applied before We selected two different sources of microposts to feed without incurring in a loss of precision. This filter will let the classifier: Twitter and Facebook. This data set and the keywords that score very high in any of the metrics com- reference-set that was used to train the SVM classifier are bined pass, but will discard those that have a low average disjoint to avoid over-fit. Twitter is a social Web site where all around. users can send and read text-messages with up to 140 char- The Character Count is a simplistic filter as can be seen acters, called tweets. We expect this kind of social Web site in Line 8. When dealing with two characters keywords, all to have mostly conversational text and noise, which carry lit- the metrics have bad properties, and we simply discard all tle information by themselves. To feed the classifiers, each of them. In fact, a perfect English classifier limited to two tweet is considered independent from each other and its text character unigrams can be manually built by inspecting all is used as input to the classifier. Facebook is a social net- the 626 possible combinations. work site where users can create virtual connections with Finally, the last step of the algorithm is the Blacklist filter, their friends. One Facebook feature is the status message in Line 9. Some keywords have very extreme metrics and updates. The status message is much like a tweet, but the tend to pass through all the filters, and we simply blacklist 140 characters limit is not imposed. Since users can follow- them. For English, we currently blacklist 40 greetings ex- up on their friends updates, the entire conversation is used pressions, such as Happy birthday and Hello and some very as input for the classifiers. common words like Indeed. In fact, the metrics for those keywords are so extreme that by looking at the top values 4.2 Methodology for manual evaluation for each metric one can easily spot them. We also black- The methodology used for manual evaluation consists of listed the Web sites names Google, Facebook, Twitter and collecting a random set of microposts from each data source Gmail because, although unambiguous, they are so common and feeding each one into the classifiers. The original micro- in our evaluation data that they would positively benefit our post and the output of the classifier are then shown to three results with no interesting characteristics. different raters. The output might be empty, in the case the classifier did not find any unambiguous keyword in the mi- 3.6 A Machine Learning approach cropost. Otherwise it contains at least one keyword, which To challenge the hand-crafted algorithm presented in the was classified as unambiguous. In case of discordance, it is previous section and test if its intuitions were correct, we discussed until consensus is reached. Regardless of the clas- employ a Support Vector Machines (SVM) algorithm to the sifier output, raters must investigate the keywords present same classification task. It is a well-known technique and in the micropost. They use the methodology presented in there is a ready to use state-of-the-art implementation, namely Section 3.1 to rate each keyword q. Based on the output libSVM [3]. The down-side of the machine learning ap- of the classifier and the inspection of the micropost content proach is that labeled data acquisition for this novel clas- carried out by the rater, each micropost is rated as below: sification task is challenging. The training set used was the True Positive (TP): There are unambiguous keywords reference-set explained before, with the 2634 ambiguous and in the micropost and the system has extracted at least one. 426 unambiguous manually classified keywords. Each met- True Negative (TN): There are no unambiguous key- ric shown in sections 3.3 and 3.4 were rescaled to the 0-1 words and the system extracted nothing. range and used as SVM input features. We used only the False Positive (FP): The system has extracted an am- Radial Basis Function kernel present in libSVM: K(xi , xj ) = biguous keyword. exp(−γ × kxi − xj k2 ), γ > 0. By tuning the γ and C pa- False Negative (FN): There are unambiguous keywords rameters we can control the trade-off between false-positive in the micropost but the system has extracted none. 23 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · We use the output of the rating phase to compute the stan- slightly better results. The precision is 80.19% (around 6% dard evaluation metrics in the Information Retrieval field, better), whereas the recall is 55.55%. Again, the True Neg- such as precision, recall, accuracy and F-score [1]. ative Rate is really high, 96.05%. The classifier has an ac- Both models – the hand-crafted and the SVM classifier – curacy of 87% with an F-Score of 0.66. The high value for were built using the context available at the English Web, the True Negative Rate is a sign that most conversations in but it must be considered that people have an incomplete Social Networks like Facebook and Twitter are not proper cultural context and sometimes it may not be obvious that for context-extraction systems such as content-targeted ad- a given keyword is unambiguous. For example, during the vertisement if used without any pre-processing. evaluation one rater could not recognize upfront the key- To compare the results of the two classifiers presented word Doug Flutie, which was extracted by the system. Even above, we also evaluated two known systems: Yahoo! Term though this rater did not recognize this keyword, Doug Flu- Extractor API – aimed to Keyword Extraction tasks – reached tie is indeed an unambiguous keyword because every person 18.98% of precision and 57.69% of recall for Facebook data, with culture about American Football will recognize him as and 14.89% of precision and 77.7% of recall for Twitter data; Douglas Richard “Doug” Flutie, a famous football quarter- and the Stanford Named Entity Recognizer [6] – aimed to back who played professionally in the United States Football Named Entity Recognition and Classification tasks – reached League and, more importantly, the name Doug Flutie is not 35% of precision and 69.99% of recall for Facebook data, and used as an identifier in any other significant context besides 39.65% of precision and 74.19% of recall for Twitter data. that. Our precise definition of unambiguity prevents this Both systems reached a higher recall, but for the real- problem, since the rater will learn the sense(s) of the key- world applications discussed in section 5 we cannot afford word when looking at the 16 randomly sampled documents, extracting a wrong keyword from a noisy text. In these ap- as it was the case in this example. plications precision is more important, and for both systems it is much lower than the two filters developed in this work. 4.3 Numerical results The high recall and low precision result is expected for these Table 1 presents the output of the raters for Facebook and systems, since they were engineered for different tasks and Twitter microposts. do not perform well for the unambiguity detection task de- fined here. Hand Algorithm SVM Twitter Facebook Twitter Facebook 5. APPLICATIONS TP 99 106 85 85 Given the properties of the classifiers presented in this TN 494 480 511 510 paper, we believe they are suited for a set of different appli- FP 38 34 22 21 cations that are becoming more important given last devel- FN 74 64 87 68 opments on the Web industry. Table 1: Break-down of metrics for Twitter and 5.1 Ad targeting in Social Networks Facebook. In Social Networks users keep updating their status mes- sages (or tweets) with what they have in mind. The number Twitter of daily updates in the most prominent networks is huge6 , turning this channel into a potential candidate for input of Following the experimental methodology we analyzed a set content-targeted advertisement systems [12]. For instance, of 705 tweets, which were randomly selected from a set it is just fine to deliver an advertisement piece like Buy tick- of 170k tweets that were crawled by a fairly random walk ets to the coming Jonas Brothers show!, right next to a mi- in the Twitter graph. We used these tweets as input of cropost where a user claims to be the biggest fan of this both classifiers and presented the output to the raters. The music group. However, the conversational text brings even hand-crafted classifier was able to reach precision of 72.26%, more complexity for the already tough task [9] of deliver- and sensitivity (recall) of 56.22%. The True Negative Rate ing content-targeted ads. Feeding these systems with noisy (TNR, or specificity) is high (92.86%), upon what one can text may lead them to return non-relevant ads. One can use conclude that most tweets do not contain unambiguous key- the classifiers proposed in this paper as a filtering layer on words. For Twitter the system reached an accuracy of 84% top of current content-targeted advertisement systems. The with an F-Score of 0.64. The SVM model reached a preci- filter would delegate calls to the ads systems only when it sion of 79.43%, i.e., a performance almost 10% better than is possible to retrieve relevant content from the microposts the achieved by the hand-crafted algorithm. The achieved being targeted. recall is 49.42%, considerably worse than the recall reached by the hand-crafted algorithm. The TNR of the SVM model 5.2 Automatic reference in Social Networks is 95.87%, and the system reached an accuracy of 85% with User profiles in Social Networks could also be classified as an F-Score of 0.61. unambiguous by using the profile name for example. When- ever a micropost has the unambiguous keywords that matches Facebook a profile name, a link to that profile could be added, instead Following a random selection strategy similar to Twitter, of just pure text. This could be done for celebrity profiles, for we collected 684 conversations that took place around Face- example, when a user posts “I just watched the last Quentin book status message updates. For this data set, the hand- Tarantino movie.”, a link to the Quentin Tarantino profile crafted system reached a precision of 75.71% and recall of could be added. 62.36%. The TNR was of 93.39%, whereas the accuracy 6 reached 85% with an F-score of 0.68. The SVM model got http://news.cnet.com/8301-13577 3-10378353-36.html 24 · #MSM2012 · 2nd Workshop on Making Sense of Microposts · 6. CONCLUSIONS [6] J. R. Finkel, T. Grenager, and C. Manning. In this paper we presented a novel classification problem Incorporating non-local information into information aimed at identifying the unambiguous keywords in a lan- extraction systems by gibbs sampling. In Proceedings guage, and formally defined it together with an evaluation of the 43rd Annual Meeting on Association for methodology. We also have presented two different algo- Computational Linguistics, ACL ’05, pages 363–370, rithms for the classification problem and the corresponding Stroudsburg, PA, USA, 2005. Association for numerical results achieved by both of them. The proposed Computational Linguistics. algorithms are built on top of traditional information re- [7] A. Halevy, P. Norvig, and F. Pereira. The trieval metrics and novel metrics based on the query log Unreasonable Effectiveness of Data. IEEE Intelligent corpus. The introduction of these metrics, Sessions IDF, Systems, 24:8–12, 2009. Sessions Exact Ratio and Search Bias, is by itself an impor- [8] R. Krovetz and W. B. Croft. Lexical Ambiguity and tant contribution. We believe these metrics will be useful in Information Retrieval. ACM Transactions on other problem domains as well. Information Systems, 10:115–141, April 1992. Our evaluation have shown that our classifiers were able [9] A. Lacerda, M. Cristo, M. A. Gonçalves, W. Fan, to meet precision ≥ 72%, recall ≥ 49%, accuracy ≥ 84% N. Ziviani, and B. Ribeiro-Neto. Learning to and F-Score ≥ 0.61, even when the input is composed by Advertise. In SIGIR ’06: Proceedings of the 29th the noisy microposts from Facebook and Twitter, two of Annual International ACM SIGIR Conference on the biggest sites in the world nowadays, outperforming two Research and Development in Information Retrieval, known systems from the traditional keyword extraction and pages 549–556, New York, NY, USA, 2006. Named Entity Recognition fields. [10] M. Lapata and F. Keller. Web-based Models for Another interesting aspect of the presented work is that it Natural Language Processing. ACM Transactions on diverges from the bag-of-words analyses that dominate the Speech and Language Processing, 2(1):3, 2005. research in the area. Instead, we have focused on directly [11] M. Lapata and F. Keller. An Information Retrieval finding the specific keyword that define a concept, avoiding Approach to Sense Ranking. In Human Language the shortcomings that come from having a representation Technology Conference of the North American Chapter that cannot be understood by a human or does not meet of the Association of Computational Linguistics, pages the expectations of other systems. This leads immediatelly 348–355, 2007. to our future work proposal of using the extracted keywords [12] Z. Li, D. Zhou, Y.-F. Juan, and J. Han. Keyword as beacons for further qualification of other keywords in the Extraction for Social Snippets. In Proceedings of the text. For example, the extracted keyword Lionel Messi can 19th ACM International Conference on World wide be used to anchor the word goal to the concept of scoring web, pages 1143–1144, New York, NY, USA, 2010. in the soccer sport, instead rather the more general idea [13] A. Mikheev. A Knowledge-free Method for Capitalized of an objective to be achieved. We expect this inside-out Word Disambiguation. In Proceedings of the 37th approach for extracting semantics in microposts to perform Annual Meeting of the Association for Computational better than traditional word collection approches. Linguistics on Computational Linguistics, pages More and more researchers have access to query logs and 159–166, Morristown, NJ, USA, 1999. many may directly benefit from the metrics proposed here [14] D. Nadeau and S. Sekine. A survey of named entity either to tackle the same classification problem or to inno- recognition and classification. Linguisticae vate in their own domains. For the industry, we have shown Investigationes, 30(1):3–26, January 2007. a solution for extracting information from microposts, a type of content that has experienced tremendous growth on the [15] P. Nakov and M. Hearst. Using the Web as an Web in the recent past. Implicit Training Set: Application to Structural Ambiguity Resolution. In HLT ’05: Proceedings of the conference on Human Language Technology and 7. REFERENCES Empirical Methods in Natural Language Processing, [1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern pages 835–842, Morristown, NJ, USA, 2005. Information Retrieval. Addison-Wesley Longman [16] R. Navigli. Word Sense Disambiguation: A Survey. Publishing Co., Inc., Boston, MA, USA, 1999. ACM Comput. Surv., 41(2):1–69, 2009. [2] M. Banko and E. Brill. Scaling to Very Very Large [17] M. Sanderson. Retrieving with Good Sense. Corpora for Natural Language Disambiguation. In Information Retrieval, 2(1):49–69, 2000. ACL ’01: Proceedings of the 39th Annual Meeting on [18] F. D. Saussure. Course in General Linguistics. Open Association for Computational Linguistics, pages Court Pub Co, 1986. 26–33, Morristown, NJ, USA, 2001. [19] C. Silverstein, H. Marais, M. Henzinger, and [3] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for M. Moricz. Analysis of a Very Large Web Search Support Vector Machines. Software available at Engine Query Log. SIGIR Forum, 33(1):6–12, 1999. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001. [20] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation [4] S. Choudhury and J. Breslin. Extracting semantic of Proper Names in Text. In Proceedings of the Fifth entities and events from sports tweets. In Making Conference on Applied Natural Language Processing, Sense of Microposts (#MSM2011), pages 22–32, 2011. pages 202–208, Morristown, NJ, USA, 1997. [5] J. Dean and S. Ghemawat. Mapreduce: Simplified [21] E. B. Wilson. Probable Inference, the Law of Data Processing on Large Clusters. In Sixth Succession, and Statistical Inference. Journal of the Symposium on Operating System Design and American Statistical Association, 22(158):pp. 209–212, Implementation, pages 137–150, 2004. 1927. 25 · #MSM2012 · 2nd Workshop on Making Sense of Microposts ·