=Paper=
{{Paper
|id=None
|storemode=property
|title=Extracting Unambiguous Keywords from Microposts using Web and Query Logs Data
|pdfUrl=https://ceur-ws.org/Vol-838/paper_04.pdf
|volume=Vol-838
|dblpUrl=https://dblp.org/rec/conf/msm/ReisGQ12
}}
==Extracting Unambiguous Keywords from Microposts using Web and Query Logs Data==
Davi de Castro Reis, Google Engineering, Belo Horizonte, Brazil (davi@google.com)
Felipe Goldstein, Google Engineering, Paris, France (felipeg@google.com)
Frederico Quintao, Google Engineering, Belo Horizonte, Brazil (quintao@google.com)

"If a lion could talk, we could not understand him." (Ludwig Wittgenstein)

Copyright © 2012 held by author(s)/owner(s). Published as part of the #MSM2012 Workshop proceedings (#MSM2012, April 16, 2012, Lyon, France), available online as CEUR Vol-838 at http://ceur-ws.org/Vol-838.
ABSTRACT
In recent years, a new type of content has become ubiquitous on the Web. These are small and noisy text snippets, created by users of social networks such as Twitter and Facebook. The full interpretation of those microposts by machines imposes tremendous challenges, since they strongly rely on context. In this paper we propose a task which is much simpler than full interpretation of microposts: we aim to build classification systems that detect keywords which unambiguously refer to a single dominant concept, even when taken out of context. For example, in the context of this task, apple would be classified as ambiguous whereas microsoft would not. The contribution of this work is twofold. First, we formalize this novel classification task, which can be directly applied to extracting information from microposts. Second, we show how high precision classifiers for this problem can be built out of Web data and search engine logs, combining traditional information retrieval metrics, such as inverse document frequency, with new ones derived from search query logs. Finally, we have proposed and evaluated relevant applications for these classifiers, which were able to meet precision ≥ 72% and recall ≥ 56% on unambiguous keyword extraction from microposts. We also compare those results with closely related systems, none of which could outperform those numbers.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing - Linguistic Processing; I.2.7 [Artificial Intelligence]: Natural Language Processing - Text Analysis

General Terms
Algorithms, Experimentation

Keywords
unambiguity, word sense disambiguation, query logs, web, advertising

1. INTRODUCTION
The availability of huge Web based corpora has recently spawned a series of important advancements in the Natural Language Processing (NLP) field [2, 10]. Moreover, the increase in computational power has made it possible to run simpler algorithms over much more data [7]. However, computer interpretation of natural language is still an unsolved challenge. On the other hand, a very successful family of applications controlled by free text has blossomed in the last decade: the Web search engines. The natural language processing techniques employed by these systems are not enough to give users a natural speaking experience but, regardless of that, users type queries into sites like Yahoo!, Bing and Google on a daily basis. Instead of teaching the machine how to speak like us, we ended up learning a simple yet effective way of expressing our information needs.

In this work we start from the premise that full understanding of natural language cannot be achieved with current technology. In fact, not even a human being is able to fully understand all communication, due either to lack of cultural context or to inherent ambiguity in the language. This is especially true in the context of microposts, where both the media culture and technical constraints impose a limit on the size of the messages. Given these limitations, we have aimed to detect the parts of natural language that can be unambiguously understood in the absence of any context. The key observation that allowed us to identify those parts of natural language is that they tend to be used as search engine queries on the Web [17].

For example, when using a search engine, if the user wants to know about Michael Jackson, the American singer, dancer, and entertainer, she expects to type these two words in the search box and get relevant results. On the other hand, if she is looking for Michael Jackson, the social anthropologist from New Zealand, she knows that she will need to further qualify the query.
In this work we show that search engine query logs are a valuable source of data for building classifiers that can identify unambiguous concepts in a language. In our work, the features for such classifiers are calculated over a crawl of the Web and over all queries issued on evenly spread days of a large commercial search engine. Classifiers like these seem especially suited to process the short, noisy, conversational texts which have recently become widely available on the Web.
We show experimental results from shares and microposts from both Facebook (www.facebook.com) and Twitter (www.twitter.com). We also propose two different applications for the classifiers built in this paper.

The rest of this paper is organized as follows. Section 2 discusses existing work that is relevant to our system. In Section 3 we discuss in detail how the proposed classifiers are built, and in Section 4 we present numbers assessing their effectiveness. A discussion of potential applications for the system is given in Section 5. Finally, Section 6 contains our conclusions about this work.

2. RELATED WORK
Krovetz and Croft observed that 75% of the queries in early information retrieval systems are unambiguous [8]. This observation was later corroborated by a survey from Sanderson [17], which studies the impact of word sense disambiguation on information retrieval systems. Although we do not rely only on that observation, one of the core hypotheses of this paper is that, to a lesser extent, this continues to be true for modern search engine queries, as long as the query does not often show up in the query logs with further qualification. For example, the query Washington often needs to be refined as George Washington, Washington (state) or even Washington, D.C., while the query Canada often does not. This hypothesis is the central idea behind the metric described in Section 3.4.2, which has shown to be one of the strongest signals described in this paper.

The usage of the Web as an implicit training set for Natural Language Processing problems, and for ambiguity resolution in particular, is presented in [15], where the authors show that simple algorithms with features extracted from large amounts of data yield results competitive with sophisticated unsupervised approaches, and close to those of supervised state-of-the-art approaches.

Wacholder et al. studied the problem of disambiguation of proper names in [20]. Nadeau and Sekine conducted a survey [14] of the related field of Named Entity Recognition and Classification (NERC). Finally, the broader area of Word Sense Disambiguation is discussed in depth by Navigli [16].

The problem of extracting information from microposts has gained significant attention recently. In [4], Choudhury and Breslin presented a classifier for Twitter posts that detects players and associated micro-events in a sports match, achieving an f-measure of 87%. Using knowledge of the domain, such as sports jargon and names of players, they are able to disambiguate the tweets. Li et al. proposed a keyword extraction system for targeting ads to Facebook updates in [12], one of the applications we discuss in Section 5. Signals based on capitalization and document frequency are present in their work, but they did not explore any of the query log derived metrics.

Although the problems addressed by the works discussed above share similarities with ours, none of their techniques can be directly applied. Word Sense Disambiguation is focused on finding the senses of ambiguous terms in the local context, and does not discuss the properties of a given keyword outside its context. Also, traditional keyword extraction systems extract a set of keywords that characterize or summarize a given text, even if each of the individually extracted keywords might be ambiguous outside that set. Similarly, Named Entity Recognition systems look for entities that may or may not be unambiguous outside their context, such as Ford; in our problem definition, only keywords such as Ford Motors, Henry Ford or Gerald Ford should be extracted. Finally, we have no knowledge of the microposts being analyzed, which prevents us from using domain specific features.

3. DETECTING UNAMBIGUITY
The ultimate goal of this work is to develop classifiers that detect unambiguous keywords in a language. As far as the authors know, this is the first work to propose such classifiers. In order to formally present the problem, we first introduce a few relevant concepts.

A common step prior to the processing of any corpus is the segmentation of the text in its documents. This is a complex problem which is beyond the scope of this paper, and we assume that a state-of-the-art segmenter is available (in this paper we use a segmenter developed internally at Google, Inc.). The output of the segmentation process is a set of keywords. One keyword can be composed of a single word or of a sequence of words; in the latter case we also refer to it as an n-gram or compound.

The Merriam-Webster dictionary (www.merriam-webster.com) defines ambiguous as (1) doubtful or uncertain especially from obscurity or indistinctness or (2) capable of being understood in two or more possible senses or ways. However, there is a shortcoming in this definition, since it relies on human interpretation. One person can say that a given word is ambiguous while another could disagree. It turns out that both could be right, since the interpretation of the senses of a word can be done at different granularities.

Lapata and Keller [11] define ambiguity, or its complement, unambiguity, as a function of the frequency of the senses that a given word or compound shows in a large corpus. We will instead use the terminology of the semiotics model of Ferdinand de Saussure [18], which yields a more intuitive definition for the scope of our work.

Definition 1. Let a pair (f, c) be a sign, where f is the form of the sign and c the concept it represents (in the original, form and concept are called signifier and signified, respectively). Let L be a language and S the set of all signs used in that language. Let the document frequency df(f, c) of a sign in S be the number of documents in which the sign appears in a large corpus representative of the language L. We say that f is unambiguous if and only if

    df(f, c) / Σ_c' df(f, c') > α.

In other words, we say that f is unambiguous if one of the concepts it may represent is α times more frequent in documents of the corpus than all the others combined. For our purposes, f is always a word or a compound word and, given that restriction, we will use Definition 1 as the basis for the problem discussed through the rest of the paper.
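Stated programmatically, Definition 1 is a simple dominance test over per-concept document frequencies. The following minimal sketch assumes sense-tagged document counts are already given (in the paper these come from human evaluation, not from an automatic system), and the toy counts are purely illustrative:

    def is_form_unambiguous(sense_doc_freqs, alpha=0.9):
        """Definition 1: a form f is unambiguous iff its dominant concept
        accounts for more than an alpha fraction of all df(f, c) mass.

        sense_doc_freqs: mapping concept -> document frequency df(f, c).
        alpha: dominance threshold (Section 3.1 derives a value near 0.9).
        """
        total = sum(sense_doc_freqs.values())
        if total == 0:
            return False
        return max(sense_doc_freqs.values()) / total > alpha

    # Toy counts: "apple" splits across concepts, "microsoft" does not.
    print(is_form_unambiguous({"fruit": 500, "company": 450}))   # False
    print(is_form_unambiguous({"company": 980, "surname": 10}))  # True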
3.1 Keyword Evaluation Methodology
Given a keyword q, a human evaluator can use Definition 1 to rate it as ambiguous or unambiguous. From this definition, the evaluator should look at all the web documents containing q to understand the sense of the keyword in each of them. In practice, we do not need to go through all the documents in the corpus to find out whether a keyword is unambiguous. Instead, we select, from the web, 16 random documents that contain q and manually check whether all occurrences refer to the same concept, i.e., whether all occurrences are positive. If so, we say that the keyword q is unambiguous.
The choice of 16 random documents is legitimate. Since the evaluation of each random document that contains q is an experiment with only two possible answers, we can treat it as random sampling from a binomial distribution. Then, by using the Wilson method [21], we can calculate the binomial proportion confidence interval for a sample size of 16 with all occurrences positive, i.e. p̂ = 1.0, which yields [0.8, 1.0] with center at 0.9. This interval gives us the α of Definition 1, which in this case has a lower bound of 0.8 and an expected value of 0.9 at 95% confidence. In other words: given a keyword that was human-evaluated as unambiguous, there is a 95% chance that this keyword refers to the same dominant concept in at least 80% of the corpus, and more likely in about 90% of it. This is the threshold we decided to use to assume a keyword is unambiguous in our human evaluations.
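These numbers are easy to verify with the standard Wilson score interval; a minimal check for n = 16 and p̂ = 1.0 (the formula below is the textbook score-interval form of the method cited from Wilson [21]):

    import math

    def wilson_interval(successes, n, z=1.96):
        """Wilson score interval for a binomial proportion (95% when z=1.96)."""
        p_hat = successes / n
        denom = 1 + z * z / n
        center = (p_hat + z * z / (2 * n)) / denom
        margin = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                         + z * z / (4 * n * n))
        return center - margin, center, center + margin

    lo, center, hi = wilson_interval(16, 16)
    print(round(lo, 2), round(center, 2), round(min(hi, 1.0), 2))  # 0.81 0.9 1.0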
Using this methodology we have built a reference-set with 2634 ambiguous and 426 unambiguous keywords, to be used in the analysis of the metrics presented in the next sections and as training-set input to the Machine Learning approach of Section 3.6.
3.2 Model Generation
There are two main sources of signals for the unambiguous keyword classifiers presented here. The first is a sample with billions of documents from Google's search engine web collection. The second is a sample with billions of query entries from Google's query log corpus, collected on evenly spread days.

The model generation is composed of a chain of off-line calculations of statistical properties of all signs in these two corpora, and it takes a few thousand CPU-hours to complete. These properties serve as the basis for the classification algorithms. This is a one-off work that only needs to be redone whenever there is an opportunity and/or a need to improve the performance of the system.

3.3 Web Collection Metrics
For every keyword resulting from the segmentation, we compute several properties using a large MapReduce-like system [5] visiting all the documents of the Web corpus. In the next sections we explain each property and give a histogram of its distribution among the keywords. In addition to the plain histograms, we present two complementary histograms, one for the probability density of the metric among the ambiguous keywords and another for the unambiguous keywords of the reference-set defined in Section 3.1.

3.3.1 Inverse Document Frequency
The first metric, the Inverse Document Frequency (IDF), is the same found in the traditional information retrieval literature. It is computed over the Web document collection and is a proxy for the popularity of the keyword. It also serves as a good confidence indicator for all remaining metrics: the more popular a keyword is, the better the signal-to-noise ratio we have on it for all metrics. Figure 1 shows the IDF distribution for unigrams and keywords found on the Web collection.

Figure 1: IDF distribution for the web collection.

Figure 2: IDF distribution for the reference-set.

The histograms of the IDF distribution over the reference-set are plotted in Figure 2. The top chart shows the histogram for ambiguous keywords, while the bottom one shows the unambiguous. These histograms show that ambiguous keywords tend to have lower IDF values. The lowest ones are language constructions such as just and this. Misspellings and uncommon person names lie in the very high IDF range. While unambiguous keywords tend to have higher IDF values, there is a big overlap, with many unambiguous keywords in the mid-lower IDF range, such as White House and Disney. This overlap makes it hard to use the IDF metric to separate the two sets. However, we can apply it to filter out language constructions and misspellings.
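For concreteness, the document-frequency statistics behind IDF reduce to a per-keyword document count; the toy sketch below mimics, on a single machine, what the paper computes with a MapReduce-like system over billions of documents (the whitespace segmenter here is a stand-in for the proprietary one):

    import math
    from collections import Counter

    def compute_idf(documents, segment):
        """Toy IDF: log of corpus size over document frequency.

        documents: list of raw texts.
        segment: callable returning the keywords of a text.
        """
        df = Counter()
        for doc in documents:
            for kw in set(segment(doc)):  # count each keyword once per document
                df[kw] += 1
        n = len(documents)
        return {kw: math.log(n / count) for kw, count in df.items()}

    idf = compute_idf(["the white house", "the cat", "the white house tours"],
                      segment=str.split)
    print(sorted(idf.items(), key=lambda kv: kv[1]))  # "the" gets the lowest IDF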
3.3.2 Caps First Ratio
The Caps First Ratio (CFR) is the ratio at which a given keyword shows up on the Web collection with its first letter capitalized, and we interpret it as a strong indicator of names. We implemented the techniques from [13] to detect capitalized keywords.

The CFR metric has the obvious property of detecting nouns, but it has another subtle and interesting characteristic. Several noun compounds include, as an extra qualifier, sub-compounds or unigrams that are unambiguous by themselves; for example, Justin Bieber and Bieber are both unambiguous. In this case, we consider the occurrence of every capitalized word not only in the CFR calculation of the compound it belongs to (Justin Bieber) but also in the CFR scores of the sub-compounds and unigrams of that compound (Bieber).
This helps increase the CFR score of nouns that act as unambiguous qualifiers. For example, for the Bieber unigram, using only the initials for legibility, we calculate:

    CFR(B) = (count(JB) + count(B)) / (count(jb) + count(b) + count(JB) + count(B))

Figure 3: CFR distribution for the web collection.

Figure 4: CFR distribution for the reference-set.

Figure 3 shows the CFR distribution seen on Web documents. The reference-set histograms in Figure 4 are more heterogeneous than their IDF counterparts. The mid-low range of CFR values includes mostly ambiguous keywords, while the unambiguous histogram shows a sharp growth in the high values.
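The counting rule above, including the credit that capitalized compounds give to their sub-units, can be sketched as follows (a toy illustration only; in the paper, capitalization detection follows [13] rather than the naive first-letter test used here):

    from collections import Counter

    def cfr_scores(occurrences):
        """Toy Caps First Ratio with sub-compound credit.

        occurrences: keyword occurrences as they appear in text.
        A capitalized occurrence of a compound (e.g. "Justin Bieber")
        also counts toward the CFR of its sub-units (e.g. "Bieber").
        """
        caps, total = Counter(), Counter()
        for occ in occurrences:
            words = occ.split()
            units = [occ] + words if len(words) > 1 else [occ]
            for unit in units:
                key = unit.lower()
                total[key] += 1
                if unit[0].isupper():  # naive capitalization test
                    caps[key] += 1
        return {k: caps[k] / total[k] for k in total}

    print(cfr_scores(["Justin Bieber", "Bieber", "bieber", "ball", "Ball"]))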
3.4 Query Log Metrics
Query logs have proved to be a valuable source of information for several fields of computer science [19]. In our work we collected three evenly spread days' worth of queries from the logs of a large search engine. As with the Web corpus, we compute the following metrics for each keyword generated by the segmentation of the query log corpus.

3.4.1 Sessions Inverse Document Frequency
The Sessions Inverse Document Frequency (SIDF) is analogous to the Web metric of the same name, but it is calculated over the search engine query stream. Each session [19] is considered a document. Figures 5 and 6 present the distribution of this metric for the query stream and for the reference-set, respectively. This signal has properties similar to its Web counterpart, but with a bias towards concepts and against intrinsic language characteristics. By comparing Figures 1 and 5, one can draw an interesting conclusion: stopwords and auxiliary language constructions appear much less often in the query stream. Because of that, we can say it is safe to discard anything that is not popular in the query stream.

Figure 5: SIDF distribution for an infinite stream of web text.

Figure 6: SIDF distribution for the reference-set.

3.4.2 Sessions Exact Ratio
A key metric that comes from the query stream analysis is the Sessions Exact Ratio (SER). It tells how often a given keyword shows up by itself in the search box. This is the strongest indicator that a keyword is unambiguous when taken out of context. Figures 7 and 8 show the histogram for this metric on the Web collection and the reference-set, respectively. As can be seen, the ambiguous and unambiguous reference-sets are mostly separable. Some examples of unambiguous keywords in the very high range of the histogram are: Tom Hicks, Madison Square Garden and Groupon.

Figure 7: SER distribution for an infinite stream of web text.
Figure 8: SER distribution for the reference-set.
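A toy version of the SER computation just described (illustration only; it uses naive substring containment, whereas a production system would match segmented keywords):

    def sessions_exact_ratio(keyword, queries):
        """Toy SER: among queries containing the keyword, the fraction
        where the keyword is the entire query."""
        kw = keyword.lower()
        exact = containing = 0
        for q in queries:
            q = q.lower().strip()
            if kw in q:  # naive containment, for illustration
                containing += 1
                if q == kw:
                    exact += 1
        return exact / containing if containing else 0.0

    queries = ["groupon", "groupon coupons", "madison square garden"]
    print(sessions_exact_ratio("groupon", queries))  # 0.5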
3.4.3 Search Bias
The last metric, Search Bias (SB), is not directly derived from the query stream, but is rather obtained through a combination of sessions and Web signals. Search Bias can be thought of as the rate of appearance of a keyword in the query logs corpus divided by the rate of appearance of the same keyword on the Web corpus.

The naive calculation of this number leads to a value with bad properties, due to very common words in the Web corpus and the high frequency of compounds in the query log corpus. To avoid those issues, Search Bias is calculated taking into account only "noble" occurrences of a keyword in the Web and query logs corpora. For the Web, we consider only capitalized occurrences noble, while from the query logs we only consider those occurrences where the keyword appears by itself in the query. The distribution of this metric can be seen in Figures 9 and 10.

Figure 9: SB distribution for an infinite stream of web text.

Figure 10: SB distribution for the reference-set.
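A minimal sketch of the noble-occurrence calculation (the counts are hypothetical and only illustrate the shape of the formula):

    def search_bias(kw_exact_queries, total_queries, kw_caps_web, total_web_docs):
        """Toy Search Bias: the keyword's rate of noble occurrences in the
        query logs divided by its rate of noble occurrences on the Web.

        Noble means: the keyword alone as a whole query (logs side) and
        capitalized occurrences (Web side), as described above."""
        query_rate = kw_exact_queries / total_queries
        web_rate = kw_caps_web / total_web_docs
        return query_rate / web_rate if web_rate else 0.0

    # Illustrative counts only.
    print(search_bias(kw_exact_queries=1_000, total_queries=10_000_000,
                      kw_caps_web=200_000, total_web_docs=1_000_000_000))  # 0.5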
The histograms shown here are not just a tool to help visualize how each metric may be used to dissociate the two sets; more than that, they are evidence that the metrics used here can succeed in building an effective classifier.

3.5 A hand-crafted classifier
In this section we present a simple hand-crafted algorithm. It was developed upon the discussions and histogram observations of the above metrics, regarding the reference-set separation. We use this algorithm to expose the ideas without adding the complexity that inherently comes with traditional machine learning techniques, as well as to avoid hiding the interesting properties of the data under analysis. Later in this paper we present a Support Vector Machines (SVM) approach for the same classification task; refer to Section 3.6 for more details.

Algorithm 1: The IsUnambiguous Algorithm.
 1 begin
 2   if sidf > 15 then return false;
 3   if uni ∧ idf > 12 ∧ sidf > 12 then return false;
 4   if cfr < 0.4 then return false;
 5   if ser < 0.35 then return false;
 6   if sb < 0.01 then return false;
 7   if cfr + ser + sb < 1 then return false;
 8   if charcount < 3 then return false;
 9   if blacklisted then return false;
10   return true;
11 end
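For readers who prefer an executable form, Algorithm 1 translates directly into the sketch below. The thresholds are the ones in the pseudocode; the metric values themselves would come from the precomputed model of Section 3.2, and the container class and blacklist contents are hypothetical stand-ins based on the description in this section:

    from dataclasses import dataclass

    @dataclass
    class KeywordMetrics:
        """Hypothetical container for the precomputed metrics of one keyword."""
        text: str
        idf: float
        sidf: float
        cfr: float
        ser: float
        sb: float
        is_unigram: bool

    # The small hand-made list described below (greetings, common words,
    # and the over-represented site names).
    BLACKLIST = {"happy birthday", "hello", "indeed",
                 "google", "facebook", "twitter", "gmail"}

    def is_unambiguous(kw: KeywordMetrics) -> bool:
        """Direct translation of Algorithm 1, one filter per line."""
        if kw.sidf > 15:                                    # Line 2: too rare in sessions
            return False
        if kw.is_unigram and kw.idf > 12 and kw.sidf > 12:  # Line 3: rare unigrams (typos)
            return False
        if kw.cfr < 0.4:                                    # Line 4: low caps, not a name
            return False
        if kw.ser < 0.35:                                   # Line 5: rarely a whole query
            return False
        if kw.sb < 0.01:                                    # Line 6: Web-heavy construction
            return False
        if kw.cfr + kw.ser + kw.sb < 1:                     # Line 7: weak combined evidence
            return False
        if len(kw.text) < 3:                                # Line 8: character count
            return False
        if kw.text.lower() in BLACKLIST:                    # Line 9: blacklist
            return False
        return True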
Algorithm 1 presents our hand-crafted approach to the unambiguity classification problem. Each line is a filter of ambiguous keywords. In Figure 11 one can see how many keywords are discarded as the classifier is applied on top of the Web corpus. In the end, only 3.8% of all keyword occurrences are considered unambiguous.

Figure 11: The percentage of the input text keywords each line of the algorithm filters. The most important filters are highlighted in gray.

The Sessions IDF funnel component corresponds to Line 2 of the algorithm. Its goal is to remove everything that is too rare, such as misspellings. Usually, plain IDF is used for this type of cut, but bigrams and larger keywords have a very high IDF on the Web corpus. In the query logs corpus, however, large keywords appear much more often, and anything that is not too rare will not be filtered by this rule.

In Line 3, the Rare Unigrams filter removes unigram typos that
are not rare enough to be discarded by the Sessions IDF filter and that also come with all types of capitalization. Since unigrams are more frequent than compounds, we can apply a more restrictive threshold.

The Low Caps filter comes from Line 4 of the algorithm. Not only is it responsible for restricting the classifier to nouns, it also rejects all the general nouns. For example, the noun ball has a very low caps first ratio, but Wilson NCAA Reaction Basketball is almost always typed capitalized.

The most powerful feature of our classifier, the Sessions Exact Ratio (SER) filter, is used in Line 5. It reflects the key intuition our work builds upon: users know that search engines have little context to understand their communication and, because of that, they formulate unambiguous queries.

The derived metric Search Bias is used in Line 6. Some keywords, like Unfortunately, tend to have both a high CFR – because they are used to start phrases – and a high SER – because they have low query volume. This filter detects those language constructions that are far more common in the Web corpus than in the search corpus and discards them.

The combined Caps+Exact+Bias filter in Line 7 is the most complex part of the algorithm. Its goal is to allow us to reduce the thresholds of the individual filters applied before it without incurring a loss of precision. This filter lets through keywords that score very high in any of the combined metrics, but discards those that have a low average all around.

The Character Count in Line 8 is a simplistic filter. When dealing with two-character keywords, all the metrics have bad properties, and we simply discard all of them. In fact, a perfect English classifier limited to two-character unigrams could be built manually by inspecting all 676 possible combinations.

Finally, the last step of the algorithm is the Blacklist filter, in Line 9. Some keywords have very extreme metrics and tend to pass through all the filters, and we simply blacklist them. For English, we currently blacklist 40 greeting expressions, such as Happy birthday and Hello, and some very common words like Indeed. In fact, the metrics for those keywords are so extreme that by looking at the top values for each metric one can easily spot them. We also blacklisted the Web site names Google, Facebook, Twitter and Gmail because, although unambiguous, they are so common in our evaluation data that they would positively benefit our results while adding no interesting characteristics.

3.6 A Machine Learning approach
To challenge the hand-crafted algorithm presented in the previous section and to test whether its intuitions were correct, we applied a Support Vector Machines (SVM) algorithm to the same classification task. It is a well-known technique with a ready-to-use state-of-the-art implementation, namely libSVM [3]. The downside of the machine learning approach is that labeled data acquisition for this novel classification task is challenging. The training set used was the reference-set explained before, with its 2634 ambiguous and 426 unambiguous manually classified keywords. Each metric shown in Sections 3.3 and 3.4 was rescaled to the 0-1 range and used as an SVM input feature. We used only the Radial Basis Function kernel present in libSVM, K(x_i, x_j) = exp(−γ ||x_i − x_j||²), γ > 0. By tuning the γ and C parameters we can control the trade-off between false-positive and false-negative errors. After a simple grid-search over both parameters using cross-validation, we picked values around 0.05 for γ and 0.1 for C.
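A minimal reproduction of this setup (a sketch only; it uses scikit-learn's libSVM binding rather than the libSVM command-line tools, and random placeholder data in place of the proprietary reference-set features):

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # X: one row per reference-set keyword with the metrics of Secs. 3.3-3.4
    # (idf, sidf, cfr, ser, sb); y: 1 = unambiguous, 0 = ambiguous.
    X = np.random.rand(3060, 5)           # placeholder for the real features
    y = np.random.randint(0, 2, 3060)     # placeholder labels

    X_scaled = MinMaxScaler().fit_transform(X)  # rescale each metric to [0, 1]

    # RBF kernel with a simple cross-validated grid-search over gamma and C;
    # the paper reports picking gamma ~ 0.05 and C ~ 0.1.
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"gamma": [0.01, 0.05, 0.1, 0.5],
                         "C": [0.1, 1.0, 10.0]},
                        cv=5)
    grid.fit(X_scaled, y)
    print(grid.best_params_)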
4. EXPERIMENTAL RESULTS
In this section we present experimental results obtained with the implementation of the classifier described in Section 3.5 and the SVM described in Section 3.6.

4.1 Test set definition for manual evaluation
The input of each classifier is a chunk of free-form text, from now on referred to as a micropost, and the output is a list of keywords assumed by the classifier to represent unambiguous concepts from the input. We decided to use a micropost as the unit of evaluation, as opposed to a single keyword, because we believe the results of this analysis better represent the real-world applications. In order not to favor our classifier, the rater is instructed to mark a micropost as False Positive if she finds one ambiguous keyword classified as unambiguous, even if the classifier also correctly extracted another unambiguous keyword from the same micropost.

We selected two different sources of microposts to feed the classifier: Twitter and Facebook. This data set and the reference-set that was used to train the SVM classifier are disjoint, to avoid over-fitting. Twitter is a social Web site where users can send and read text messages of up to 140 characters, called tweets. We expect this kind of social Web site to contain mostly conversational text and noise, which carry little information by themselves. To feed the classifiers, each tweet is considered independently of the others and its text is used as input to the classifier. Facebook is a social network site where users can create virtual connections with their friends. One Facebook feature is the status message update. The status message is much like a tweet, but the 140-character limit is not imposed. Since users can follow up on their friends' updates, the entire conversation is used as input for the classifiers.

4.2 Methodology for manual evaluation
The methodology used for manual evaluation consists of collecting a random set of microposts from each data source and feeding each one into the classifiers. The original micropost and the output of the classifier are then shown to three different raters. The output might be empty, in case the classifier did not find any unambiguous keyword in the micropost; otherwise it contains at least one keyword that was classified as unambiguous. In case of disagreement, the raters discuss until consensus is reached. Regardless of the classifier output, raters must investigate the keywords present in the micropost. They use the methodology presented in Section 3.1 to rate each keyword q. Based on the output of the classifier and the inspection of the micropost content carried out by the rater, each micropost is rated as below:

True Positive (TP): There are unambiguous keywords in the micropost and the system has extracted at least one.
True Negative (TN): There are no unambiguous keywords and the system extracted nothing.
False Positive (FP): The system has extracted an ambiguous keyword.
False Negative (FN): There are unambiguous keywords in the micropost but the system has extracted none.
We use the output of the rating phase to compute the standard evaluation metrics of the Information Retrieval field, such as precision, recall, accuracy and F-score [1].

Both models – the hand-crafted and the SVM classifier – were built using the context available on the English Web, but it must be considered that people have an incomplete cultural context and sometimes it may not be obvious that a given keyword is unambiguous. For example, during the evaluation one rater could not recognize upfront the keyword Doug Flutie, which was extracted by the system. Even though this rater did not recognize this keyword, Doug Flutie is indeed an unambiguous keyword, because anyone familiar with American Football will recognize him as Douglas Richard "Doug" Flutie, a famous football quarterback who played professionally in the United States Football League and, more importantly, the name Doug Flutie is not used as an identifier in any other significant context. Our precise definition of unambiguity prevents this problem, since the rater will learn the sense(s) of the keyword when looking at the 16 randomly sampled documents, as was the case in this example.

4.3 Numerical results
Table 1 presents the output of the raters for Facebook and Twitter microposts.
           Hand Algorithm        SVM
         Twitter  Facebook  Twitter  Facebook
    TP       99       106       85        85
    TN      494       480      511       510
    FP       38        34       22        21
    FN       74        64       87        68

Table 1: Break-down of metrics for Twitter and Facebook.
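The reported percentages follow from these counts by the standard definitions; as a quick sketch, applied here to the hand-crafted algorithm's Twitter column:

    def eval_metrics(tp, tn, fp, fn):
        """Standard IR evaluation metrics from the rating counts."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        tnr = tn / (tn + fp)                        # specificity
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        f_score = 2 * precision * recall / (precision + recall)
        return precision, recall, tnr, accuracy, f_score

    # Hand-crafted algorithm on Twitter (Table 1): 705 rated tweets.
    p, r, tnr, acc, f1 = eval_metrics(tp=99, tn=494, fp=38, fn=74)
    print(f"P={p:.2%} R={r:.2%} TNR={tnr:.2%} Acc={acc:.0%} F={f1:.2f}")
    # -> P=72.26% R=57.23% TNR=92.86% Acc=84% F=0.64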
Twitter
Following the experimental methodology, we analyzed a set of 705 tweets, randomly selected from a set of 170k tweets that were crawled by a fairly random walk in the Twitter graph. We used these tweets as input to both classifiers and presented the output to the raters. The hand-crafted classifier was able to reach a precision of 72.26% and a sensitivity (recall) of 56.22%. The True Negative Rate (TNR, or specificity) is high (92.86%), from which one can conclude that most tweets do not contain unambiguous keywords. For Twitter the system reached an accuracy of 84% with an F-Score of 0.64. The SVM model reached a precision of 79.43%, i.e., almost 10% better than that achieved by the hand-crafted algorithm. The achieved recall is 49.42%, considerably worse than the recall reached by the hand-crafted algorithm. The TNR of the SVM model is 95.87%, and the system reached an accuracy of 85% with an F-Score of 0.61.

Facebook
Following a random selection strategy similar to Twitter's, we collected 684 conversations that took place around Facebook status message updates. For this data set, the hand-crafted system reached a precision of 75.71% and a recall of 62.36%. The TNR was 93.39%, whereas the accuracy reached 85% with an F-score of 0.68. The SVM model got slightly better results. The precision is 80.19% (around 6% better), whereas the recall is 55.55%. Again, the True Negative Rate is really high: 96.05%. The classifier has an accuracy of 87% with an F-Score of 0.66. The high value of the True Negative Rate is a sign that most conversations in Social Networks like Facebook and Twitter, if used without any pre-processing, are not suitable for context-extraction systems such as content-targeted advertisement.

To compare against the two classifiers presented above, we also evaluated two known systems. The Yahoo! Term Extractor API, aimed at Keyword Extraction tasks, reached 18.98% precision and 57.69% recall on Facebook data, and 14.89% precision and 77.7% recall on Twitter data. The Stanford Named Entity Recognizer [6], aimed at Named Entity Recognition and Classification tasks, reached 35% precision and 69.99% recall on Facebook data, and 39.65% precision and 74.19% recall on Twitter data.

Both systems reached a higher recall, but for the real-world applications discussed in Section 5 we cannot afford to extract a wrong keyword from a noisy text. In these applications precision is more important, and for both systems it is much lower than that of the two filters developed in this work. The high recall and low precision results are expected for these systems, since they were engineered for different tasks and do not perform well on the unambiguity detection task defined here.

5. APPLICATIONS
Given the properties of the classifiers presented in this paper, we believe they are suited to a set of different applications that are becoming more important given recent developments in the Web industry.

5.1 Ad targeting in Social Networks
In Social Networks, users keep updating their status messages (or tweets) with what they have in mind. The number of daily updates in the most prominent networks is huge (see http://news.cnet.com/8301-13577_3-10378353-36.html), turning this channel into a potential candidate input for content-targeted advertisement systems [12]. For instance, it is just fine to deliver an advertisement piece like Buy tickets to the coming Jonas Brothers show! right next to a micropost where a user claims to be the biggest fan of this music group. However, conversational text brings even more complexity to the already tough task [9] of delivering content-targeted ads. Feeding these systems with noisy text may lead them to return non-relevant ads. One can use the classifiers proposed in this paper as a filtering layer on top of current content-targeted advertisement systems. The filter would delegate calls to the ads systems only when it is possible to retrieve relevant content from the microposts being targeted.

5.2 Automatic reference in Social Networks
User profiles in Social Networks could also be classified as unambiguous, by using the profile name for example. Whenever a micropost has an unambiguous keyword that matches a profile name, a link to that profile could be added, instead of just pure text. This could be done for celebrity profiles: for example, when a user posts "I just watched the last Quentin Tarantino movie.", a link to the Quentin Tarantino profile could be added.
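A sketch of this auto-linking idea (all helper names here are hypothetical; the predicate stands in for the classifier of Section 3.5, the segmenter for the one of Section 3, and the profile URL is illustrative):

    def link_profiles(micropost, profiles, segment, is_unambiguous):
        """Replace unambiguous keywords that match a profile name with links.

        profiles: mapping of lowercase profile name -> profile URL.
        """
        text = micropost
        for kw in segment(micropost):
            url = profiles.get(kw.lower())
            if url and is_unambiguous(kw):
                text = text.replace(kw, f"[{kw}]({url})")
        return text

    # Illustrative use, assuming a trivial segmenter and classifier:
    profiles = {"quentin tarantino": "https://example.com/qtarantino"}
    print(link_profiles("I just watched the last Quentin Tarantino movie.",
                        profiles,
                        segment=lambda t: ["Quentin Tarantino"],
                        is_unambiguous=lambda kw: True))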
6. CONCLUSIONS
In this paper we presented a novel classification problem, aimed at identifying the unambiguous keywords in a language, and formally defined it together with an evaluation methodology. We also presented two different algorithms for the classification problem and the corresponding numerical results achieved by both of them. The proposed algorithms are built on top of traditional information retrieval metrics and of novel metrics based on the query log corpus. The introduction of these metrics, Sessions IDF, Sessions Exact Ratio and Search Bias, is by itself an important contribution. We believe these metrics will be useful in other problem domains as well.

Our evaluation has shown that our classifiers were able to meet precision ≥ 72%, recall ≥ 49%, accuracy ≥ 84% and F-Score ≥ 0.61, even when the input is composed of the noisy microposts of Facebook and Twitter, two of the biggest sites in the world nowadays, outperforming two known systems from the traditional keyword extraction and Named Entity Recognition fields.

Another interesting aspect of the presented work is that it diverges from the bag-of-words analyses that dominate the research in the area. Instead, we have focused on directly finding the specific keywords that define a concept, avoiding the shortcomings that come from having a representation that cannot be understood by a human or does not meet the expectations of other systems. This leads immediately to our future work proposal of using the extracted keywords as beacons for the further qualification of other keywords in the text. For example, the extracted keyword Lionel Messi can be used to anchor the word goal to the concept of scoring in soccer, rather than the more general idea of an objective to be achieved. We expect this inside-out approach to extracting semantics from microposts to perform better than traditional word collection approaches.

More and more researchers have access to query logs, and many may directly benefit from the metrics proposed here, either to tackle the same classification problem or to innovate in their own domains. For the industry, we have shown a solution for extracting information from microposts, a type of content that has experienced tremendous growth on the Web in the recent past.

7. REFERENCES
[1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[2] M. Banko and E. Brill. Scaling to Very Very Large Corpora for Natural Language Disambiguation. In ACL '01: Proceedings of the 39th Annual Meeting on Association for Computational Linguistics, pages 26-33, Morristown, NJ, USA, 2001.
[3] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[4] S. Choudhury and J. Breslin. Extracting Semantic Entities and Events from Sports Tweets. In Making Sense of Microposts (#MSM2011), pages 22-32, 2011.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In Sixth Symposium on Operating System Design and Implementation, pages 137-150, 2004.
[6] J. R. Finkel, T. Grenager, and C. Manning. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 363-370, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
[7] A. Halevy, P. Norvig, and F. Pereira. The Unreasonable Effectiveness of Data. IEEE Intelligent Systems, 24:8-12, 2009.
[8] R. Krovetz and W. B. Croft. Lexical Ambiguity and Information Retrieval. ACM Transactions on Information Systems, 10:115-141, April 1992.
[9] A. Lacerda, M. Cristo, M. A. Gonçalves, W. Fan, N. Ziviani, and B. Ribeiro-Neto. Learning to Advertise. In SIGIR '06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 549-556, New York, NY, USA, 2006.
[10] M. Lapata and F. Keller. Web-based Models for Natural Language Processing. ACM Transactions on Speech and Language Processing, 2(1):3, 2005.
[11] M. Lapata and F. Keller. An Information Retrieval Approach to Sense Ranking. In Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, pages 348-355, 2007.
[12] Z. Li, D. Zhou, Y.-F. Juan, and J. Han. Keyword Extraction for Social Snippets. In Proceedings of the 19th ACM International Conference on World Wide Web, pages 1143-1144, New York, NY, USA, 2010.
[13] A. Mikheev. A Knowledge-free Method for Capitalized Word Disambiguation. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pages 159-166, Morristown, NJ, USA, 1999.
[14] D. Nadeau and S. Sekine. A Survey of Named Entity Recognition and Classification. Lingvisticae Investigationes, 30(1):3-26, January 2007.
[15] P. Nakov and M. Hearst. Using the Web as an Implicit Training Set: Application to Structural Ambiguity Resolution. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 835-842, Morristown, NJ, USA, 2005.
[16] R. Navigli. Word Sense Disambiguation: A Survey. ACM Computing Surveys, 41(2):1-69, 2009.
[17] M. Sanderson. Retrieving with Good Sense. Information Retrieval, 2(1):49-69, 2000.
[18] F. de Saussure. Course in General Linguistics. Open Court Pub Co, 1986.
[19] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a Very Large Web Search Engine Query Log. SIGIR Forum, 33(1):6-12, 1999.
[20] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation of Proper Names in Text. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 202-208, Morristown, NJ, USA, 1997.
[21] E. B. Wilson. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158):209-212, 1927.