Extracting Unambiguous Keywords from
                         Microposts Using Web and Query Logs Data

                     Davi de Castro Reis                                  Felipe Goldstein                       Frederico Quintao
                        Google Engineering                                Google Engineering                      Google Engineering
                       Belo Horizonte, Brazil                               Paris, France                        Belo Horizonte, Brazil
                       davi@google.com                                felipeg@google.com                       quintao@google.com


           If a lion could talk, we could not understand him.                              Keywords
           (Ludwig Wittgenstein)                                                           unambiguity, word sense disambiguation, query logs, web,
                                                                                           advertising
    ABSTRACT
    In the recent years, a new form of content type has become                             1.   INTRODUCTION
    ubiquitous in the web. These are small and noisy text snip-                               The availability of huge Web based corpora has spawned
    pets, created by users of social networks such as Twitter and                          a series of important advancements in the Natural Lan-
    Facebook. The full interpretation of those microposts by                               guage Processing (NLP) field recently [2, 10]. Moreover,
    machines impose tremendous challenges, since they strongly                             the increase of computational power has made it possible
    rely on context. In this paper we propose a task which is                              to run simpler algorithms over much more data [7]. How-
    much simpler than full interpretation of microposts: we aim                            ever, computer interpretation of natural language is still an
    to build classification systems to detect keywords that un-                            unsolved challenge. On the other hand, a very successful
    ambiguously refer to a single dominant concept, even when                              set of free text controlled applications have blossomed in
    taken out of context. For example, in the context of this                              the last decade: the Web search engines. The natural lan-
    task, apple would be classified as ambiguous whereas mi-                               guage processing techniques employed by these systems are
    crosoft would not. The contribution of this work is twofold.                           not enough to give users a natural speaking experience, but
    First, we formalize this novel classification task that can be                         regardless of that, users type queries in sites like Yahoo!,
    directly applied for extracting information from microposts.                           Bing and Google on a daily basis. Instead of teaching the
    Second, we show how high precision classifiers for this prob-                          machine how to speak like us, we ended up learning a simple
    lem can be built out of Web data and search engine logs,                               yet effective way of expressing our information needs.
    combining traditional information retrieval metrics, such as                              In this work we start from the premise that full under-
    inverted document frequency, and new ones derived from                                 standing of natural language cannot be achieved with the
    search query logs. Finally, we have proposed and evaluated                             current technology. In fact, not even a human being is able
    relevant applications for these classifiers, which were able                           to fully understand all communication due to either lack of
    to meet precision ≥ 72% and recall ≥ 56% on unambigu-                                  cultural context or to inherent ambiguity in the language.
    ous keyword extraction from microposts. We also compare                                This is especially true in the context of microposts, where
    those results with closely related systems, none of which                              both the media culture and technical constraints impose a
    could outperform those numbers.                                                        limit on the size of the messages. Given these limitations,
                                                                                           we have aimed to detect parts of natural language that can
    Categories and Subject Descriptors                                                     be unambiguously understood in the lack of any context.
                                                                                           The key observation that allowed us to identify those parts
    H.3.1 [Information Storage and Retrieval]: Content                                     of natural language is that they tend to be used as search
    Analysis and Indexing—Linguistic Processing; I.2.7 [Artificial                         engine queries on the Web [17].
    Intelligence]: Natural Language Processing—Text Analy-                                    For example, when using a search engine, if the user wants
    sis                                                                                    to know about Michael Jackson, the American singer, dancer,
                                                                                           and entertainer, she expects to type these two words in the-
    General Terms                                                                          search box and get relevant results. On the other side, if
    Algorithms, Experimentation                                                            she is looking for Michael Jackson, the social anthropologist
                                                                                           from New Zealand, she knows that she will need to further
                                                                                           qualify the query.
                                                                                              In this work we show that search engine query logs are a
    Permission to make digital or hard copies of all or part of this work for              valuable source of data to build classifiers that can identify
    personal or classroom use is granted without fee provided that copies are              unambiguous concepts in a language. In our work, the fea-
    not made or distributed for profit or commercial advantage and that copies             tures for such classifiers are calculated over a crawl of the
    bear this notice
    Copyright      c and theheld
                      2012    full citation on the first page. To copy otherwise, to
                                   by author(s)/owner(s).                                  Web and all queries issued in evenly spread days of a large
    republish,
    Published  to as
                  postpart
                       on servers
                           of theor    to redistributeWorkshop
                                    #MSM2012           to lists, requires prior specific
                                                                    proceedings,
                                                                                           commercial search engine. Classifiers like these seem espe-
    permission  and/or aasfee.
    available online        CEUR Vol-838, at: http://ceur-ws.org/Vol-838
    MSM   ’12 Lyon,April
    #MSM2012,         France16, 2012, Lyon, France.                                        cially suited to process short, noisy, conversational texts,
    Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$10.00.                                       which since recently have become widely available on the


                                                                                                                                                        18
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
    Web. We show experimental results from shares and micro-          rize a given text, even if each of the individually extracted
    posts from both Facebook1 and Twitter2 . We also propose          keywords might be ambiguous outside that set. Similarly,
    two different applications to the classifiers built in this pa-   Named Entity Recognition systems look for entities that
    per.                                                              may or may not be unambiguous outside their context, such
      The rest of this paper is organized as follows. Section 2       as Ford However, in our problem definition, only the key-
    discusses existing work that is relevant to our system. In        words Ford Motors, Henry Ford or Gerard Ford should be
    Section 3 we discuss in detail how the classifiers proposed       extracted. Finally, we have no knowledge of the microposts
    in this paper are built and in Section 4 we present num-          being analyzed, preventing us from using domain specific
    bers assessing their effectiveness. A discussion of potential     features.
    applications for the system is shown in Section 5. Finally,
    Section 6 contains our conclusions about this work.               3.   DETECTING UNAMBIGUITY
                                                                         The ultimate goal of this work is to develop classifiers
    2. RELATED WORK                                                   that detect unambiguous keywords in a language. As far
       Krovetz and Croft observed that 75% of early information       as the knowledge of the authors goes, this is the first work
    retrieval systems queries are unambiguous [8]. This obser-        proposing such classifiers. In order to formally present the
    vation has been later corroborated by a survey from Sander-       problem, we will first introduce a few relevant concepts.
    son [17], where the impact of word sense disambiguation in           A common step previous to the processing of any corpus
    information retrieval systems is studied. Although we do          is the segmentation of the text in the documents. This is
    not rely only on that observation, one of the core hypothesis     a complex problem which is beyond the scope of this paper
    of this paper is that, to a lesser extent, this continues to be   and we assume that there is a state-of-the-art segmenter
    true for modern search engine queries, as long as the query       available3 . The output of the segmentation process is a set
    does not show often in the query logs with further qualifi-       of keywords. One keyword can be composed by one word or
    cation. For example, the query Washington often needs to          by a sequence of words – in the latter case we also refer to
    be refined as George Washington or Washington (state) or          it as an n-gram or compound.
    even Washington, D.C. while the query Canada often does              The Merriam-Webster dictionary4 defines ambiguity as
    not. This hypothesis is the central idea behind the metric        (1) doubtful or uncertain especially from obscurity or indis-
    described in Session 3.4.2, which has shown to be one of the      tinctness or (2) capable of being understood in two or more
    strongest signals described in this paper.                        possible senses or ways. However, there is a shortcoming in
       The usage of the Web as an implicit training set for Natu-     this definition, since it relies on human interpretation. One
    ral Language Processing problems and ambiguity resolution         person can say that a given word is ambiguous while an-
    in particular is presented in [15], where the author shows        other could disagree. It turns that both could be right since
    that the usage of simple algorithms and features extracted        the interpretation of the senses of a word can be done at
    from large amounts of data yield competitive results with         different granularities.
    sophisticated unsupervised approaches and close results to           Lapata and Keller [11] define ambiguity, or its comple-
    that of supervised state of the art approaches.                   ment, unambiguity, as function of a frequency of the senses
       Wacholder et al. studied the problem of disambiguation         that a given word or compound shows in a large corpus. We
    of proper names in [20]. Nadeu and Sekine conducted a             will instead use the terminology of the semiotics model by
    survey [14] on the related field of Named Entity Recogni-         Ferdinand de Saussure [18], which yields a more intuitive
    tion and Classification (NERC). Finally, the broader area of      definition for the scope of our work.
    Word Sense Disambiguation is discussed in depth by Nav-
    igli [16].                                                           Definition 1. Let a pair (f, c) be a sign, being f the form
       The problem of extracting information from microposts          of the sign and c the concept it represents.5 Let L be a
    has gained significant attention recently. In [4], Choudhury      language and S the set of all signs used in that language. Let
    and Breslin have presented a classifier for Twitter posts able    the document frequency of a sign df (f, c) in S be the number
    to detect players and associated micro-events in a sports         of documents the sign appear in a large corpus representative
    match, achieving a f-measure of 87%. Using knowledge of           of the language
                                                                                   P L. We say that f is unambiguous if and only
    the domain, such as sports jargon and names of players, they      if df (f, c)/ df (f, c0 ) > α.
    are able to disambiguate the Tweets. Li et al. proposed us-
                                                                        In other words, we say that f is unambiguous if one of
    ing a keyword extraction system for targeting ads to Face-
                                                                      the concepts it may represent is α times more frequent in
    book updates in [12], one of the applications we discuss in
                                                                      documents of the corpus than all the others combined. For
    Section 4. Signals based on capitalization and document fre-
                                                                      our purposes, f is always a word or a compound word, and
    quency are present in their work, but they did not explore
                                                                      given that restriction, we will use Definition 1 as the basis for
    any of the query log derived metrics.
                                                                      the problem being discussed through the rest of the paper.
       Although the problems presented by the works discussed
    above share similarities with ours, none of their techniques      3.1 Keyword Evaluation Methodology
    can be directly applied. Word Sense Disambiguation is fo-
                                                                         Given a keyword q, an human evaluator can use Defini-
    cused on finding senses of ambiguous terms in the local con-
                                                                      tion 1 to rate it as ambiguous or unambiguous. From this
    text, and does not discuss the properties of a given keyword
                                                                      3
    outside its context. Also, traditional keyword extraction sys-      In this paper we use a segmenter developed internally at
    tems extract a set of keywords that characterize or summa-        Google, Inc.
                                                                      4
                                                                        www.merriam-webster.com
    1                                                                 5
        www.facebook.com                                                In the original, form and concept are called signifier and
    2
        www.twitter.com                                               significant, respectively.


                                                                                                                                      19
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
    definition, the evaluator should look at all the web docu-         more popular a keyword is, the better the signal-to-noise
    ments containing q to understand the sense of the keyword          ratio we have on it for all metrics. Figure 1 shows the IDF
    in each of them. In practice, we do not need to go through         distribution for unigrams and keywords found on the Web
    all the documents in the corpus to find whether a keyword          collection.
    is unambiguous. Instead, we select, from the web, 16 ran-
    dom documents that contain q and manually check if all
    occurrences refer to the same concept, i.e., all positive oc-
    currences. If yes, we say that the keyword q is unambiguous.
       The choice of 16 random documents is legitimate. Since
    the evaluation of each random document that contains q is
    an experiment that has only two possible answers, we can
    assume it is a random sampling over a binomial distribution.
    Then, by using the Wilson method [21], we can calculate
    the binomial proportion confidence interval for a sample size
    of 16 with all of them positive occurrences, i.e. p̂ = 1.0,
    which result is [0.8, 1.0] with center at 0.9. This interval
    gives us the α of Definition 1, which in this case will have
    a lower bound of 0.8 and an expected value of 0.9 with a            Figure 1: IDF distribution for the web collection.
    95% confidence. In other words: Given a keyword that was
    human-evaluated as unambiguous, we have 95% of chance
    that this keyword will refer to the same dominant concept
    in 80% of the corpus, but more likely it will be 90% of the
    corpus. This is the threshold we decided to use to assume a
    keyword is unambiguous in our human evaluations.
       Using this methodology we have built a reference-set with
    2634 ambiguous and 426 unambiguous keywords to be used
    in the analaysis of the metrics presented in the next sections
    and as training-set input to the Machine Learning approach
    at Section 3.6.

    3.2      Model Generation
      There are two main source of signals for the unambiguous
    keywords classifiers presented here. The first is a sample           Figure 2: IDF distribution for the reference-set.
    with billions of documents of Google’s search engine web
    collection. The second is as sample with billions of query            The histograms for IDF distribution among the reference-
    entries from Google’s query log corpus collected in evenly         set is plotted in Figure 2. The top chart shows the histogram
    spread days.                                                       for ambiguous keywords while the bottom shows the unam-
      The model generation is composed by a chain of off-line          biguous. This histogram shows that ambiguous keywords
    calculations of statistical properties of all signs in these two   tends to have lower IDF values. The lowest ones are lan-
    corpora and takes a few thousands of cpu-hours to complete.        guage constructions such as just and this. The misspellings
    These properties will serve as the basis for the classification    and uncommon person names lies in the very high IDF
    algorithms. This is an one-off work that only needs to be          range. While unambiguous keywords tends to have higher
    redone whenever there is an opportunity and/or the need of         IDF values, there is a big overlap with lots of unambiguous
    improving the performance of the system.                           ones in the mid-lower IDF range, such as White House and
                                                                       Disney. This overlap makes it hard for the IDF metric to
    3.3      Web Collection Metrics                                    be used to separate both sets. However, we can apply it for
       For every keyword resulting from the segmentation, we           filtering language constructions and misspellings.
    compute several properties using a large MapReduce-like
    system [5] visiting all the documents of the Web corpus.           3.3.2    Caps First Ratio
    In the next sections we explain each property and give a              Caps First Ratio (CFR) is the ratio that a given keyword
    histogram of its distribution among the keywords. Addi-            shows up on the Web collection with the first letter capital-
    tionally to the plain histograms, we present two additional        ized and we interpret it as strong indicator of names. We
    complementary histograms, one for the probability density          implemented the techniques from [13] to detect capitalized
    of the metric among the ambiguous and another one for the          keywords.
    unambiguous keywords of the reference-set defined in Sec-             The CFR metric has the obvious property of detecting
    tion 3.1.                                                          nouns, but it has another subtle interesting characteristic.
                                                                       Several noun compounds include, as an extra qualifier, sub-
     3.3.1    Inverse Document Frequency                               compounds or unigrams that are unambiguous by them-
       The first metric, the Inverse Document Frequency (IDF),         selves, for example Justin Bieber and Bieber are both unam-
    is the same found in the traditional information retrieval         biguous. In this case, we consider the occurrence of every
    literature. It is computed over the Web document collection        capitalized word not only in the CFR calculation of the com-
    and is a proxy of the popularity of the keyword. It also serves    pound it belongs to – Justin Bieber – , but also in the CFR
    as a good confidence indicator for all remaining metrics. The      score of the sub-compounds and unigrams of that compound


                                                                                                                                   20
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
    – Bieber. This helps increasing the CFR score of nouns that      much less often in the query stream. Because of that we can
    act as unambiguous qualifiers. For example, for the Bieber       say it is safe to discard anything that is not popular in the
    unigram, using only the initials for legibility, we calculate:   query stream.

                            count(JB) + count(B)
    CF R(B) =
                 count(jb) + count(b) + count(JB) + count(B)


                                                                     Figure 5: SIDF distribution for an infinite stream of
                                                                     web text.


     Figure 3: CFR distribution for the web collection.


                                                                      Figure 6: SIDF distribution for the reference-set.


      Figure 4: CFR distribution for the reference-set.              3.4.2    Sessions Exact Ratio
                                                                        A key metric that comes from the query stream analysis
      Figure 3 shows, the CFR distribution seen on Web docu-
                                                                     is the Sessions Exact Ratio (SER). It tells how often a given
    ments. The reference-set histograms at Figure 4, are more
                                                                     keyword shows up by itself in the search box. This is the
    heterogeneous than the IDF counterpart. The mid-low range
                                                                     strongest indicator that this keyword is unambiguous when
    of CFR values includes mostly only ambiguous keywords,
                                                                     taken out of context. Figures 7 and 8 shows the histogram
    while the unambiguous histogram has a sharp growth in the
                                                                     for this metric on the Web collection and the reference-set re-
    high values.
                                                                     spectively. As can be seen, the ambiguous and unambiguous
    3.4      Query Log Metrics                                       reference-set is mostly separable. Some examples of unam-
                                                                     biguous keywords in the very high range of the histogram
      Query logs have proved to be a valuable source of infor-
                                                                     are: Tom Hicks, Madison Square Garden and Groupon.
    mation for several fields of computer science [19]. In our
    work we collected data from three evenly spread days worth
    of queries in the logs of a large search engine. As with the
    Web corpus, we compute the following metrics for each key-
    word generated by the segmentation of the query log corpus.

     3.4.1    Sessions Inverse Document Frequency
       The Sessions Inverse Document Frequency (SIDF) is anal-
    ogous to the Web metric of the same name, but it is calcu-
    lated over the search engine query stream. Each session [19]
    is considered as a document. Figures 5 and 6 presents the
    distribution of this metric for the query stream and for the
    reference-set respectively. This signal has similar properties
    to its Web counterpart, but with a bias towards concepts
    and against intrinsic language characteristics. By compar-       Figure 7: SER distribution for an infinite stream of
    ing Figures 1 and 5, one can draw an interesting conclu-         web text.
    sion: stopwords and auxiliary language constructions appear


                                                                                                                                   21
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
                                                                       The histograms shown here are not just a tool to help
                                                                     visualize how each metric may be used to dissociate both
                                                                     sets, but more than that, it is an evidence that the metrics
                                                                     used here can succeed in building an effective classifier.
                                                                     3.5 A hand-crafted classifier
                                                                        In this section we present a simple hand-crafted algorithm.
                                                                     It was developed upon the discussions and histogram obser-
                                                                     vations of above metrics, regarding the reference-set separa-
                                                                     tion. We use this algorithm to expose the ideas without
                                                                     adding the complexity that inherently comes with tradi-
                                                                     tional machine learning techniques, as well as to avoid hid-
      Figure 8: SER distribution for the reference-set.              ing the interesting properties of the data under analysis.
                                                                     Later in this paper we present a Support Vector Machines
                                                                     (SVM) approach for the same classification task. Refer to
     3.4.3   Search Bias                                             Section 3.6 for more details.
       The last metric, Search Bias (SB), is not directly derived
    from the query stream, but rather obtained through a com-         Algorithm 1: The IsUnambiguous Algorithm.
    bination of sessions and Web signals. Search Bias can be         1 begin
    thought as the ratio of appearance of a keyword in the query     2    if sidf > 15 then return false;
    logs corpus divided by the ratio of appearance of the same       3    if uni ∧ idf > 12 ∧ sidf > 12 then return false;
    keyword on the Web corpus.                                       4    if cf r < 0.4 then return false;
                                                                     5    if ser < 0.35 then return false;
                                                                     6    if sb < 0.01 then return false;
                                                                     7    if cf r + ser + sb < 1 then return false;
                                                                     8    if charcount < 3 then return false;
                                                                     9    if blacklisted then return false;
                                                                    10 end


                                                                        Algorithm 1 presents our hand-crafted approach for the
                                                                     unambiguity classification problem. Each line is a filter of
                                                                     ambiguous keywords. In Figure 11 one can see how many
                                                                     keywords are being discarded as the classifier is applied on
                                                                     top of the Web corpus. In the end, only 3.8% of all keywords
    Figure 9: SB distribution for an infinite stream of              occurrences are considered unambiguous.
    web text.                                                           The Sessions IDF funnel component corresponds to Line 2
                                                                     of the algorithm. Its goal is to remove everything that is too
                                                                     rare, such as misspells. Usually, plain IDF is used for this
                                                                     type of cutting, but bigrams and larger keywords have a very
                                                                     high IDF on the Web corpus. In the query logs corpus, how-
                                                                     ever, large keywords appear much more often and anything
                                                                     that is not too rare will not be filtered by this rule.


      Figure 10: SB distribution for the reference-set.

      The naive calculation of this number leads to a value with
    bad properties due to very common words on the Web cor-
    pus and the high frequency of compounds in the query log
    corpus. To avoid those issues, search bias is calculated tak-
    ing into account only “noble” occurrences of a keyword in
    Web and query logs corpora. For the Web, we consider that        Figure 11: The percentage of the input text key-
    only capitalized occurrences are noble, while from query logs    words each line of the algorithm filters. The most
    we only consider those occurrences where the keyword ap-         important filters are highlighted in gray.
    pear by itself in the query. The distribution of this metric
    can be seen on Figures 9 and 10.                                   In Line 3, the Rare Unigrams filter unigrams typos that


                                                                                                                                  22
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
    are not rare enough to be discarded by Sessions IDF and               and false-negative errors. After doing a simple grid-search
    also come with all types of capitalization. Since unigrams            of both parameters using cross-validation, we picked a value
    are more frequent than compounds, we can apply a more                 around 0.05 for γ and 0.1 for C.
    restrictive threshold.
       The Low Caps filter comes from Line 4 of the algorithm.
    Not only it is responsible for restricting the classifier to          4.    EXPERIMENTAL RESULTS
    nouns, it also rejects all the general nouns. For example,               In this section we present experimental results obtained
    the noun ball has a very low caps first ratio, but Wilson             with the implementation of the classifier described in Sec-
    NCAA Reaction Basketball is almost always typed all caps.             tion 3.5 and the SVM described in Section 3.6.
       The most powerful feature for our classifier, the Sessions
    Exact Ratio (SER) filter, is used in Line 5. It reflects the          4.1   Test set definition for manual evaluation
    key intuition that our work builds upon: users know that                 The input of each classifier is a chunk of free form text,
    search engines have little context to understand their com-           from now on referred to as a micropost, and the output is
    munication and because of that they formulate unambiguous             a list of keywords assumed by the classifier to represent un-
    queries.                                                              ambiguous concepts from the input. We decided to use a
       The derived metric Search Bias is used in Line 6. Some             micropost as a unit of evaluation, in opposition to a single
    keywords, like Unfortunately, tend to have both high CFR              keyword, because we believe the results from this analysis
    – because they are used to start phrases – and high SER               represent better the real world applications. In order to
    – because they have low query volume. This filter detects             not favor our classifier, the rater is instructed to mark a
    those language constructions that are way more common in              micropost as False Positive if she finds one ambiguous key-
    the Web corpus than in the search corpus and discards them.           word classified as unambiguous, even if the classifier also
       The combined Caps+Exact+Bias filter in Line 7 is the               correctly extracted another unambiguous keyword from the
    most complex part of the algorithm. Its goal is allow us to           same micropost.
    reduce the thresholds of the individual filters applied before           We selected two different sources of microposts to feed
    without incurring in a loss of precision. This filter will let        the classifier: Twitter and Facebook. This data set and the
    keywords that score very high in any of the metrics com-              reference-set that was used to train the SVM classifier are
    bined pass, but will discard those that have a low average            disjoint to avoid over-fit. Twitter is a social Web site where
    all around.                                                           users can send and read text-messages with up to 140 char-
       The Character Count is a simplistic filter as can be seen          acters, called tweets. We expect this kind of social Web site
    in Line 8. When dealing with two characters keywords, all             to have mostly conversational text and noise, which carry lit-
    the metrics have bad properties, and we simply discard all            tle information by themselves. To feed the classifiers, each
    of them. In fact, a perfect English classifier limited to two         tweet is considered independent from each other and its text
    character unigrams can be manually built by inspecting all            is used as input to the classifier. Facebook is a social net-
    the 626 possible combinations.                                        work site where users can create virtual connections with
       Finally, the last step of the algorithm is the Blacklist filter,   their friends. One Facebook feature is the status message
    in Line 9. Some keywords have very extreme metrics and                updates. The status message is much like a tweet, but the
    tend to pass through all the filters, and we simply blacklist         140 characters limit is not imposed. Since users can follow-
    them. For English, we currently blacklist 40 greetings ex-            up on their friends updates, the entire conversation is used
    pressions, such as Happy birthday and Hello and some very             as input for the classifiers.
    common words like Indeed. In fact, the metrics for those
    keywords are so extreme that by looking at the top values             4.2   Methodology for manual evaluation
    for each metric one can easily spot them. We also black-
                                                                             The methodology used for manual evaluation consists of
    listed the Web sites names Google, Facebook, Twitter and
                                                                          collecting a random set of microposts from each data source
    Gmail because, although unambiguous, they are so common
                                                                          and feeding each one into the classifiers. The original micro-
    in our evaluation data that they would positively benefit our
                                                                          post and the output of the classifier are then shown to three
    results with no interesting characteristics.
                                                                          different raters. The output might be empty, in the case the
                                                                          classifier did not find any unambiguous keyword in the mi-
    3.6    A Machine Learning approach                                    cropost. Otherwise it contains at least one keyword, which
       To challenge the hand-crafted algorithm presented in the           was classified as unambiguous. In case of discordance, it is
    previous section and test if its intuitions were correct, we          discussed until consensus is reached. Regardless of the clas-
    employ a Support Vector Machines (SVM) algorithm to the               sifier output, raters must investigate the keywords present
    same classification task. It is a well-known technique and            in the micropost. They use the methodology presented in
    there is a ready to use state-of-the-art implementation, namely       Section 3.1 to rate each keyword q. Based on the output
    libSVM [3]. The down-side of the machine learning ap-                 of the classifier and the inspection of the micropost content
    proach is that labeled data acquisition for this novel clas-          carried out by the rater, each micropost is rated as below:
    sification task is challenging. The training set used was the            True Positive (TP): There are unambiguous keywords
    reference-set explained before, with the 2634 ambiguous and           in the micropost and the system has extracted at least one.
    426 unambiguous manually classified keywords. Each met-                  True Negative (TN): There are no unambiguous key-
    ric shown in sections 3.3 and 3.4 were rescaled to the 0-1            words and the system extracted nothing.
    range and used as SVM input features. We used only the                   False Positive (FP): The system has extracted an am-
    Radial Basis Function kernel present in libSVM: K(xi , xj ) =         biguous keyword.
    exp(−γ × kxi − xj k2 ), γ > 0. By tuning the γ and C pa-                 False Negative (FN): There are unambiguous keywords
    rameters we can control the trade-off between false-positive          in the micropost but the system has extracted none.


                                                                                                                                       23
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
       We use the output of the rating phase to compute the stan-    slightly better results. The precision is 80.19% (around 6%
    dard evaluation metrics in the Information Retrieval field,      better), whereas the recall is 55.55%. Again, the True Neg-
    such as precision, recall, accuracy and F-score [1].             ative Rate is really high, 96.05%. The classifier has an ac-
       Both models – the hand-crafted and the SVM classifier –       curacy of 87% with an F-Score of 0.66. The high value for
    were built using the context available at the English Web,       the True Negative Rate is a sign that most conversations in
    but it must be considered that people have an incomplete         Social Networks like Facebook and Twitter are not proper
    cultural context and sometimes it may not be obvious that        for context-extraction systems such as content-targeted ad-
    a given keyword is unambiguous. For example, during the          vertisement if used without any pre-processing.
    evaluation one rater could not recognize upfront the key-           To compare the results of the two classifiers presented
    word Doug Flutie, which was extracted by the system. Even        above, we also evaluated two known systems: Yahoo! Term
    though this rater did not recognize this keyword, Doug Flu-      Extractor API – aimed to Keyword Extraction tasks – reached
    tie is indeed an unambiguous keyword because every person        18.98% of precision and 57.69% of recall for Facebook data,
    with culture about American Football will recognize him as       and 14.89% of precision and 77.7% of recall for Twitter data;
    Douglas Richard “Doug” Flutie, a famous football quarter-        and the Stanford Named Entity Recognizer [6] – aimed to
    back who played professionally in the United States Football     Named Entity Recognition and Classification tasks – reached
    League and, more importantly, the name Doug Flutie is not        35% of precision and 69.99% of recall for Facebook data, and
    used as an identifier in any other significant context besides   39.65% of precision and 74.19% of recall for Twitter data.
    that. Our precise definition of unambiguity prevents this           Both systems reached a higher recall, but for the real-
    problem, since the rater will learn the sense(s) of the key-     world applications discussed in section 5 we cannot afford
    word when looking at the 16 randomly sampled documents,          extracting a wrong keyword from a noisy text. In these ap-
    as it was the case in this example.                              plications precision is more important, and for both systems
                                                                     it is much lower than the two filters developed in this work.
    4.3   Numerical results                                          The high recall and low precision result is expected for these
     Table 1 presents the output of the raters for Facebook and      systems, since they were engineered for different tasks and
    Twitter microposts.                                              do not perform well for the unambiguity detection task de-
                                                                     fined here.
           Hand Algorithm                SVM
           Twitter Facebook        Twitter Facebook                  5.     APPLICATIONS
     TP      99       106            85        85                      Given the properties of the classifiers presented in this
     TN     494       480           511       510                    paper, we believe they are suited for a set of different appli-
     FP      38        34            22        21                    cations that are becoming more important given last devel-
     FN      74        64            87        68                    opments on the Web industry.
    Table 1: Break-down of metrics for Twitter and                   5.1 Ad targeting in Social Networks
    Facebook.                                                           In Social Networks users keep updating their status mes-
                                                                     sages (or tweets) with what they have in mind. The number
    Twitter                                                          of daily updates in the most prominent networks is huge6 ,
                                                                     turning this channel into a potential candidate for input of
    Following the experimental methodology we analyzed a set         content-targeted advertisement systems [12]. For instance,
    of 705 tweets, which were randomly selected from a set           it is just fine to deliver an advertisement piece like Buy tick-
    of 170k tweets that were crawled by a fairly random walk         ets to the coming Jonas Brothers show!, right next to a mi-
    in the Twitter graph. We used these tweets as input of           cropost where a user claims to be the biggest fan of this
    both classifiers and presented the output to the raters. The     music group. However, the conversational text brings even
    hand-crafted classifier was able to reach precision of 72.26%,   more complexity for the already tough task [9] of deliver-
    and sensitivity (recall) of 56.22%. The True Negative Rate       ing content-targeted ads. Feeding these systems with noisy
    (TNR, or specificity) is high (92.86%), upon what one can        text may lead them to return non-relevant ads. One can use
    conclude that most tweets do not contain unambiguous key-        the classifiers proposed in this paper as a filtering layer on
    words. For Twitter the system reached an accuracy of 84%         top of current content-targeted advertisement systems. The
    with an F-Score of 0.64. The SVM model reached a preci-          filter would delegate calls to the ads systems only when it
    sion of 79.43%, i.e., a performance almost 10% better than       is possible to retrieve relevant content from the microposts
    the achieved by the hand-crafted algorithm. The achieved         being targeted.
    recall is 49.42%, considerably worse than the recall reached
    by the hand-crafted algorithm. The TNR of the SVM model          5.2 Automatic reference in Social Networks
    is 95.87%, and the system reached an accuracy of 85% with           User profiles in Social Networks could also be classified as
    an F-Score of 0.61.                                              unambiguous by using the profile name for example. When-
                                                                     ever a micropost has the unambiguous keywords that matches
    Facebook                                                         a profile name, a link to that profile could be added, instead
    Following a random selection strategy similar to Twitter,        of just pure text. This could be done for celebrity profiles, for
    we collected 684 conversations that took place around Face-      example, when a user posts “I just watched the last Quentin
    book status message updates. For this data set, the hand-        Tarantino movie.”, a link to the Quentin Tarantino profile
    crafted system reached a precision of 75.71% and recall of       could be added.
    62.36%. The TNR was of 93.39%, whereas the accuracy
                                                                     6
    reached 85% with an F-score of 0.68. The SVM model got               http://news.cnet.com/8301-13577 3-10378353-36.html


                                                                                                                                     24
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·
    6.   CONCLUSIONS                                                  [6] J. R. Finkel, T. Grenager, and C. Manning.
       In this paper we presented a novel classification problem          Incorporating non-local information into information
    aimed at identifying the unambiguous keywords in a lan-               extraction systems by gibbs sampling. In Proceedings
    guage, and formally defined it together with an evaluation            of the 43rd Annual Meeting on Association for
    methodology. We also have presented two different algo-               Computational Linguistics, ACL ’05, pages 363–370,
    rithms for the classification problem and the corresponding           Stroudsburg, PA, USA, 2005. Association for
    numerical results achieved by both of them. The proposed              Computational Linguistics.
    algorithms are built on top of traditional information re-        [7] A. Halevy, P. Norvig, and F. Pereira. The
    trieval metrics and novel metrics based on the query log              Unreasonable Effectiveness of Data. IEEE Intelligent
    corpus. The introduction of these metrics, Sessions IDF,              Systems, 24:8–12, 2009.
    Sessions Exact Ratio and Search Bias, is by itself an impor-      [8] R. Krovetz and W. B. Croft. Lexical Ambiguity and
    tant contribution. We believe these metrics will be useful in         Information Retrieval. ACM Transactions on
    other problem domains as well.                                        Information Systems, 10:115–141, April 1992.
       Our evaluation have shown that our classifiers were able       [9] A. Lacerda, M. Cristo, M. A. Gonçalves, W. Fan,
    to meet precision ≥ 72%, recall ≥ 49%, accuracy ≥ 84%                 N. Ziviani, and B. Ribeiro-Neto. Learning to
    and F-Score ≥ 0.61, even when the input is composed by                Advertise. In SIGIR ’06: Proceedings of the 29th
    the noisy microposts from Facebook and Twitter, two of                Annual International ACM SIGIR Conference on
    the biggest sites in the world nowadays, outperforming two            Research and Development in Information Retrieval,
    known systems from the traditional keyword extraction and             pages 549–556, New York, NY, USA, 2006.
    Named Entity Recognition fields.                                 [10] M. Lapata and F. Keller. Web-based Models for
       Another interesting aspect of the presented work is that it        Natural Language Processing. ACM Transactions on
    diverges from the bag-of-words analyses that dominate the             Speech and Language Processing, 2(1):3, 2005.
    research in the area. Instead, we have focused on directly       [11] M. Lapata and F. Keller. An Information Retrieval
    finding the specific keyword that define a concept, avoiding          Approach to Sense Ranking. In Human Language
    the shortcomings that come from having a representation               Technology Conference of the North American Chapter
    that cannot be understood by a human or does not meet                 of the Association of Computational Linguistics, pages
    the expectations of other systems. This leads immediatelly            348–355, 2007.
    to our future work proposal of using the extracted keywords      [12] Z. Li, D. Zhou, Y.-F. Juan, and J. Han. Keyword
    as beacons for further qualification of other keywords in the         Extraction for Social Snippets. In Proceedings of the
    text. For example, the extracted keyword Lionel Messi can             19th ACM International Conference on World wide
    be used to anchor the word goal to the concept of scoring             web, pages 1143–1144, New York, NY, USA, 2010.
    in the soccer sport, instead rather the more general idea        [13] A. Mikheev. A Knowledge-free Method for Capitalized
    of an objective to be achieved. We expect this inside-out             Word Disambiguation. In Proceedings of the 37th
    approach for extracting semantics in microposts to perform
                                                                          Annual Meeting of the Association for Computational
    better than traditional word collection approches.
                                                                          Linguistics on Computational Linguistics, pages
       More and more researchers have access to query logs and            159–166, Morristown, NJ, USA, 1999.
    many may directly benefit from the metrics proposed here
                                                                     [14] D. Nadeau and S. Sekine. A survey of named entity
    either to tackle the same classification problem or to inno-
                                                                          recognition and classification. Linguisticae
    vate in their own domains. For the industry, we have shown
                                                                          Investigationes, 30(1):3–26, January 2007.
    a solution for extracting information from microposts, a type
    of content that has experienced tremendous growth on the         [15] P. Nakov and M. Hearst. Using the Web as an
    Web in the recent past.                                               Implicit Training Set: Application to Structural
                                                                          Ambiguity Resolution. In HLT ’05: Proceedings of the
                                                                          conference on Human Language Technology and
    7. REFERENCES                                                         Empirical Methods in Natural Language Processing,
     [1] R. A. Baeza-Yates and B. Ribeiro-Neto. Modern                    pages 835–842, Morristown, NJ, USA, 2005.
         Information Retrieval. Addison-Wesley Longman               [16] R. Navigli. Word Sense Disambiguation: A Survey.
         Publishing Co., Inc., Boston, MA, USA, 1999.                     ACM Comput. Surv., 41(2):1–69, 2009.
     [2] M. Banko and E. Brill. Scaling to Very Very Large           [17] M. Sanderson. Retrieving with Good Sense.
         Corpora for Natural Language Disambiguation. In                  Information Retrieval, 2(1):49–69, 2000.
         ACL ’01: Proceedings of the 39th Annual Meeting on          [18] F. D. Saussure. Course in General Linguistics. Open
         Association for Computational Linguistics, pages                 Court Pub Co, 1986.
         26–33, Morristown, NJ, USA, 2001.                           [19] C. Silverstein, H. Marais, M. Henzinger, and
     [3] C.-C. Chang and C.-J. Lin. LIBSVM: a Library for                 M. Moricz. Analysis of a Very Large Web Search
         Support Vector Machines. Software available at                   Engine Query Log. SIGIR Forum, 33(1):6–12, 1999.
         http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.             [20] N. Wacholder, Y. Ravin, and M. Choi. Disambiguation
     [4] S. Choudhury and J. Breslin. Extracting semantic                 of Proper Names in Text. In Proceedings of the Fifth
         entities and events from sports tweets. In Making                Conference on Applied Natural Language Processing,
         Sense of Microposts (#MSM2011), pages 22–32, 2011.               pages 202–208, Morristown, NJ, USA, 1997.
     [5] J. Dean and S. Ghemawat. Mapreduce: Simplified              [21] E. B. Wilson. Probable Inference, the Law of
         Data Processing on Large Clusters. In Sixth                      Succession, and Statistical Inference. Journal of the
         Symposium on Operating System Design and                         American Statistical Association, 22(158):pp. 209–212,
         Implementation, pages 137–150, 2004.                             1927.


                                                                                                                               25
· #MSM2012 · 2nd Workshop on Making Sense of Microposts ·