Twice – Twitter Content Embeddings

Xianjing Liu, Behzad Golshan, Kenny Leung, Aman Saini, Vivek Kulkarni, Ali Mollahosseini and Jeff Mo
Twitter Cortex

DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17–21, 2022, Atlanta, USA

Abstract
In this short paper, we describe a new model for learning content-based tweet embeddings that are generically useful as signals for a variety of down-stream predictive tasks. In contrast to prior approaches that only leverage cues from the raw text, we take a holistic approach and propose Twice, a model for learning tweet embeddings that (a) leverages cues beyond the raw text (including media) and (b) yields representations optimized for overall similarity: a combination of topical, semantic, and engagement similarity. Offline evaluations on both academic benchmarks and Twitter product tasks suggest that our model yields richer and superior embeddings compared to the benchmark models.

Keywords
Deep Learning, Recommendation, Embedding, NLP

1. Introduction

A rich representation of a tweet that captures nuances in meaning is critical for most predictive models at Twitter (including Topics, Health, Recommendations, etc.). Consequently, there is an urgent need for models that can encode or summarize a tweet’s content into a dense representation – a representation that can then be used in various downstream models. Some key surface areas that will use tweet embeddings include Home Timeline, Notifications, Topics, and potentially Health models. An important requirement is that the tweet embedding be generically useful on a variety of predictive tasks, not merely useful for a single a priori task. Finally, we expect downstream models to only consume the embeddings, without having to worry about the inner workings of the underlying model.

Taking a bird’s eye view, the need is simply for a tweet representation/embedding that captures similarity between tweets by embedding them in a vector space. Specifically, tweets that are “similar” must be close in the vector space, and tweets that are not “similar” should ideally be far apart in this vector space with respect to a suitable metric. Zooming in, for practical modeling it is useful to operationalize this vague notion of “similar” and be more specific. We attempt to capture the following notions of similarity: (a) Semantic Similarity – tweets that are similar in meaning should be close in the embedding space. An example pair of semantically similar sentences is “The quick brown fox jumped over the lazy dog” and “The brown fox leaped over the lazy dog”. (b) Topic Similarity – tweets that are about the same topic should be close in the embedding space. An example pair here would be “Don Bradman is the greatest cricket player of all time” and “India won the Cricket World Cup”, since both tweets are about “Cricket”. (c) Engagement Similarity – tweets that share engagement audiences are deemed similar.

In this paper, we present Twice – a model that attempts to capture the above notions of tweet similarity. In contrast to most prior work, which only seeks to embed raw tweet text using standard pre-trained language models, Twice models tweets holistically, leveraging not only the raw tweet text but also incorporating cues from the associated media and hyperlinks. We evaluate Twice on a suite of offline benchmark tasks and demonstrate that our proposed model significantly outperforms several baseline approaches.
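To make “close in the vector space” concrete, the following minimal sketch compares tweet embeddings with cosine similarity, the metric used throughout this paper. The three-dimensional vectors are hand-picked stand-ins for real encoder outputs, purely for illustration.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-picked toy embeddings for the example tweets above; a real
# encoder such as Twice would produce these vectors instead.
fox_a   = np.array([0.9, 0.1, 0.3])  # "The quick brown fox jumped over the lazy dog"
fox_b   = np.array([0.8, 0.2, 0.3])  # "The brown fox leaped over the lazy dog"
cricket = np.array([0.1, 0.9, 0.2])  # "India won the Cricket World Cup"

# The semantically similar pair should score higher than the unrelated pair.
assert cosine_similarity(fox_a, fox_b) > cosine_similarity(fox_a, cricket)
```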
2. Related Work

Our work is very closely related to work in the area of learning sentence embeddings, which seeks to learn dense representations of sentences that capture sentence similarity. One of the earliest works on learning dense embeddings of sentences is that of Le and Mikolov [1], which generalized the Skipgram word-embedding models [2] to learn sentence and paragraph embeddings. With the rise of convolutional and recurrent neural network models, several approaches to learning sentence embeddings were proposed [3, 4, 5, 6, 7]. Additionally, a couple of these works sought to learn representations of tweets by applying these networks to Twitter text [5, 4]. Finally, with the introduction of Transformers and pre-trained language models, the current state-of-the-art approaches use pre-trained language models coupled with contrastive loss functions to learn sentence embeddings [8, 9, 10, 11, 12, 13, 14, 15]. All of these approaches only look at embedding generic sentences and are not attuned to embedding tweets, where deeper semantic cues and multi-modal content can be used to obtain rich representations. Our model Twice in turn builds on these works, but also incorporates cues specific to tweets (like media and hyperlinks) to yield rich embeddings of tweets.

3. Twice

[Figure 1: The Twice model is a Bert-based model trained on a multi-task loss function.]

At its core, Twice is a Bert model [16] trained on a multi-task loss function. Figure 1 shows the architecture of the Twice model. In particular, we consider the following three tasks, each attempting to capture one of the notions of similarity noted in Section 1. More specifically, we optimize a standard Bert model on the following tasks:

• Topic Prediction: The task is to predict the concept topics associated with the tweet. This enables the representation to capture topical similarity. In particular, we optimize a binary cross-entropy loss, since this is a multi-label prediction setting where a tweet may be associated with multiple topics. For instance, the tweet “America is heading back to the Moon, folks. No astronauts, but likely to glean loads of data.” belongs to the “Space” and “Science” topics. The total number of concept topics in this task is 419.

• Engagement Prediction: Given a representation of the user (obtained by encoding a user biography) and a tweet, the task is to predict whether the user engages with the tweet. This task is essentially identical to the task in the well-known Clip (Contrastive Language-Image Pre-Training) model [17], except that instead of embedding images using an image encoder, we embed the user biographies using a standard Bert encoder. The loss function used is identical to the one described in [17]. Training on this task enables us to capture tweet similarity based on user engagement patterns, which may be particularly useful when downstream products want to maximize user engagement.

• Language Prediction: Because we desire multilingual support, we would like tweet representations to also encode language cues, so that tweets in the same language tend to be closer than ones in different languages. Therefore, we explicitly train on the task of predicting the language of the tweet and use the standard cross-entropy loss function for this task.

The full loss function is simply the average of the above three loss functions. Finally, to obtain a dense representation of the tweet, we simply use the representation of the [CLS] token.
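The PyTorch sketch below illustrates this multi-task setup, assuming the Hugging Face transformers library. It is a minimal illustration rather than the production implementation: the checkpoint name, the number of language labels, and the learnable temperature are assumptions, while the binary cross-entropy topic head over 419 topics, the Clip-style symmetric contrastive engagement loss [17] against a user-biography encoder, the cross-entropy language head, and the averaging of the three losses follow the description above.

```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel

class TwiceSketch(nn.Module):
    """Sketch of the Twice multi-task setup: one Bert encoder for tweets,
    a second one for user biographies (Clip-style), plus three loss heads."""

    def __init__(self, name="bert-base-multilingual-cased",
                 n_topics=419, n_langs=50, dim=768):
        super().__init__()
        # n_langs is an assumed label-set size; the paper does not state it.
        self.tweet_enc = AutoModel.from_pretrained(name)
        self.bio_enc = AutoModel.from_pretrained(name)   # user-biography tower
        self.topic_head = nn.Linear(dim, n_topics)       # multi-label topics
        self.lang_head = nn.Linear(dim, n_langs)         # single-label language
        self.temperature = nn.Parameter(torch.tensor(0.07))  # assumed init

    def cls(self, enc, inputs):
        # The dense tweet representation is the [CLS] token of the last layer.
        return enc(**inputs).last_hidden_state[:, 0]

    def forward(self, tweet_inputs, bio_inputs, topic_labels, lang_labels):
        t = self.cls(self.tweet_enc, tweet_inputs)   # (B, dim)
        b = self.cls(self.bio_enc, bio_inputs)       # (B, dim)

        # (a) Topic prediction: binary cross entropy over 419 concept topics.
        topic_loss = F.binary_cross_entropy_with_logits(
            self.topic_head(t), topic_labels.float())

        # (b) Engagement: Clip-style symmetric contrastive loss between tweet
        # embeddings and the biographies of engaging users (in-batch negatives).
        logits = (F.normalize(t, dim=-1) @ F.normalize(b, dim=-1).T) / self.temperature
        target = torch.arange(t.size(0), device=t.device)
        engage_loss = (F.cross_entropy(logits, target) +
                       F.cross_entropy(logits.T, target)) / 2

        # (c) Language prediction: standard cross entropy.
        lang_loss = F.cross_entropy(self.lang_head(t), lang_labels)

        # The full loss is the average of the three task losses.
        return (topic_loss + engage_loss + lang_loss) / 3
```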
Twice leverages cues from the entire tweet and not just the raw tweet text. In particular, in addition to the raw text, we also leverage media cues by obtaining media annotations for any associated media, as described in [18]. These media annotations are simply concatenated to the raw text via a separator token before being input to the model. Similarly, when a tweet has hyperlinks, we extract the first 100 tokens of the webpage title and description as encoded in the linked HTML page. These features are also appended to the input.
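A minimal sketch of this input assembly, assuming a Bert tokenizer, is shown below. The paper does not pin down the exact separator or field order, so the use of the [SEP] token and the text → media → link ordering here are assumptions; the 100-token cap on the linked page’s title and description follows the text above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_twice_input(text: str, media_annotations=None,
                      link_title_desc=None, max_link_tokens=100) -> str:
    """Assemble the holistic model input: raw tweet text, then media
    annotations, then up to 100 tokens of the linked page's title and
    description, joined by separator tokens."""
    parts = [text]
    if media_annotations:
        parts.append(" ".join(media_annotations))
    if link_title_desc:
        tokens = tokenizer.tokenize(link_title_desc)[:max_link_tokens]
        parts.append(tokenizer.convert_tokens_to_string(tokens))
    return f" {tokenizer.sep_token} ".join(parts)

# Hypothetical example: the media annotations would come from the media
# annotation model of [18], the link text from the tweet's hyperlink.
example = build_twice_input(
    "America is heading back to the Moon, folks.",
    media_annotations=["rocket launch", "launch pad"],
    link_title_desc="Artemis I: an uncrewed flight test around the Moon")
```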
Training procedure. Twice is trained on a dataset of 200 million tweets sampled over a 90-day interval. We also associate with these tweets the users who engaged with them (which the Clip task component requires). The model was optimized using standard Adam with weight decay and trained for 5 epochs until convergence.

4. Experiments

We evaluate Twice both quantitatively and qualitatively; we discuss each evaluation below.

4.1. Quantitative Evaluation

Setup. We quantify the effectiveness of tweet embeddings in capturing content similarity by measuring their performance on three benchmark tasks – tasks that reflect how well the embeddings capture the notions of similarity noted in Section 1:

• SemEvalPIT [19]. This is an academic benchmark consisting of about 1000 pairs of tweets with similarity scores obtained from human judgments. We measure the performance of embeddings on this task by computing the Spearman correlation between the similarity scores obtained for these tweet pairs in embedding space and the human judgments. Higher correlations reflect better alignment with human-derived similarity judgments.

• Recalling Favorites (Favs). In order to measure the effectiveness of these embeddings in downstream predictive models of engagement, we consider the task of recalling (based on just a top-k nearest neighbor lookup) which tweets a given user favorites from more than 5k candidate tweets, given their past engagement history. Higher scores reflect better embeddings.

• Topic Assignment Precision (Topics). We compute the precision of topic assignments on a test set of topical tweets. Higher precision suggests better encoding of topical similarity. Once again, here we base our decisions only on a k-NN classifier.

Our rationale for restricting ourselves to a very simple nearest neighbor approach based on the cosine similarity of tweet embeddings is the intuition that higher-quality representations should inherently demonstrate a higher degree of “ease of extraction” of the predictive information. It is precisely for this reason that one needs to use simple models as opposed to very complex deep predictive models. We made a design choice to use a NN-based approach, which supported quick implementation, but simple shallow models are another alternative. Finally, we summarize model performance by reporting the harmonic mean over tasks.
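The sketch below illustrates this evaluation protocol: a brute-force cosine k-NN lookup (a stand-in for whatever nearest-neighbor index is used in practice), the Spearman correlation used for SemEvalPIT, and the harmonic-mean summary. As a sanity check, the final line reproduces the Mean (HM) entry of the Twice row of Table 1 from its three per-task scores.

```python
import numpy as np
from scipy.stats import spearmanr

def normalize(X: np.ndarray) -> np.ndarray:
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def knn_indices(queries: np.ndarray, candidates: np.ndarray, k: int = 10):
    """Top-k candidate indices by cosine similarity (brute force)."""
    sims = normalize(queries) @ normalize(candidates).T
    return np.argsort(-sims, axis=1)[:, :k]

def semeval_pit_score(emb_a, emb_b, human_scores):
    """Spearman correlation between embedding cosine similarity and
    human similarity judgments over a set of tweet pairs."""
    sims = np.sum(normalize(emb_a) * normalize(emb_b), axis=1)
    return spearmanr(sims, human_scores).correlation

def harmonic_mean(scores) -> float:
    scores = np.asarray(scores, dtype=float)
    return len(scores) / np.sum(1.0 / scores)

# Summarizing the Twice row of Table 1 (SemEvalPIT, Favs, Topics):
print(round(harmonic_mean([0.302, 0.102, 0.429]), 3))  # 0.194
```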
Baselines. We consider the following baseline models:

• Bert: This is the standard pre-trained Bert model [16] and serves as the simplest but strong baseline that one could use to embed tweets.

• Simcse: Simcse [13] is a state-of-the-art sentence embedding approach that learns sentence embeddings in an unsupervised manner. The main idea of Simcse is to pass a tweet X twice through Bert (with dropout enabled). This yields two different (noisy) representations of X, which are treated as a positive pair; X and all other examples in the batch are treated as negative examples. The objective is to maximize the cosine similarity of the representations of the positive pair and to minimize it between X and the negative examples (see the sketch after this list). It is to be noted here that we train Simcse on Twitter data.

• Hashspace: Hashspace is a Bert-based model trained on the task of hashtag prediction, where the model is optimized to predict the correct hashtag associated with a tweet from a set of 100K hashtags. Hashspace, as currently deployed at Twitter, only uses the raw tweet text as a cue to make predictions and does not model tweets holistically.

• Topicspace: Topicspace addresses two main limitations of Hashspace. First, in contrast to Hashspace, we model tweets holistically and leverage cues from media and hyperlinks as well. Second, we simplify the predictive task. Instead of learning to predict a label from a universe of 100K labels (hashtags), we only learn to predict one or more topics from a space of 419 concept topics. The intuition is that, to capture similarity, it is sufficient to capture fairly broad topics rather than to seek to capture extremely fine-grained hashtags. By making these two changes, we can learn a model with a better fit to the data, yielding richer representations.

• Clip: The original Clip model [17] is a neural network trained on 400 million (image, text) pairs. Our Clip model is identical to the original Clip model in architecture and is trained using a multi-modal method in which the model attempts to predict whether a piece of media and a piece of text come from the same tweet or not. In this setting we replace the image encoder in the original Clip model with a user-biography encoder.
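For concreteness, here is a minimal sketch of the unsupervised Simcse objective described in the second bullet above, shown with a generic public Bert checkpoint (in our experiments the model is instead trained on Twitter data); the temperature value follows the Simcse paper’s default [13].

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.train()  # keep dropout ON: two passes give two noisy views

def simcse_loss(tweets, temperature=0.05):
    """Unsupervised Simcse: encode each tweet twice with dropout; the two
    views form a positive pair, all other in-batch tweets are negatives."""
    inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    z1 = model(**inputs).last_hidden_state[:, 0]  # first pass ([CLS])
    z2 = model(**inputs).last_hidden_state[:, 0]  # second pass, new dropout mask
    # (B, B) matrix of scaled cosine similarities; diagonal = positive pairs.
    sims = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(sims, labels)

loss = simcse_loss(["good morning twitter", "India won the Cricket World Cup"])
loss.backward()
```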
Results. Table 1 shows the results of our evaluation on our benchmark suite. Based on these results we can draw the following conclusions: (a) First, observe that just using standard models like Bert to embed tweets does not yield superior embeddings; it is imperative to learn embeddings from Twitter data. (b) Second, state-of-the-art unsupervised methods for sentence embedding perform worse than supervised methods, which is in line with prior work on sentence embeddings. (c) Topicspace significantly outperforms Hashspace overall. This is because Topicspace leverages cues beyond the tweet text and also uses a simpler but more intuitive task. (d) Models like Topicspace and Clip, which optimize solely for a specific notion of similarity, tend to perform significantly better on the corresponding evaluation tasks than other models, simply because their underlying representations are optimized to capture that specific similarity notion over others. (e) Finally, note that our proposed model Twice generally outperforms all of these baseline approaches overall. While the mean performance of Twice and Clip is identical, Twice significantly outperforms Clip on both the SemEvalPIT and Topics tasks, with a slight drop on the Favs task.

             SemEvalPIT   Favs    Topics   Mean (HM)
Bert            0.025     0.043   0.063      0.038
Simcse          0.218     0.022   0.205      0.055
Hashspace       0.336     0.064   0.230      0.131
Topicspace      0.264     0.075   0.599      0.160
Clip            0.225     0.136   0.271      0.194
Twice           0.302     0.102   0.429      0.194

Table 1: Quantitative performance of various tweet embedding approaches on our benchmark suite. We report Spearman correlation for SemEvalPIT, top-k recall for Favs, and precision for Topics. Note that our model Twice outperforms most standard baselines, including the currently deployed Hashspace.

To summarize, Twice demonstrates superior performance over prior production models and yields improved embeddings of tweets by leveraging cues beyond the tweet text.

4.2. Quantitative Evaluation and Analysis of Usage in Health Products

Setup. We evaluate the performance of the content embeddings on our health platform. We use Twice embeddings as features in a shallow model to predict Spam and Terms of Service (ToS) violations. Spam prediction is a binary classification task: predict whether a tweet is spam or not. ToS violation prediction is a multi-label classification task with 8 labels in total; examples of ToS violation labels include ‘violence’, ‘threatening someone’, and ‘suicide’.

Results. Table 2 shows the average precision scores obtained when using the Twice embeddings for Spam detection and ToS violation classification. We compare the results with the benchmark models Bert and Hashspace. The results show that Twice outperforms both the standard Bert and Hashspace models on both tasks.

             Spam    ToS
Bert         0.41    0.328
Hashspace    0.44    0.286
Twice        0.47    0.347

Table 2: The average precision score of using Twice embeddings for the Spam detection and ToS violation classification tasks.

4.3. Qualitative Evaluation and Analysis of Usage in Content Recommenders

In order to evaluate our model qualitatively, we also built a web page where one can enter a tweet ID and see the nearest neighbors to the given tweet within a predetermined universe of tweets – a scenario that reflects the usage of embeddings for candidate generation in content recommenders. Figure 2 shows the nearest neighbors for a couple of seed tweets as a demonstration. Note that the nearest neighbors reflect the broad topic of the seed tweet and are similar in content to it, suggesting that our model is able to capture content similarity between tweets.

[Figure 2: Nearest neighbors to a couple of seed tweets in the Twice embedding space.]

While the process outlined in this section can be used for candidate generation in content recommenders (by finding tweets similar to a user’s interests and past engagements), through a qualitative analysis conducted offline we have identified a list of challenges that need to be addressed in order to ensure that good-quality candidates are returned:

• Seed Selection. Using tweets with which users have positively engaged as seeds to fetch more interesting content is a natural choice. However, some seed tweets may have very little content, which makes them unsuitable for retrieving candidate tweets. Examples include tweets that contain frequent phrases like “good morning”, everyday greetings, and daily life updates.

• Content Candidate Quality. Some tweets may have very little to no content and are non-topical. This includes tweets that may only contain single emojis, may be very short in length, or consist only of a shortened URL. Such candidates should not be part of the universe. This is by far the most pervasive and immediate challenge to be addressed.

• Health Considerations. Some tweets can be spam, violent, etc. Recommending such content is problematic and needs to be addressed before tweet representations can be used for content recommendations.

• Recency. Some tweets returned may be irrelevant (or at least non-engaging) simply because they are old and outdated. Candidate generators need to adapt their responses to the recency requirements of the product.

Note that these challenges are independent of the underlying tweet representation itself and may significantly hinder the quality of candidates even if the tweet representation model is of superior quality. To address these challenges, we build various filters and apply them to the candidate pool, and we also carefully design and pick desirable seed tweets so that high-quality candidates are generated for the user. A sketch of such filters follows.
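The following sketch illustrates the kind of filters involved, covering seed selection, candidate content quality, and recency. The phrase list and thresholds are illustrative assumptions rather than production values, and health filtering (e.g., using the spam model of Section 4.2) would run as an additional stage.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed phrase list; the production list of low-content seed phrases
# is not specified in the paper.
GREETINGS = {"good morning", "good night", "gm", "gn"}

def is_usable_seed(text: str) -> bool:
    """Reject low-content seeds such as everyday greetings."""
    return text.strip().lower() not in GREETINGS

def is_quality_candidate(text: str, created_at: datetime,
                         min_tokens: int = 4, max_age_days: int = 30) -> bool:
    """Drop candidates that are near-empty, URL-only, or stale.
    The thresholds here are illustrative, not the production values."""
    stripped = re.sub(r"https?://\S+", "", text).strip()  # remove URLs
    if len(stripped.split()) < min_tokens:                # too little content
        return False
    age = datetime.now(timezone.utc) - created_at         # recency check
    return age <= timedelta(days=max_age_days)
```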
5. Conclusion

In this paper, we proposed a model for embedding tweets that goes beyond modeling just the tweet text. Our goal has been to develop generically useful, rich representations of tweets that can be used in a variety of downstream predictive models at Twitter. To that end, we have demonstrated through offline evaluation that our proposed model outperforms the benchmark models across various tasks and Twitter products. As next steps, we seek to validate our model using online A/B tests in various product surfaces, which serve as the ultimate litmus test.

References

[1] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, PMLR, 2014, pp. 1188–1196.
[2] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[3] J. Wieting, M. Bansal, K. Gimpel, K. Livescu, Towards universal paraphrastic sentence embeddings, arXiv preprint arXiv:1511.08198 (2015).
[4] S. Vosoughi, P. Vijayaraghavan, D. Roy, Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 1041–1044.
[5] B. Dhingra, Z. Zhou, D. Fitzpatrick, M. Muehl, W. Cohen, Tweet2vec: Character-based distributed representations for social media, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 269–274.
[6] F. Hill, K. Cho, A. Korhonen, Learning distributed representations of sentences from unlabelled data, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1367–1377.
[7] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, in: EMNLP, 2017.
[8] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al., Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).
[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
[10] L. Wang, C. Gao, J. Wei, W. Ma, R. Liu, S. Vosoughi, An empirical survey of unsupervised text representation methods on Twitter data, in: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), 2020, pp. 209–214.
[11] H. Yin, X. Song, S. Yang, G. Huang, J. Li, Representation learning for short text clustering, in: International Conference on Web Information Systems Engineering, Springer, 2021, pp. 321–335.
[12] S. Kayal, Unsupervised sentence-embeddings by manifold approximation and projection, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1–11.
[13] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: EMNLP (1), 2021.
[14] J. Huang, D. Tang, W. Zhong, S. Lu, L. Shou, M. Gong, D. Jiang, N. Duan, WhiteningBERT: An easy unsupervised sentence embedding approach, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 238–244.
[15] D. Liao, Sentence embeddings using supervised contrastive learning, arXiv preprint arXiv:2106.04791 (2021).
[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[18] V. Kulkarni, K. Leung, A. Haghighi, CTM – a model for large-scale multi-view tweet topic classification, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, Association for Computational Linguistics, Hybrid: Seattle, Washington + Online, 2022, pp. 247–258. URL: https://aclanthology.org/2022.naacl-industry.28. doi:10.18653/v1/2022.naacl-industry.28.
[19] W. Xu, C. Callison-Burch, B. Dolan, SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT), in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Association for Computational Linguistics, 2015, pp. 1–11. URL: https://www.aclweb.org/anthology/S15-2001. doi:10.18653/v1/S15-2001.