Twice – Twitter Content Embeddings

Xianjing Liu, Behzad Golshan, Kenny Leung, Aman Saini, Vivek Kulkarni, Ali Mollahosseini and Jeff Mo
Twitter Cortex

DL4SR’22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17–21, 2022, Atlanta, USA

Abstract
In this short paper, we describe a new model for learning content-based tweet embeddings that are generically useful as signals for a variety of down-stream predictive tasks. In contrast to prior approaches that only leverage cues from the raw text, we take a holistic approach and propose Twice, a model for learning tweet embeddings that (a) leverages cues beyond the raw text (including media) and (b) yields representations optimized for overall similarity: a combination of topical, semantic, and engagement similarity. Offline evaluations on both academic benchmarks and Twitter product tasks suggest that our model yields richer and superior embeddings compared to the benchmark models.

Keywords
Deep Learning, Recommendation, Embedding, NLP

1. Introduction

A rich representation of a tweet that captures nuances in meaning is critical for most predictive models at Twitter (including Topics, Health, Recommendations, etc.). Consequently, there is an urgent need for models that can encode or summarize a tweet’s content into a dense representation – a representation that can then be used in various downstream models. Some key surface areas that will use tweet embeddings include Home Timeline, Notifications, Topics, and potentially Health models. An important requirement is that the tweet embedding be generically useful on a variety of predictive tasks, not merely useful for a single a priori task. Finally, we expect downstream models to only consume the embeddings, without having to worry about the inner workings of the underlying model.

Taking a bird’s eye view, the need is simply for a tweet representation/embedding that captures similarity between tweets by embedding them in a vector space. Specifically, tweets that are “similar” must be close in the vector space, and tweets that are not “similar” should ideally be far apart in this vector space with respect to a suitable metric. Zooming in, for practical modeling it is useful to operationalize this vague notion of “similar” and be more specific. We attempt to capture the following notions of similarity: (a) Semantic Similarity – tweets that are similar in meaning should be close in the embedding space. An example pair of semantically similar sentences is “The quick brown fox jumped over the lazy dog” and “The brown fox leaped over the lazy dog”. (b) Topic Similarity – tweets that are about the same topic should be close in the embedding space. An example pair here would be “Don Bradman is the greatest cricket player of all time” and “India won the Cricket World Cup”, since both tweets are about “Cricket”. (c) Engagement Similarity – tweets that share engagement audiences are deemed similar.

In this paper, we present Twice – a model that attempts to capture the above notions of tweet similarity. In contrast to most prior work, which only seeks to embed raw tweet text using standard pre-trained language models, Twice models tweets holistically, leveraging not only the raw tweet text but also incorporating cues from the associated media and hyperlinks. We evaluate Twice on a suite of offline benchmark tasks and demonstrate that our proposed model significantly outperforms several baseline approaches.
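To make “close in the vector space” concrete, the following minimal sketch compares tweet embeddings with cosine similarity, the metric used throughout this paper. The three-dimensional vectors are hand-picked stand-ins for real encoder outputs, purely for illustration.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hand-picked toy embeddings for the example tweets above; a real
# encoder such as Twice would produce these vectors instead.
fox_a   = np.array([0.9, 0.1, 0.3])  # "The quick brown fox jumped over the lazy dog"
fox_b   = np.array([0.8, 0.2, 0.3])  # "The brown fox leaped over the lazy dog"
cricket = np.array([0.1, 0.9, 0.2])  # "India won the Cricket World Cup"

# The semantically similar pair should score higher than the unrelated pair.
assert cosine_similarity(fox_a, fox_b) > cosine_similarity(fox_a, cricket)
```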
2. Related Work

Our work is very closely related to work in the area of learning sentence embeddings, which seeks to learn dense representations of sentences that capture sentence similarity. One of the earliest works on learning dense embeddings of sentences is that of Le and Mikolov [1], which generalized the Skipgram word-embedding models [2] to learn sentence and paragraph embeddings. With the rise of convolutional and recurrent neural network models, several approaches to learning sentence embeddings were proposed [3, 4, 5, 6, 7]. Additionally, a couple of these works sought to learn representations of tweets by applying these networks to Twitter text [5, 4]. Finally, with the introduction of Transformers and pre-trained language models, the current state-of-the-art approaches use pre-trained language models coupled with contrastive loss functions to learn sentence embeddings [8, 9, 10, 11, 12, 13, 14, 15]. All of these approaches only look at embedding generic sentences and are not attuned to embedding tweets, where deeper semantic cues and multi-modal content can be used to obtain rich representations. Our model Twice in turn builds on these works, but also incorporates cues specific to tweets (like media and hyperlinks) to yield rich embeddings of tweets.

3. Twice

[Figure 1: The Twice model is a Bert-based model trained on a multi-task loss function.]

At its core, Twice is a Bert model [16] trained on a multi-task loss function. Figure 1 shows the architecture of the Twice model. In particular, we consider the following three tasks, each attempting to capture one of the notions of similarity noted in Section 1. More specifically, we optimize a standard Bert model on the following tasks:

• Topic Prediction: The task is to predict the concept topics associated with the tweet. This enables the representation to capture topical similarity. In particular, we optimize a binary cross-entropy loss, since this is a multi-label prediction setting where a tweet may be associated with multiple topics. For instance, the tweet “America is heading back to the Moon, folks. No astronauts, but likely to glean loads of data.” belongs to the “Space” and “Science” topics. The total number of concept topics in this task is 419.

• Engagement Prediction: Given a representation of the user (obtained by encoding a user biography) and a tweet, the task is to predict whether the user engages with the tweet. This task is essentially identical to the task in the well-known Clip (Contrastive Language-Image Pre-Training) model [17], except that instead of embedding images using an image encoder, we embed the user biographies using a standard Bert encoder. The loss function used is identical to the one described in [17]. Training on this task enables us to capture tweet similarity based on user engagement patterns, which may be particularly useful when downstream products want to maximize user engagement.

• Language Prediction: Because we desire multilingual support, we would like tweet representations to also encode language cues, so that tweets in the same language tend to be closer than ones in different languages. Therefore, we explicitly train on the task of predicting the language of the tweet and use the standard cross-entropy loss function for this task.

The full loss function is simply the average of the above three loss functions. Finally, to obtain a dense representation of the tweet, we simply use the representation of the [CLS] token.
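The PyTorch sketch below illustrates this multi-task setup, assuming the Hugging Face transformers library. It is a minimal illustration rather than the production implementation: the checkpoint name, the number of language labels, and the learnable temperature are assumptions, while the binary cross-entropy topic head over 419 topics, the Clip-style symmetric contrastive engagement loss [17] against a user-biography encoder, the cross-entropy language head, and the averaging of the three losses follow the description above.

```python
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel

class TwiceSketch(nn.Module):
    """Sketch of the Twice multi-task setup: one Bert encoder for tweets,
    a second one for user biographies (Clip-style), plus three loss heads."""

    def __init__(self, name="bert-base-multilingual-cased",
                 n_topics=419, n_langs=50, dim=768):
        super().__init__()
        # n_langs is an assumed label-set size; the paper does not state it.
        self.tweet_enc = AutoModel.from_pretrained(name)
        self.bio_enc = AutoModel.from_pretrained(name)   # user-biography tower
        self.topic_head = nn.Linear(dim, n_topics)       # multi-label topics
        self.lang_head = nn.Linear(dim, n_langs)         # single-label language
        self.temperature = nn.Parameter(torch.tensor(0.07))  # assumed init

    def cls(self, enc, inputs):
        # The dense tweet representation is the [CLS] token of the last layer.
        return enc(**inputs).last_hidden_state[:, 0]

    def forward(self, tweet_inputs, bio_inputs, topic_labels, lang_labels):
        t = self.cls(self.tweet_enc, tweet_inputs)   # (B, dim)
        b = self.cls(self.bio_enc, bio_inputs)       # (B, dim)

        # (a) Topic prediction: binary cross entropy over 419 concept topics.
        topic_loss = F.binary_cross_entropy_with_logits(
            self.topic_head(t), topic_labels.float())

        # (b) Engagement: Clip-style symmetric contrastive loss between tweet
        # embeddings and the biographies of engaging users (in-batch negatives).
        logits = (F.normalize(t, dim=-1) @ F.normalize(b, dim=-1).T) / self.temperature
        target = torch.arange(t.size(0), device=t.device)
        engage_loss = (F.cross_entropy(logits, target) +
                       F.cross_entropy(logits.T, target)) / 2

        # (c) Language prediction: standard cross entropy.
        lang_loss = F.cross_entropy(self.lang_head(t), lang_labels)

        # The full loss is the average of the three task losses.
        return (topic_loss + engage_loss + lang_loss) / 3
```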
Twice leverages cues from the entire tweet and not just the raw tweet text. In particular, in addition to the raw text, we also leverage media cues by obtaining media annotations for any associated media, as described in [18]. These media annotations are simply concatenated to the raw text via a separator token before being input to the model. Similarly, when a tweet has hyperlinks, we extract the first 100 tokens of the webpage title and description as encoded in the linked HTML page. These features are also appended to the input.
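A minimal sketch of this input assembly, assuming a Bert tokenizer, is shown below. The paper does not pin down the exact separator or field order, so the use of the [SEP] token and the text → media → link ordering here are assumptions; the 100-token cap on the linked page’s title and description follows the text above.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def build_twice_input(text: str, media_annotations=None,
                      link_title_desc=None, max_link_tokens=100) -> str:
    """Assemble the holistic model input: raw tweet text, then media
    annotations, then up to 100 tokens of the linked page's title and
    description, joined by separator tokens."""
    parts = [text]
    if media_annotations:
        parts.append(" ".join(media_annotations))
    if link_title_desc:
        tokens = tokenizer.tokenize(link_title_desc)[:max_link_tokens]
        parts.append(tokenizer.convert_tokens_to_string(tokens))
    return f" {tokenizer.sep_token} ".join(parts)

# Hypothetical example: the media annotations would come from the media
# annotation model of [18], the link text from the tweet's hyperlink.
example = build_twice_input(
    "America is heading back to the Moon, folks.",
    media_annotations=["rocket launch", "launch pad"],
    link_title_desc="Artemis I: an uncrewed flight test around the Moon")
```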
Training procedure. Twice is trained on a dataset of 200 million tweets sampled over a 90-day interval. We also associate with these tweets the users who engaged with them (which the Clip task component requires). The model was optimized using standard Adam with weight decay and trained for 5 epochs until convergence.

4. Experiments

We evaluate Twice both quantitatively and qualitatively; we discuss each evaluation below.

4.1. Quantitative Evaluation

Setup. We quantify the effectiveness of tweet embeddings in capturing content similarity by measuring their performance on three benchmark tasks – tasks that reflect how well the embeddings capture the notions of similarity noted in Section 1:

• SemEvalPIT [19]. This is an academic benchmark consisting of about 1000 pairs of tweets with similarity scores obtained from human judgments. We measure the performance of embeddings on this task by computing the Spearman correlation between the similarity scores obtained for these tweet pairs in embedding space and the human judgments. Higher correlations reflect better alignment with human-derived similarity judgments.

• Recalling Favorites (Favs). In order to measure the effectiveness of these embeddings in downstream predictive models of engagement, we consider the task of recalling (based on just a top-k nearest neighbor lookup) which tweets a given user favorites from more than 5k candidate tweets, given their past engagement history. Higher scores reflect better embeddings.

• Topic Assignment Precision (Topics). We compute the precision of topic assignments on a test set of topical tweets. Higher precision suggests better encoding of topical similarity. Once again, here we base our decisions only on a k-NN classifier.

Our rationale for restricting ourselves to a very simple nearest neighbor approach based on the cosine similarity of tweet embeddings is the intuition that higher-quality representations should inherently demonstrate a higher degree of “ease of extraction” of the predictive information. It is precisely for this reason that one needs to use simple models as opposed to very complex deep predictive models. We made a design choice to use a NN-based approach, which supported quick implementation, but simple shallow models are another alternative. Finally, we summarize model performance by reporting the harmonic mean over tasks.
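The sketch below illustrates this evaluation protocol: a brute-force cosine k-NN lookup (a stand-in for whatever nearest-neighbor index is used in practice), the Spearman correlation used for SemEvalPIT, and the harmonic-mean summary. As a sanity check, the final line reproduces the Mean (HM) entry of the Twice row of Table 1 from its three per-task scores.

```python
import numpy as np
from scipy.stats import spearmanr

def normalize(X: np.ndarray) -> np.ndarray:
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def knn_indices(queries: np.ndarray, candidates: np.ndarray, k: int = 10):
    """Top-k candidate indices by cosine similarity (brute force)."""
    sims = normalize(queries) @ normalize(candidates).T
    return np.argsort(-sims, axis=1)[:, :k]

def semeval_pit_score(emb_a, emb_b, human_scores):
    """Spearman correlation between embedding cosine similarity and
    human similarity judgments over a set of tweet pairs."""
    sims = np.sum(normalize(emb_a) * normalize(emb_b), axis=1)
    return spearmanr(sims, human_scores).correlation

def harmonic_mean(scores) -> float:
    scores = np.asarray(scores, dtype=float)
    return len(scores) / np.sum(1.0 / scores)

# Summarizing the Twice row of Table 1 (SemEvalPIT, Favs, Topics):
print(round(harmonic_mean([0.302, 0.102, 0.429]), 3))  # 0.194
```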
Baselines. We consider the following baseline models:

• Bert: This is the standard pre-trained Bert model [16] and serves as the simplest but strong baseline that one could use to embed tweets.

• Simcse: Simcse [13] is a state-of-the-art sentence embedding approach that learns sentence embeddings in an unsupervised manner. The main idea of Simcse is to pass a tweet X twice through Bert (with dropout enabled). This yields two different (noisy) representations of X, which are treated as a positive pair; X and all other examples in the batch are treated as negative examples. The objective is to maximize the cosine similarity of the representations of the positive pair and to minimize it between X and the negative examples (see the sketch after this list). It is to be noted here that we train Simcse on Twitter data.

• Hashspace: Hashspace is a Bert-based model trained on the task of hashtag prediction, where the model is optimized to predict the correct hashtag associated with a tweet from a set of 100K hashtags. Hashspace, as currently deployed at Twitter, only uses the raw tweet text as a cue to make predictions and does not model tweets holistically.

• Topicspace: Topicspace addresses two main limitations of Hashspace. First, in contrast to Hashspace, we model tweets holistically and leverage cues from media and hyperlinks as well. Second, we simplify the predictive task. Instead of learning to predict a label from a universe of 100K labels (hashtags), we only learn to predict one or more topics from a space of 419 concept topics. The intuition is that, to capture similarity, it is sufficient to capture fairly broad topics rather than to seek to capture extremely fine-grained hashtags. By making these two changes, we can learn a model with a better fit to the data, yielding richer representations.

• Clip: The original Clip model [17] is a neural network trained on 400 million (image, text) pairs. Our Clip model is identical to the original Clip model in architecture and is trained using a multi-modal method in which the model attempts to predict whether a piece of media and a piece of text come from the same tweet or not. In this setting we replace the image encoder in the original Clip model with a user-biography encoder.
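For concreteness, here is a minimal sketch of the unsupervised Simcse objective described in the second bullet above, shown with a generic public Bert checkpoint (in our experiments the model is instead trained on Twitter data); the temperature value follows the Simcse paper’s default [13].

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model.train()  # keep dropout ON: two passes give two noisy views

def simcse_loss(tweets, temperature=0.05):
    """Unsupervised Simcse: encode each tweet twice with dropout; the two
    views form a positive pair, all other in-batch tweets are negatives."""
    inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    z1 = model(**inputs).last_hidden_state[:, 0]  # first pass ([CLS])
    z2 = model(**inputs).last_hidden_state[:, 0]  # second pass, new dropout mask
    # (B, B) matrix of scaled cosine similarities; diagonal = positive pairs.
    sims = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(z1.size(0))
    return F.cross_entropy(sims, labels)

loss = simcse_loss(["good morning twitter", "India won the Cricket World Cup"])
loss.backward()
```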
Results. Table 1 shows the results of our evaluation on our benchmark suite. Based on these results we can draw the following conclusions: (a) First, observe that just using standard models like Bert to embed tweets does not yield superior embeddings; it is imperative to learn embeddings from Twitter data. (b) Second, state-of-the-art unsupervised methods for sentence embedding perform worse than supervised methods, which is in line with prior work on sentence embeddings. (c) Topicspace significantly outperforms Hashspace overall. This is because Topicspace leverages cues beyond the tweet text and also uses a simpler but more intuitive task. (d) Models like Topicspace and Clip, which optimize solely for a specific notion of similarity, tend to perform significantly better on the corresponding evaluation tasks than other models, simply because their underlying representations are optimized to capture that specific similarity notion over others. (e) Finally, note that our proposed model Twice generally outperforms all of these baseline approaches overall. While the mean performance of Twice and Clip is identical, Twice significantly outperforms Clip on both the SemEvalPIT and Topics tasks, with a slight drop on the Favs task.

             SemEvalPIT   Favs    Topics   Mean (HM)
Bert            0.025     0.043   0.063      0.038
Simcse          0.218     0.022   0.205      0.055
Hashspace       0.336     0.064   0.230      0.131
Topicspace      0.264     0.075   0.599      0.160
Clip            0.225     0.136   0.271      0.194
Twice           0.302     0.102   0.429      0.194

Table 1: Quantitative performance of various tweet embedding approaches on our benchmark suite. We report Spearman correlation for SemEvalPIT, top-k recall for Favs, and precision for Topics. Note that our model Twice outperforms most standard baselines, including the currently deployed Hashspace.

To summarize, Twice demonstrates superior performance over prior production models and yields improved embeddings of tweets by leveraging cues beyond the tweet text.

4.2. Quantitative Evaluation and Analysis of Usage in Health Products

Setup. We evaluate the performance of the content embeddings on our health platform. We use Twice embeddings as features in a shallow model to predict Spam and Terms of Service (ToS) violations. Spam prediction is a binary classification task: predict whether a tweet is spam or not. ToS violation prediction is a multi-label classification task with 8 labels in total; examples of ToS violation labels include ‘violence’, ‘threatening someone’, and ‘suicide’.

Results. Table 2 shows the average precision scores obtained when using the Twice embeddings for Spam detection and ToS violation classification. We compare the results with the benchmark models Bert and Hashspace. The results show that Twice outperforms both the standard Bert and Hashspace models on both tasks.

             Spam    ToS
Bert         0.41    0.328
Hashspace    0.44    0.286
Twice        0.47    0.347

Table 2: The average precision score of using Twice embeddings for the Spam detection and ToS violation classification tasks.

4.3. Qualitative Evaluation and Analysis of Usage in Content Recommenders

In order to evaluate our model qualitatively, we also built a web page where one can enter a tweet ID and see the nearest neighbors to the given tweet within a predetermined universe of tweets – a scenario that reflects the usage of embeddings for candidate generation in content recommenders. Figure 2 shows the nearest neighbors for a couple of seed tweets as a demonstration. Note that the nearest neighbors reflect the broad topic of the seed tweet and are similar in content to it, suggesting that our model is able to capture content similarity between tweets.

[Figure 2: Nearest neighbors to a couple of seed tweets in the Twice embedding space.]

While the process outlined in this section can be used for candidate generation in content recommenders (by finding tweets similar to a user’s interests and past engagements), through a qualitative analysis conducted offline we have identified a list of challenges that need to be addressed in order to ensure that good-quality candidates are returned:

• Seed Selection. Using tweets with which users have positively engaged as seeds to fetch more interesting content is a natural choice. However, some seed tweets may have very little content, which makes them unsuitable for retrieving candidate tweets. Examples include tweets that contain frequent phrases like “good morning”, everyday greetings, and daily life updates.

• Content Candidate Quality. Some tweets may have very little to no content and are non-topical. This includes tweets that may only contain single emojis, may be very short in length, or consist only of a shortened URL. Such candidates should not be part of the universe. This is by far the most pervasive and immediate challenge to be addressed.

• Health Considerations. Some tweets can be spam, violent, etc. Recommending such content is problematic and needs to be addressed before tweet representations can be used for content recommendations.

• Recency. Some tweets returned may be irrelevant (or at least non-engaging) simply because they are old and outdated. Candidate generators need to adapt their responses to the recency requirements of the product.

Note that these challenges are independent of the underlying tweet representation itself and may significantly hinder the quality of candidates even if the tweet representation model is of superior quality. To address these challenges, we build various filters and apply them to the candidate pool, and we also carefully design and pick desirable seed tweets so that high-quality candidates are generated for the user. A sketch of such filters follows.
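The following sketch illustrates the kind of filters involved, covering seed selection, candidate content quality, and recency. The phrase list and thresholds are illustrative assumptions rather than production values, and health filtering (e.g., using the spam model of Section 4.2) would run as an additional stage.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumed phrase list; the production list of low-content seed phrases
# is not specified in the paper.
GREETINGS = {"good morning", "good night", "gm", "gn"}

def is_usable_seed(text: str) -> bool:
    """Reject low-content seeds such as everyday greetings."""
    return text.strip().lower() not in GREETINGS

def is_quality_candidate(text: str, created_at: datetime,
                         min_tokens: int = 4, max_age_days: int = 30) -> bool:
    """Drop candidates that are near-empty, URL-only, or stale.
    The thresholds here are illustrative, not the production values."""
    stripped = re.sub(r"https?://\S+", "", text).strip()  # remove URLs
    if len(stripped.split()) < min_tokens:                # too little content
        return False
    age = datetime.now(timezone.utc) - created_at         # recency check
    return age <= timedelta(days=max_age_days)
```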
5. Conclusion

In this paper, we proposed a model for embedding tweets that goes beyond modeling just the tweet text. Our goal has been to develop generically useful, rich representations of tweets that can be used in a variety of downstream predictive models at Twitter. To that end, we have demonstrated through offline evaluation that our proposed model outperforms the benchmark models across various tasks and Twitter products. As next steps, we seek to validate our model using online A/B tests in various product surfaces, which serve as the ultimate litmus test.

References

[1] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International Conference on Machine Learning, PMLR, 2014, pp. 1188–1196.
[2] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[3] J. Wieting, M. Bansal, K. Gimpel, K. Livescu, Towards universal paraphrastic sentence embeddings, arXiv preprint arXiv:1511.08198 (2015).
[4] S. Vosoughi, P. Vijayaraghavan, D. Roy, Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder, in: Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2016, pp. 1041–1044.
[5] B. Dhingra, Z. Zhou, D. Fitzpatrick, M. Muehl, W. Cohen, Tweet2vec: Character-based distributed representations for social media, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016, pp. 269–274.
[6] F. Hill, K. Cho, A. Korhonen, Learning distributed representations of sentences from unlabelled data, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1367–1377.
[7] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, in: EMNLP, 2017.
[8] D. Cer, Y. Yang, S.-y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, et al., Universal sentence encoder, arXiv preprint arXiv:1803.11175 (2018).
[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3982–3992.
[10] L. Wang, C. Gao, J. Wei, W. Ma, R. Liu, S. Vosoughi, An empirical survey of unsupervised text representation methods on Twitter data, in: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), 2020, pp. 209–214.
[11] H. Yin, X. Song, S. Yang, G. Huang, J. Li, Representation learning for short text clustering, in: International Conference on Web Information Systems Engineering, Springer, 2021, pp. 321–335.
[12] S. Kayal, Unsupervised sentence-embeddings by manifold approximation and projection, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1–11.
[13] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: EMNLP (1), 2021.
[14] J. Huang, D. Tang, W. Zhong, S. Lu, L. Shou, M. Gong, D. Jiang, N. Duan, WhiteningBERT: An easy unsupervised sentence embedding approach, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 238–244.
[15] D. Liao, Sentence embeddings using supervised contrastive learning, arXiv preprint arXiv:2106.04791 (2021).
[16] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[17] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[18] V. Kulkarni, K. Leung, A. Haghighi, CTM – a model for large-scale multi-view tweet topic classification, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track, Association for Computational Linguistics, Hybrid: Seattle, Washington + Online, 2022, pp. 247–258. URL: https://aclanthology.org/2022.naacl-industry.28. doi:10.18653/v1/2022.naacl-industry.28.
[19] W. Xu, C. Callison-Burch, B. Dolan, SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT), in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), Association for Computational Linguistics, 2015, pp. 1–11. URL: https://www.aclweb.org/anthology/S15-2001. doi:10.18653/v1/S15-2001.