<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Twice - Twitter Content Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xianjing Liu</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Behzad Golshan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kenny Leung</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aman Saini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivek Kulkarni</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ali Mollahosseini</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jef Mo</string-name>
        </contrib>
      </contrib-group>
      <abstract>
<p>In this short paper, we describe a new model for learning content-based tweet embeddings that are generically useful as signals for a variety of downstream predictive tasks. In contrast to prior approaches that only leverage cues from the raw text, we take a holistic approach and propose Twice, a model for learning tweet embeddings that (a) leverages cues beyond the raw text (including media) and (b) attempts to yield representations optimized for overall similarity, a combination of topical, semantic, and engagement similarity. Offline evaluations suggest that our model yields richer and superior embeddings compared to the benchmark models on tasks drawn from both academic datasets and Twitter products.</p>
      </abstract>
      <kwd-group>
        <kwd>Deep Learning</kwd>
        <kwd>Recommendation</kwd>
        <kwd>Embedding</kwd>
        <kwd>NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>Similarity – tweets which are similar in meaning should</title>
        <p>be close in the embedding space. An example pair of
A rich representation of a tweet that captures nuances semantically similar sentences is “The quick brown fox
in meaning is critical for most predictive models at Twit- jumped over the lazy dog” and “The brown fox leaped
ter (including Topics, Health, Recommendations etc.). over the lazy dog”. (b) Topic Similarity – tweets that are
Consequently, there is an urgent need for models that about the same topic should be close in the embedding
can encode or summarize a tweet’s content into a dense space. An example pair here would be “Don Bradman is
representation – a representation that can then be used the greatest cricket player of all time” and “India won the
in various downstream models. Some key surface areas Cricket World Cup” since both tweets are about “Cricket”.
which will be using tweet embeddings include Home (c) Engagement Similarity – tweets that share engagement
Timeline, Notifications, Topics, and potentially Health audiences are deemed similar.
models. An important requirement is that the tweet em- In this paper, we present Twice – a model that
atbedding be generically useful on a variety of predictive tempts to capture the above notions of tweet similarity.
tasks and not necessarily be useful for only a specific In contrast to most prior work which only seeks to
emapriori task. Finally, we expect downstream models to bed raw tweet text using standard pre-trained language
only consume the embeddings, without having to worry models, Twice models tweets holistically leveraging not
about the inner workings of the underlying model. only raw tweet text, but also incorporating cues from the</p>
        <p>Taking a bird’s eye view, the need is simply for a associated media, and hyperlinks. We evaluate Twice on
tweet representation/embedding that captures similarity a suite of ofline benchmark tasks and demonstrate that
between tweets by embedding them in a vector space. our proposed model significantly outperforms several
Specifically, tweets that are “similar” must be close in the baseline approaches.
vector space and tweets that are not “similar” should
ideally be far in this vector space with respect to a suitable
metric. Zooming in, for practical modeling it is useful to 2. Related Work
attempt to operationalize this vague notion of “similar”
and attempt to be more specific here. We can attempt to
capture the following notions of similarity: (a) Semantic</p>
      </sec>
      <sec id="sec-1-2">
        <title>Our work is very closely related to work in the area</title>
      <p>Our work is very closely related to work in the area of learning sentence embeddings, which seeks to learn dense representations of sentences and capture sentence similarity. One of the earliest works on learning dense embeddings of sentences is that of Le and Mikolov [<xref ref-type="bibr" rid="ref1">1</xref>], which generalized the Skipgram word-embedding models [<xref ref-type="bibr" rid="ref2">2</xref>] to learn sentence embeddings and paragraph embeddings. With the rise of convolutional and recurrent neural network models, several approaches to learn sentence embeddings were proposed [<xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3, 4, 5, 6, 7</xref>]. Additionally, a couple of these works sought to learn representations of tweets by applying these networks to Twitter text [<xref ref-type="bibr" rid="ref4 ref5">5, 4</xref>]. Finally, with the introduction of Transformers and pre-trained language models, the current state-of-the-art approaches use pre-trained language models coupled with contrastive loss functions to learn sentence embeddings [<xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref8 ref9">8, 9, 10, 11, 12, 13, 14, 15</xref>]. All of these approaches only look at embedding generic sentences and are not attuned to embedding tweets, where deeper semantic cues and multi-modal content can be used to obtain rich representations. Our model Twice, in turn, builds on these works but also incorporates cues specific to tweets (like media and hyperlinks) to yield rich embeddings of tweets.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Twice</title>
      <p>
        At its core, Twice is a Bert model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] trained on a
multi-task loss function. Figure 1 shows the architecture
of the Twice model. In particular, we consider the
following three tasks, each attempting to capture notions of
similarity as noted in Section 1. More specifically, we
optimize a standard Bert model on the following tasks:
• Topic Prediction: The task is to predict the
concept topics associated with the tweet. This
enables the representation to capture topical
similarity. In particular, we optimize binary cross
entropy loss since this is a multi-label prediction
setting where a tweet may be associated with
multiple topics. For instance, the tweet "America
is heading back to the Moon, folks. No
astronauts, but likely to glean loads of data." belongs
to "Space" and "Science" topics. The total number
of concept topics in this task is 419.
• Engagement prediction: Given a
representation of the user (obtained by encoding a user
biography) and a tweet, the task is to predict if
the user engages with the tweet. This task is
essentially identical to the task in the well-known
Clip (Contrastive Language-Image Pre-Training)
model [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] except that instead of embedding
images using an image encoder, we embed the user
biographies using a standard Bert encoder. The
loss function used is identical to the one described
in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Training on this task enables us to
capture tweet similarity based on user engagement
patterns, which is particularly useful when downstream
products seek to maximize user engagement.
• Language prediction: Because we desire
multilingual support, we would like tweet
representations to also encode language cues, so that tweets
of the same language tend to be closer than ones
from different languages. Therefore, we
explicitly train on the task of predicting the language
of the tweet and use the standard cross-entropy
loss function for this task.
      </p>
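      <p>For concreteness, the sketch below illustrates how these three task losses could be averaged into the full objective. It is a minimal illustration that assumes pre-computed task logits and user/tweet embeddings from the Bert encoders; the tensor names and the temperature value are our own assumptions rather than the production configuration.</p>
      <preformat>
# Minimal sketch of the Twice multi-task objective: binary cross-entropy for
# multi-label topic prediction, a Clip-style symmetric contrastive loss for
# engagement prediction, and cross-entropy for language prediction.
# Shapes, names, and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def twice_loss(topic_logits, topic_labels,
               tweet_emb, user_emb,
               lang_logits, lang_labels,
               temperature=0.07):
    # Topic prediction: multi-label, so BCE over the 419 concept-topic logits.
    topic_loss = F.binary_cross_entropy_with_logits(topic_logits, topic_labels)

    # Engagement prediction: Clip-style loss over a batch of (user bio, tweet)
    # pairs; the i-th user engaged with the i-th tweet, all others are negatives.
    tweet_emb = F.normalize(tweet_emb, dim=-1)
    user_emb = F.normalize(user_emb, dim=-1)
    logits = tweet_emb @ user_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    engagement_loss = 0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.t(), targets))

    # Language prediction: standard multi-class cross-entropy.
    lang_loss = F.cross_entropy(lang_logits, lang_labels)

    # The full loss is simply the average of the three task losses.
    return (topic_loss + engagement_loss + lang_loss) / 3.0
</preformat>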
      <p>
        The full loss function is simply the average of the above
three loss functions. Finally, to obtain a dense
representation of the tweet, we simply use the representation of
the [CLS] token. Twice leverages cues from the entire
tweet and not just the raw tweet text. In particular, in
addition to the raw text, we also leverage media cues by
obtaining media annotations for any associated media as
described in [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. These media annotations are simply
concatenated to the raw text via a separator token before
being input to the model. Similarly, when a tweet has
hyperlinks, we extract the first 100 tokens of the
webpage title and description as encoded in the linked HTML
page. These features are also appended to the input.
Training procedure. Twice is trained on a dataset of
200 million tweets sampled over a 90 day interval. We
also associate with these tweets the users who engaged
with them (which the Clip task component requires). The
model was optimized using standard Adam with weight
decay as the optimization procedure and trained for 5
epochs until convergence.
      </p>
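      <p>The following is a minimal sketch of this input-construction and [CLS]-pooling step, assuming a HuggingFace Bert tokenizer and encoder; the checkpoint name, the use of the separator token for concatenation, and the helper names are illustrative assumptions rather than the exact production pipeline.</p>
      <preformat>
# Sketch of Twice input construction: tweet text, media annotations, and the
# first 100 tokens of the linked page's title/description are joined with
# separator tokens, and the [CLS] vector is used as the tweet embedding.
# Checkpoint, tokenizer, and helper names are illustrative assumptions.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = BertModel.from_pretrained("bert-base-multilingual-cased")

def build_input(tweet_text, media_annotations, link_title_and_description):
    # Keep only the first 100 tokens of the webpage title/description.
    link_tokens = tokenizer.tokenize(link_title_and_description)[:100]
    link_text = tokenizer.convert_tokens_to_string(link_tokens)
    # Concatenate the cues with separator tokens.
    parts = [tweet_text, " ".join(media_annotations), link_text]
    return f" {tokenizer.sep_token} ".join(p for p in parts if p)

def embed_tweet(text):
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = encoder(**batch)
    # Dense tweet representation = hidden state of the [CLS] token.
    return out.last_hidden_state[:, 0, :]

example = build_input(
    "America is heading back to the Moon, folks.",
    ["moon", "rocket launch"],
    "Example page title - a short description of the linked article",
)
embedding = embed_tweet(example)
</preformat>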
    </sec>
    <sec id="sec-3">
      <title>4. Experiments</title>
      <sec id="sec-3-1">
        <title>We evaluate Twice both quantitatively and qualitatively each of which we discuss below.</title>
        <sec id="sec-3-1-1">
          <title>4.1. Quantitative Evaluation</title>
          <p>
Setup. We quantify the effectiveness of tweet
embeddings in capturing content similarity by measuring their
performance on three benchmark tasks – tasks that
reflect how well embeddings capture notions of similarity
noted in Section 1:
• SemEvalPIT [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]. This is an
academic benchmark consisting of about 1000
pairs of tweets with similarity scores obtained
by human judgments. We measure the
performance of embeddings on this task by computing
the Spearman correlation of similarity scores
obtained for these tweet pairs in embedding space
with human judgments. Higher correlations
indicate better alignment with human-derived similarity judgments.
• Recalling Favorites (Favs). In order to
measure the effectiveness of these embeddings in
downstream predictive models of engagement,
we consider the task of recalling (based on just
a top-k nearest-neighbor lookup) which tweets a
given user favorites from more than 5k candidate
tweets, given their past engagement history.
Higher scores reflect better embeddings.
• Topic Assignment Precision (Topics). We
compute the precision of topic assignments on a
test set of topical tweets. Higher precision suggests
better encoding of topical similarity. Once again,
here we base our decisions only on a k-NN classifier.
          </p>
          <p>Our rationale for restricting ourselves to a very simple
nearest-neighbor approach based on the cosine similarity
of tweet embeddings is the intuition that higher-quality
representations would inherently demonstrate a higher
degree of “ease of extraction” of the predictive
information. It is precisely for this reason that one needs
to use simple models as opposed to very complex deep
predictive models. We made a design choice to use a
k-NN-based approach, which supported quick
implementation; simple shallow models are another alternative.
Finally, we summarize model performance by reporting
the harmonic mean over tasks.</p>
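          <p>As a concrete illustration of this protocol, the sketch below computes the Spearman correlation of cosine similarities against human judgments (as in SemEvalPIT), a k-NN based topic-assignment precision, and the harmonic-mean summary. The function signatures and the default value of k are assumptions made for the example.</p>
          <preformat>
# Sketch of the embedding-only evaluation protocol: cosine similarity in the
# embedding space, Spearman correlation against human similarity judgments,
# k-NN topic assignment, and a harmonic-mean summary over tasks.
# Signatures and the default k are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import KNeighborsClassifier

def semeval_correlation(emb_a, emb_b, human_scores):
    # Cosine similarity of each tweet pair vs. the human similarity score.
    sims = np.sum(emb_a * emb_b, axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
    return spearmanr(sims, human_scores).correlation

def topic_precision(train_emb, train_topics, test_emb, test_topics, k=5):
    # Assign each test tweet the topic voted by its k nearest neighbors and
    # report the fraction of correct assignments.
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_emb, train_topics)
    return float(np.mean(knn.predict(test_emb) == test_topics))

def harmonic_mean(scores):
    scores = np.asarray(scores, dtype=float)
    return len(scores) / np.sum(1.0 / scores)
</preformat>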
          <p>Baselines.</p>
          <p>
            We consider the following baseline models:
• Bert: This is just the standard pre-trained Bert model [<xref ref-type="bibr" rid="ref16">16</xref>] and serves as the simplest but strong baseline that one could use to embed tweets.
• Simcse: Simcse [<xref ref-type="bibr" rid="ref13">13</xref>] is a state-of-the-art sentence embedding approach that learns sentence embeddings in an unsupervised manner. The main idea of Simcse is to pass a tweet through Bert twice (with dropout enabled), which yields two different (noisy) representations of the same tweet; these two views are treated as a positive pair, and all other examples in the batch are treated as negative examples. The objective is to maximize the cosine similarity between the representations of the positive pair and to minimize it between the tweet and the negative examples. Note that we train Simcse on Twitter data.
• Hashspace: Hashspace is a Bert-based model trained on the task of hashtag prediction, where the model is optimized to predict the correct hashtag associated with a tweet from a set of 100K hashtags. Hashspace, as currently deployed at Twitter, only uses the raw tweet text as cues to make predictions and does not model tweets holistically.
• Topicspace: Topicspace addresses two main limitations of Hashspace. First, in contrast to Hashspace, we model tweets holistically and leverage cues from media and hyperlinks as well. Second, we simplify the predictive task. Instead of learning to predict a label from a universe of 100K labels (hashtags), we only learn to predict one or more topics from a space of 419 concept topics. The intuition is that to capture similarity, it is sufficient to capture fairly broad topics rather than extremely fine-grained hashtags. By making these two changes, we note that we can learn a model with a better fit to the data, yielding richer representations.
• Clip: The original Clip in [<xref ref-type="bibr" rid="ref17">17</xref>] is a neural network trained on 400 million (image, text) pairs. Our Clip model is identical to the original Clip model in architecture, but it is trained using a multi-modal method in which Bert encoders embed the tweet text and the user biography (the engagement prediction task described in Section 3).
          </p>
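          <p>For concreteness, a minimal sketch of the Simcse-style unsupervised objective described above follows: each tweet is encoded twice with dropout active, the two views form a positive pair, and the other in-batch examples act as negatives. The encoder interface and the temperature are illustrative assumptions.</p>
          <preformat>
# Sketch of the Simcse-style unsupervised contrastive objective: encode each
# tweet twice with dropout enabled, treat the two views as a positive pair,
# and treat all other in-batch examples as negatives.
# Encoder interface and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def simcse_loss(encoder, input_ids, attention_mask, temperature=0.05):
    encoder.train()  # keep dropout active so the two passes differ
    z1 = encoder(input_ids=input_ids,
                 attention_mask=attention_mask).last_hidden_state[:, 0]
    z2 = encoder(input_ids=input_ids,
                 attention_mask=attention_mask).last_hidden_state[:, 0]
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    # Positives on the diagonal, in-batch negatives elsewhere.
    sim = z1 @ z2.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, targets)
</preformat>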
      <sec id="sec-3-2">
        <title>Results. Table 2 shows the average precision score of</title>
        <p>Results. Table 1 shows the results of our evaluation on using the Twice embeddings on the tasks of Spam
deour benchmark suite. Based on these results we can make tection and ToS violation classification. We compare the
the following conclusions: (a) First, observe that just us- result with the benchmark models Bert and Hashspace.
ing standard models like Bert for embedding tweets does The results show that Twice outperforms both the
stannot yield superior embeddings. It is imperative to learn dard Bert and Hashspace models in both the Spam
deembeddings from Twitter data. (b) Second, state-of-the- tection and ToS violation tasks.
art unsupervised methods for sentence embedding
perform worse than supervised methods which is inline with 4.3. Qualitative Evaluation and Analysis
prior work on sentence embeddings as well. (c) Topic- of Usage in Content Recommenders
space significantly outperforms Hashspace overall. This
is because Topicspace leverages cues from beyond tweet In order to evaluate our model qualitatively, we also built
text and also uses a simpler but more intuitive task. (d) a web page where one can enter a tweet ID and see the
Models like Topicspace and Clip which solely optimize nearest neighbors to the given tweet from a given
prefor a specific notion of similarity tend to perform signifi- determined universe of tweets – a scenario that reflects
cantly better on the corresponding evaluation tasks than the usage of embeddings for candidate generation in
conother models simply because the underlying represen- tent recommenders. Figure 2 shows the nearest neighbors
tations are optimized to capture that specific similarity for a couple of seed tweets as a demonstration. Note that
notion over others. (e) Finally, note that our proposed the nearest neighbors reflect the broad topic of the seed
model Twice generally outperforms all of these base- tweet and are similar in content to the seed tweet
suggestline approaches overall. While the mean performance of ing that our model is able to capture content similarity
Twice and Clip are identical, note that Twice outper- between tweets.
forms Clip on both SemEvalPIT and the Topics tasks While the process outlined in this section can be used
significantly with a slight drop on the Favs task. for candidate generation in content recommenders (by</p>
        <p>To summarize, all in all Twice demonstrates superior ifnding tweets similar to a user’s interests and past
enperformance over prior production models and yields im- gagements), through a qualitative analysis conducted
proved embeddings of tweets by leveraging cues beyond ofline, we have identified a list of challenges that need to
the tweet text. be addressed in-order to ensure good quality candidates
are returned.</p>
        <sec id="sec-3-2-1">
          <title>4.2. Quantitative Evaluation and Analysis of Usage in Health Products</title>
        <p>Setup. We evaluate the performance of content embeddings on our health platform. We use Twice embeddings as features in a shallow model to predict Spam and Terms of Service (ToS) violations. The spam prediction task is a binary classification task to predict whether a tweet is spam.</p>
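        <p>A minimal sketch of this setup follows: frozen tweet embeddings are the only features of a shallow binary classifier, evaluated with average precision. The choice of logistic regression and the function names are assumptions for illustration, not the exact health-platform model.</p>
        <preformat>
# Sketch of the health-product setup: frozen tweet embeddings are the only
# features fed to a shallow binary classifier (here, logistic regression)
# for spam / ToS-violation prediction, scored by average precision.
# The classifier choice and function names are illustrative assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def train_shallow_classifier(train_embeddings, train_labels):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_embeddings, train_labels)
    return clf

def evaluate(clf, test_embeddings, test_labels):
    scores = clf.predict_proba(test_embeddings)[:, 1]
    # Average precision, as reported for the Spam and ToS tasks.
    return average_precision_score(test_labels, scores)
</preformat>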
        <p>Results. Table 2 shows the average precision score of using the Twice embeddings on the tasks of Spam detection and ToS violation classification. We compare the results with the benchmark models Bert and Hashspace. The results show that Twice outperforms both the standard Bert and Hashspace models on both the Spam detection and ToS violation tasks.</p>
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Qualitative Evaluation and Analysis of Usage in Content Recommenders</title>
        <p>In order to evaluate our model qualitatively, we also built a web page where one can enter a tweet ID and see the nearest neighbors to the given tweet from a given predetermined universe of tweets – a scenario that reflects the usage of embeddings for candidate generation in content recommenders. Figure 2 shows the nearest neighbors for a couple of seed tweets as a demonstration. Note that the nearest neighbors reflect the broad topic of the seed tweet and are similar in content to the seed tweet, suggesting that our model is able to capture content similarity between tweets.</p>
        <p>While the process outlined in this section can be used for candidate generation in content recommenders (by finding tweets similar to a user’s interests and past engagements), through a qualitative analysis conducted offline we have identified a list of challenges that need to be addressed in order to ensure good-quality candidates are returned.</p>
        <p>• Seed Selection. Using tweets with which users have positively engaged as seeds to fetch more interesting content is a natural choice. However, some seed tweets may have very little content, which makes them unsuitable for retrieving candidate tweets. Examples include tweets that contain frequent phrases like “good morning”, everyday greetings, and daily life updates.</p>
        <p>Note that these challenges are independent of the underlying tweet representation itself and may significantly hinder the quality of candidates even if the tweet representation model is of superior quality. To address these challenges, we build various filters and apply them to the candidate pool; we also carefully design and pick desirable seed tweets in order to generate high-quality candidates for the user.</p>
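        <p>As an illustration of how seed selection and filtering could sit in front of the nearest-neighbor lookup, the sketch below drops low-content seeds (e.g., short everyday greetings) before retrieving candidates; the thresholds and the phrase list are assumptions, not the deployed filters.</p>
        <preformat>
# Sketch of embedding-based candidate generation with a simple seed filter:
# low-content seeds (short greetings, everyday phrases) are dropped before
# the nearest-neighbor lookup. Thresholds and phrases are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

GREETING_PHRASES = ("good morning", "good night", "happy friday")

def is_good_seed(tweet_text, min_tokens=5):
    text = tweet_text.lower().strip()
    if any(text.startswith(p) for p in GREETING_PHRASES):
        return False
    return len(text.split()) >= min_tokens

def generate_candidates(seed_text, seed_embedding, corpus_embeddings, k=10):
    # Skip seeds with too little content; otherwise return the ids of the
    # k nearest tweets in embedding space.
    if not is_good_seed(seed_text):
        return []
    index = NearestNeighbors(n_neighbors=k, metric="cosine")
    index.fit(corpus_embeddings)
    _, neighbor_ids = index.kneighbors(np.asarray(seed_embedding).reshape(1, -1))
    return neighbor_ids[0].tolist()
</preformat>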
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <p>In this paper, we proposed a model for embedding tweets
that goes beyond just modeling tweet text. Our goal
has been to develop generically useful rich
representations of tweets that can be used in a variety of
downstream predictive models at Twitter. To that end, we
have demonstrated through offline evaluation that our
proposed model outperforms the benchmark models on
various tweet products. As next steps, we seek to
validate our model using online A/B tests in various product
surfaces which serve as the ultimate litmus test.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          , T. Mikolov,
          <article-title>Distributed representations of sentences and documents</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , I. Sutskever,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>26</volume>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wieting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Livescu</surname>
          </string-name>
          ,
          <article-title>Towards universal paraphrastic sentence embeddings</article-title>
          ,
          <source>arXiv preprint arXiv:1511.08198</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vijayaraghavan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <article-title>Tweet2vec: Learning tweet embeddings using character-level cnn-lstm encoder-decoder</article-title>
          ,
          <source>in: Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1041</fpage>
          -
          <lpage>1044</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Dhingra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fitzpatrick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muehl</surname>
          </string-name>
          , W. Cohen,
          <article-title>Tweet2vec: Character-based distributed representations for social media</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <article-title>Learning distributed representations of sentences from unlabelled data</article-title>
          ,
          <source>in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1367</fpage>
          -
          <lpage>1377</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bordes</surname>
          </string-name>
          ,
          <article-title>Supervised learning of universal sentence representations from natural language inference data</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , S.-y. Kong,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Limtiaco</surname>
          </string-name>
          ,
          <string-name>
            R. S. John,
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guajardo-Cespedes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tar</surname>
          </string-name>
          , et al.,
          <article-title>Universal sentence encoder</article-title>
          ,
          <source>arXiv preprint arXiv:1803.11175</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , W. Ma, R. Liu,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <article-title>An empirical survey of unsupervised text representation methods on twitter data</article-title>
          ,
          <source>in: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>209</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Representation learning for short text clustering</article-title>
          ,
          <source>in: International Conference on Web Information Systems Engineering</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>321</fpage>
          -
          <lpage>335</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kayal</surname>
          </string-name>
          ,
          <article-title>Unsupervised sentence-embeddings by manifold approximation and projection</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <year>2021</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Simcse: Simple contrastive learning of sentence embeddings</article-title>
          ,
          <source>in: EMNLP (1)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Shou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <article-title>Whiteningbert: An easy unsupervised sentence embedding approach</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>238</fpage>
          -
          <lpage>244</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <article-title>Sentence embeddings using supervised contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2106.04791</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Leung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Haghighi</surname>
          </string-name>
          ,
          <article-title>CTM - a model for large-scale multi-view tweet topic classification</article-title>
          ,
          <source>in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track</source>
          , Association for Computational Linguistics, Hybrid: Seattle, Washington + Online,
          <year>2022</year>
          , pp.
          <fpage>247</fpage>
          -
          <lpage>258</lpage>
          . URL: https://aclanthology.org/2022.naacl-industry.28. doi:10.18653/v1/2022.naacl-industry.28.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Dolan</surname>
          </string-name>
          ,
          <article-title>SemEval-2015 Task 1: Paraphrase and semantic similarity in Twitter (PIT)</article-title>
          ,
          <source>in: Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)</source>
          , Association for Computational Linguistics,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>11</lpage>
          . URL: https://www.aclweb.org/anthology/S15-2001. doi:10.18653/v1/S15-2001.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>