Robust Training Objectives Improve Embedding-Based Retrieval in Industrial Recommendation Systems

Matthew Kolodner1,*, Mingxuan Ju1, Zihao Fan1, Tong Zhao1, Elham Ghazizadeh1, Yan Wu1, Neil Shah1 and Yozen Liu1

1 Snap, Inc., 2772 Donald Douglas Loop N, Santa Monica, CA 90405, USA

Abstract
Improving recommendation systems (RS) can greatly enhance the user experience across many domains, such as social media. Many RS utilize embedding-based retrieval (EBR) approaches to retrieve candidates for recommendation. In an EBR system, the embedding quality is key. According to recent literature, self-supervised multitask learning (SSMTL) has shown strong performance on academic benchmarks for embedding learning and has resulted in overall improvement across multiple downstream tasks, demonstrating a larger resilience to the adverse conditions between each downstream task and thereby increased robustness and task generalization ability through the training objective. However, whether or not the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., over hundreds of millions of users and interactions in-between) industrial RS still requires verification. Simply adopting academic setups in industrial RS might entail two issues. Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RS. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer. In light of these two challenges, we evaluate a robust training objective, specifically SSMTL, on a large-scale friend recommendation system at a social media platform in the tech sector, identifying whether this increase in robustness can work at scale to enhance retrieval in the production setting. Through online A/B testing with SSMTL-based EBR, we observe statistically significant increases in key metrics for friend recommendation, with up to 5.45% improvements in new friends made and 1.91% improvements in new friends made with cold-start users. Moreover, through a dedicated case study, we demonstrate the benefits of robust training objectives via SSMTL on large-scale graphs, with gains in both retrieval and end-to-end friend recommendation.

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October, 2024, Bari, Italy.
* Corresponding author.
Emails: mkolodner@snap.com (M. Kolodner); mju@snap.com (M. Ju); zfan3@snap.com (Z. Fan); tong@snap.com (T. Zhao); eghazizadeh@snap.com (E. Ghazizadeh); ywu@snap.com (Y. Wu); nshah@snap.com (N. Shah); yliu2@snap.com (Y. Liu)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Recommendation systems (RS) have become a crucial component of the user experience [1, 2]. Most industrial RS employ a two-stage process [3]. During the first stage (i.e., the retrieval phase), among hundreds of millions of candidate users/items, the RS usually utilizes several models optimized for recall to select a small set of candidate users/items (e.g., 1,000 candidates). During the second stage (i.e., the ranking phase), within the candidate subset, the RS can employ complicated, expensive models optimized for precision to select the top $K$ candidates for the final recommendation. Such a two-stage process enables recommendation over large quantities of possible users/items and allows for greater flexibility towards key recommendation metrics.

In this two-stage scheme, the retrieval stage is especially important, as it acts as the bottleneck for possible candidates provided to the ranker in the second stage. One common approach [4, 5] for the retrieval step is to leverage embedding-based retrieval (EBR). Specifically, EBR learns embeddings for all users and items as vectors in a low-dimensional latent space. These embeddings are learned such that the distance between them reflects their similarity, with more similar items being closer together in the latent space. As a result, candidates can be retrieved through a nearest-neighbor search across the latent space. In practice, this is done using approximate nearest neighbor methods optimized for large-scale retrieval, such as FAISS [6] and HNSW [7].
Many methods [8, 9, 10, 11] have been proposed for generating high-quality embeddings for EBR, which lead to more relevant candidates and improved metrics in the end-to-end recommendation. In this work, we specifically focus on the friend recommendation EBR setting, where vast amounts of topological information relating users are readily available. Recent works [12, 13, 14] have shown that including this relational information can improve the embedding quality. The relational information is commonly modeled with graph neural networks (GNNs), producing embeddings that leverage neighbor information in graphs, such as co-friend relationships. For graph-aware EBR in particular, link prediction has seen success for generating high-quality embeddings [15], where we look to predict the presence of an edge between a query node and a set of candidate nodes.

While link prediction is effective in learning nuanced similarities and distinctions between candidates, there are several other self-supervised graph learning philosophies that can provide high-quality embeddings, such as mutual information maximization [16], generative reconstruction [17], or whitening decorrelation [18]. Based on these general philosophies, many graph-based approaches have been proposed and used to learn embeddings directly, achieving desirable embedding properties without requiring explicit labels. Recently, Ju et al. [19] evaluated combining these self-supervised learning approaches with link prediction in a multitask (MTL) setting, demonstrating a larger resilience to the adverse conditions between each downstream task and thereby increased robustness and generalization ability through the training objective.

However, whether or not the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., over hundreds of millions of users and interactions in-between) industrial RSs still requires verification. Simply adopting academic setups in industrial RSs might result in several issues. Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RSs. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer [20], a phenomenon where performance can worsen as a result of the complexity and potentially opposing nature of the various tasks.

In this work, we investigate whether robust SSMTL training objectives are able to improve link prediction retrieval performance on large-scale graphs with over hundreds of millions of nodes and edges. Specifically, we look to find what combination of SSL approaches can improve overall robustness and thereby augment retrieval through complementary yet disjoint information. In our experiments, we find two SSL approaches, based on philosophies from whitening decorrelation (e.g., Canonical Correlation Analysis [21]) and generative reconstruction (e.g., Masked Autoencoders [22]), that are able to augment the performance of link prediction without negative transfer. We deploy the proposed framework on an industrial large-scale friend recommendation system serving a community of hundreds of millions of users. In online A/B testing, we observe significant improvements in key metrics like new friends made, especially with cold-start users on the platform. Our contributions are summarized as follows:

• We demonstrate the effectiveness of robust training objectives such as SSMTL in a large-scale industrial recommendation system.
• We conduct an online study of SSMTL on a massive real-world recommendation system, and observe a statistically significant increase in key metrics, with up to 5.45% improvements in new friends made and 1.91% improvements in new friends made with cold-start users.
Figure 1: In our proposed SSMTL framework, we combine the CCA and MAE SSL methods with the retrieval task in our embedding generation scheme for EBR. CCA looks to maximize the correlation of two augmented views of the input subgraph while decorrelating the features of a single view. MAE seeks to reconstruct the query user nodes after they are propagated through the GNN encoder backbone. Finally, the retrieval task seeks to predict which candidates share a link with the query user using a categorical cross-entropy loss. The loss of each subtask is weighted and summed to form the final loss. Embeddings for EBR can then be generated through the GNN encoder.

2. Background

2.1. Graph-Aware Embedding-Based Retrieval

In a two-stage recommendation system with a retrieval and then a ranking phase, the retrieval phase plays the important role of filtering out the most relevant candidates to lighten the load of the ranker. Since the ranking result is largely dependent on the items retrieved in the retrieval phase, a good-quality retrieval model can drastically improve the final ranking. Embedding-based retrieval (EBR) is a method that has recently been adopted and deployed in many content, product, and friend recommendation systems [4, 23, 24, 12], and has proved to achieve superior results. EBR transforms users and items into embeddings, turning the retrieval problem into a nearest-neighbor search problem in a low-dimensional latent space. These embeddings can be computed in advance and indexed using an approximate nearest neighbor search method such as FAISS [6] or HNSW [7] in order to retrieve the top-$k$ most relevant items efficiently at serving time.
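To make this serving path concrete, below is a minimal sketch of the index-then-search pattern using FAISS. The dimensions, the exact flat inner-product index, and the random embeddings are illustrative assumptions rather than our production configuration, where an approximate index over hundreds of millions of items would be used.

```python
import faiss
import numpy as np

# Toy stand-ins for precomputed embeddings; production embeddings come from
# the trained encoder, not random numbers.
dim, num_candidates, k = 64, 100_000, 10
candidate_emb = np.random.rand(num_candidates, dim).astype("float32")
query_emb = np.random.rand(5, dim).astype("float32")

# L2-normalize so that inner product equals cosine similarity.
faiss.normalize_L2(candidate_emb)
faiss.normalize_L2(query_emb)

# Exact inner-product index for illustration; at industrial scale an
# approximate index (e.g., faiss.IndexHNSWFlat) would replace it.
index = faiss.IndexFlatIP(dim)
index.add(candidate_emb)

scores, ids = index.search(query_emb, k)  # top-k candidate ids per query
```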
When applying EBR to RS problems, the quality of the embeddings is of utmost importance. In this paper, we use a friend recommendation system as our subject. In scenarios like friend recommendation, where vast amounts of topological information relating users and items are readily available, these embeddings can be augmented with GNNs. Previous work showed that EBR for friend recommendation systems benefits from leveraging graph-aware embeddings [12]. In this setting, nodes contain individual user features while edges map to user-user interactions. This approach complements commonly used graph traversal approaches (e.g., friend-of-friend (FoF) [25]), allowing for retrieval of candidates any number of hops away from the target.

Here we describe GNNs for generating graph-aware embeddings for EBR. GNNs have demonstrated state-of-the-art performance in many problems containing rich topological information within the graph data [26], such as recommendation and forecasting. Formally, we define $G = (\mathcal{V}, \mathcal{E}, X)$, where $\mathcal{V}$ is the set of $n$ nodes ($|\mathcal{V}| = n$), $\mathcal{E}$ is the set of edges ($\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$), and $X \in \mathbb{R}^{n \times d}$ is a feature matrix of dimension $d$. Many modern GNNs employ a message-passing structure, consisting of an aggregation (AGG) and an update (UPD) function. The goal of this paradigm is for nodes to receive information from their neighbors: each node collects messages with the AGG function before updating its own representation with the UPD function, both of which are learnable and permutation-invariant. For some node $u$ at layer $k$, the next message-passing layer can be written as

$$h_u^{(k+1)} = \mathrm{UPD}^{(k)}\left(h_u^{(k)},\ \mathrm{AGG}^{(k)}\left(\left\{h_v^{(k)},\ \forall v \in \mathcal{N}(u)\right\}\right)\right) \quad (1)$$

where $\mathcal{N}(u)$ is the set of neighboring nodes of node $u$. Different message-passing GNN models use different combinations of AGG and UPD functions. As an example of a more complex GNN, graph attention networks (GATs) [27] use an attention mechanism for each pair of nodes $i$ and $j$:

$$\alpha_{ij} = \mathrm{softmax}_j\left(f_{\mathrm{att}}\left(\mathbf{W}h_i, \mathbf{W}h_j\right)\right) \quad (2)$$

where $\mathbf{W}$ is a linear transformation applied to every node and $f_{\mathrm{att}}$ is the attention function parameterized by a weight vector and a non-linearity. The AGG function is then an attention-weighted sum of a node's neighbor features, while the UPD function is implicitly defined by $\mathbf{W}$ and the non-linearity. Typically, to generate graph-aware embeddings from GNNs, a margin-based ranking loss [13, 12] or a contrastive loss [28] can be used to encourage items that are closer in the graph to be closer in the embedding space.
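As a concrete (if simplified) instance of Equation 1, the sketch below implements one message-passing layer with a mean AGG and a learnable MLP UPD in PyTorch. It operates on a dense adjacency matrix for readability and is a stand-in for the attention-based GAT layers described above, not our production encoder.

```python
import torch
import torch.nn as nn

class MeanMessagePassingLayer(nn.Module):
    """One layer of Eq. (1): h_u' = UPD(h_u, AGG({h_v : v in N(u)}))."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # UPD: a learnable function of the node's own state and the aggregate.
        self.update = nn.Sequential(nn.Linear(2 * in_dim, out_dim), nn.ReLU())

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # AGG: a permutation-invariant mean over each node's neighbors.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neighbor_mean = (adj @ h) / deg
        # UPD: combine the node's own state with the aggregated messages.
        return self.update(torch.cat([h, neighbor_mean], dim=1))

# Usage on a toy 4-node graph.
h = torch.randn(4, 16)                    # node features
adj = (torch.rand(4, 4) > 0.5).float()    # toy adjacency matrix
out = MeanMessagePassingLayer(16, 32)(h, adj)  # shape: (4, 32)
```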
2.2. Multitask Learning

Multitask learning (MTL) is an approach in machine learning where a model is trained simultaneously on several tasks. MTL has been extensively explored in recommendation as a way to improve key metrics [29, 30, 31, 32]. The core idea behind multitask learning is to improve the robustness of the model by leveraging the domain-specific information contained in the training signals of related tasks [33, 34]. Hard parameter sharing, one of the most fundamental forms of MTL, uses a shared representation which then branches into multiple heads capable of learning task-specific information [35, 36, 37].

For graph-aware EBR in particular, self-supervised multitask learning (SSMTL) has been proposed as a new approach to MTL, optimizing the embeddings directly to achieve desirable embedding properties without the use of positive or negative labels. In this setting, we combine several self-supervised learning (SSL) methods with a downstream retrieval task to learn both direct and indirect embedding features. Recent work [19] has shown that SSMTL can lead to improved task generalization and embedding quality on several academic benchmarks through the increasingly robust training objective. However, many of the SSL approaches used are constrained by the assumption that global graph information can be inferred from the graph structure. This is not valid in the large-scale recommendation setting, where graphs are constrained to some $K$-hop neighborhood around a query user in order to fit in memory. As a result, many of these SSL methods may lead to negative transfer due to SSL task conflict with the target link prediction task, and there remains work to be done to investigate which methods perform best in this large-scale setting.

3. Self-Supervised Multitask Learning for EBR

In the following sections, we describe the details of the SSL methods used in our SSMTL approach and our experimental setup and results, highlighting the benefits and impact of including SSMTL-based embeddings in EBR for large-scale industrial recommendation systems.

3.1. Self-Supervised Learning Methods

We identify two self-supervised learning approaches that are scalable and lead to improvements in the large-scale recommendation setting through a more robust training objective.

Canonical Correlation Analysis. Based on work from [21], Canonical Correlation Analysis (CCA) deploys a non-contrastive, non-discriminative SSL method to train the GNN. The self-supervised training objective is described in Equation 3. First, given a subgraph with $n$ nodes, two augmented views of the subgraph are created and fed through the GNN, producing $Z_A$ and $Z_B$ where $Z_A, Z_B \in \mathbb{R}^{n \times k}$. Each of these embeddings is fed through a task-specific head and then normalized so that each feature has zero mean and $\frac{1}{\sqrt{n}}$ standard deviation, resulting in $\tilde{Z}_A$ and $\tilde{Z}_B$. The loss is then computed from Equation 3. The first term in the equation seeks to minimize the distance between the same nodes in the two views. The second term enforces that the feature-wise covariance of all nodes is equal to the identity matrix.

$$\mathcal{L}_{\mathrm{CCA}} = \left\|\tilde{Z}_A - \tilde{Z}_B\right\|_F^2 + \lambda\left(\left\|\tilde{Z}_A^\top \tilde{Z}_A - I\right\|_F^2 + \left\|\tilde{Z}_B^\top \tilde{Z}_B - I\right\|_F^2\right) \quad (3)$$
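A minimal PyTorch sketch of Equation 3 follows, assuming `z_a` and `z_b` are the two views' embeddings after the task-specific head; the default value of λ is an arbitrary placeholder, not a tuned production weight.

```python
import torch

def cca_ssl_loss(z_a: torch.Tensor, z_b: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """Eq. (3): invariance term plus feature-decorrelation penalty.

    z_a, z_b: (n, k) embeddings of two augmented views of the same subgraph.
    """
    n = z_a.shape[0]
    # Normalize each feature to zero mean and 1/sqrt(n) standard deviation,
    # so Z^T Z becomes the feature-wise covariance matrix.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) * n**0.5)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) * n**0.5)

    invariance = ((z_a - z_b) ** 2).sum()  # pull matched nodes together
    eye = torch.eye(z_a.shape[1], device=z_a.device)
    decorrelation = (((z_a.T @ z_a) - eye) ** 2).sum() \
                  + (((z_b.T @ z_b) - eye) ** 2).sum()
    return invariance + lam * decorrelation
```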
Masked Autoencoders. Based on work from [22], this approach leverages a graph masked autoencoder (MAE) that focuses on feature reconstruction. First, an augmented view of the subgraph is created and the features of the query users are masked out. This augmented graph is then fed through the GNN and a task-specific head. The features of the query users are then re-masked and passed through a graph convolution layer. As described in Equation 4, for the set of masked nodes $\tilde{\mathcal{V}}$, the final loss is the average scaled cosine error between the original features $X$ and the generated features $Z$. This approach relies only on the local neighborhood surrounding the query node, making it a good option for large-scale SSMTL.

$$\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|\tilde{\mathcal{V}}|} \sum_{v_i \in \tilde{\mathcal{V}}} \left(1 - \frac{x_i^\top z_i}{\|x_i\| \cdot \|z_i\|}\right)^{y}, \quad y \geq 1 \quad (4)$$
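The scaled cosine error of Equation 4 is compact enough to sketch directly; here `x` and `z` are assumed to hold the original and reconstructed features of the masked query nodes, and the default exponent is a placeholder rather than a tuned value.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_error(x: torch.Tensor, z: torch.Tensor, y: float = 2.0) -> torch.Tensor:
    """Eq. (4): mean scaled cosine error over the masked nodes.

    x: (m, d) original features of the m masked nodes.
    z: (m, d) reconstructed features from the decoder.
    y: the sharpening exponent (y >= 1).
    """
    cos = F.cosine_similarity(x, z, dim=1)  # x_i^T z_i / (||x_i|| * ||z_i||)
    return ((1.0 - cos) ** y).mean()
```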
We note that these two approaches both utilize non-contrastive methods. While experimenting with different SSL tasks, we found that contrastive SSL approaches do not perform well in the production setting due to their assumption that global information is readily available in the original and augmented graphs. This is not necessarily true for large-scale recommendation, where subgraphs are constrained to the $K$-hop neighborhood surrounding each query node.

3.2. Experimental Setup

3.2.1. Problem Breakdown

We evaluate SSMTL as a robust training objective on an industrial friend recommendation system with hundreds of millions of users and connections. To handle this scale of training, we sample subgraphs containing the $k$-hop neighborhood around each query user. Following training, the embeddings for EBR can be generated via propagation through the encoder backbone.

3.2.2. Retrieval Baseline

The baseline model uses a supervised single-task setup for embedding-based retrieval. We use a GAT as the GNN encoder backbone to obtain embeddings for the query user and each candidate, producing a candidate embedding matrix. We then compute the dot product between the query user and each candidate to produce logits $z$, and apply a softmax. We then calculate the categorical cross-entropy loss with the true labels $y$ across the $N = 2$ classes and $M$ candidates, as outlined in Equation 5.

$$\mathcal{L}_{\mathrm{retrieval}} = -\sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log\left(\frac{e^{z_{ij}}}{\sum_{k=1}^{M} e^{z_{ik}}}\right) \quad (5)$$
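The sketch below shows one plausible reading of this objective: dot-product logits between each query and its $M$ candidates, followed by softmax cross entropy against the index of the linked candidate. The batched layout and single-positive labels are simplifying assumptions, not the exact production label scheme.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(query_emb: torch.Tensor,
                   cand_emb: torch.Tensor,
                   labels: torch.Tensor) -> torch.Tensor:
    """Eq. (5)-style softmax cross entropy over each query's candidates.

    query_emb: (B, d) query-user embeddings from the GNN encoder.
    cand_emb:  (B, M, d) candidate embeddings per query.
    labels:    (B,) index of the true (linked) candidate for each query.
    """
    # Dot-product logits between each query and its M candidates: (B, M).
    logits = torch.einsum("bd,bmd->bm", query_emb, cand_emb)
    # cross_entropy applies the softmax and negative log-likelihood of Eq. (5).
    return F.cross_entropy(logits, labels)
```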
3.2.3. SSMTL Implementation Details

In our SSMTL approach, we use both CCA and MAE in combination with the retrieval baseline as the training objectives. All three methods share the same GAT GNN backbone. The augmented views for CCA and MAE are created separately, with CCA performing edge-drop and feature-drop augmentations while MAE performs edge drop and query-node masking. The task-specific head for CCA is a Linear-ReLU-Linear block, while the task-specific head for MAE is a single linear layer. The final loss with SSMTL is a weighted sum of the losses:

$$\mathcal{L}_{\mathrm{combined}} = \alpha\mathcal{L}_{\mathrm{retrieval}} + \beta\mathcal{L}_{\mathrm{CCA}} + \gamma\mathcal{L}_{\mathrm{MAE}} \quad (6)$$

where $\alpha$ is the weight of the retrieval loss, $\beta$ is the weight of the CCA loss, and $\gamma$ is the weight of the MAE loss. In practice, we observed the best performance when the retrieval weight was several orders of magnitude larger than the other loss weights.
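Putting the three objectives together, one training step might look like the sketch below, reusing the loss functions sketched above. The `model` interface (a shared GAT backbone exposing one forward per task) and the default loss weights are hypothetical; per the observation above, the retrieval weight should dominate.

```python
import torch

def ssmtl_step(model, batch, optimizer,
               alpha: float = 1.0, beta: float = 1e-3, gamma: float = 1e-3) -> float:
    """One SSMTL step over Eq. (6), combining retrieval, CCA, and MAE losses.

    `model` and `batch` are hypothetical interfaces: each *_forward method is
    assumed to run the shared GAT backbone plus the task-specific head and
    return the tensors expected by the corresponding loss sketch above.
    """
    optimizer.zero_grad()
    loss = (alpha * retrieval_loss(*model.retrieval_forward(batch))
            + beta * cca_ssl_loss(*model.cca_forward(batch))
            + gamma * scaled_cosine_error(*model.mae_forward(batch)))
    loss.backward()  # gradients flow into the shared backbone and all heads
    optimizer.step()
    return loss.item()
```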
3.3. Results

We evaluated the effectiveness of SSMTL for end-to-end friend recommendation with online A/B testing. The control group used candidates retrieved from the production model trained with the retrieval baseline, while the treatment group instead used candidates retrieved with the new robust training objective in the SSMTL setting, specifically combining the previous retrieval loss with the whitening decorrelation and generative reconstruction objectives.

In the A/B experimental results, we saw statistically significant improvements across several friend recommendation metrics. Specifically, we observed up to 5.45% improvements in new friends made and up to 1.91% improvements in new friends made with low-degree users across various markets. Overall, from these results, we see that SSMTL is able to provide improved recommendation compared with the single-task setting, in particular helping with candidate generation for low-degree users.

4. Conclusion

In this paper, we evaluate the effectiveness of a robust self-supervised multitask learning objective in embedding-based retrieval. Through online evaluation, we demonstrate that self-supervised methods used in a multitask setting are able to augment the performance of the underlying retrieval task at the scale of over 800 million nodes and edges, providing complementary yet disjoint information to enhance the embedding quality. We observe statistically significant gains in the number of friendships made for both high- and low-degree users.

References

[1] Y. Li, K. Liu, R. Satapathy, S. Wang, E. Cambria, Recent developments in recommender systems: A survey, 2023. arXiv:2306.12680.
[2] A. Sun, Y. Peng, A survey on modern recommendation system based on big data, 2024. arXiv:2206.02631.
[3] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 191–198. doi:10.1145/2959100.2959190.
[4] J. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, L. Yang, Embedding-based retrieval in Facebook search, CoRR abs/2006.11632 (2020). arXiv:2006.11632.
[5] Y. Gan, Y. Ge, C. Zhou, S. Su, Z. Xu, X. Xu, Q. Hui, X. Chen, Y. Wang, Y. Shan, Binary embedding-based retrieval at Tencent, 2023. arXiv:2302.08714.
[6] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, CoRR abs/1702.08734 (2017). arXiv:1702.08734.
[7] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, CoRR abs/1603.09320 (2016). arXiv:1603.09320.
[8] Y. Zhang, X. Dong, W. Ding, B. Li, P. Jiang, K. Gai, Divide and conquer: Towards better embedding-based retrieval for recommender systems from a multi-task perspective, 2023. arXiv:2302.02657.
[9] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Computing 7 (2003) 76–80. doi:10.1109/MIC.2003.1167344.
[10] R. Jha, S. Subramaniyam, E. Benjamin, T. Taula, Unified embedding based personalized retrieval in Etsy search, 2023. arXiv:2306.04833.
[11] R. Peng, K. Liu, P. Yang, Z. Yuan, S. Li, Embedding-based retrieval with LLM for effective agriculture information extracting from unstructured data, 2023. arXiv:2308.03107.
[12] J. Shi, V. Chaurasiya, Y. Liu, S. Vij, Y. Wu, S. Kanduri, N. Shah, P. Yu, N. Srivastava, L. Shi, G. Venkataraman, J. Yu, Embedding based retrieval in friend recommendation, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 3330–3334. doi:10.1145/3539618.3591848.
[13] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, CoRR abs/1806.01973 (2018). arXiv:1806.01973.
[14] P. P.-H. Kung, Z. Fan, T. Zhao, Y. Liu, Z. Lai, J. Shi, Y. Wu, J. Yu, N. Shah, G. Venkataraman, Improving embedding-based retrieval in friend recommendation with ANN query expansion, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2930–2934.
[15] C. Li, X. Peng, Y. Niu, S. Zhang, H. Peng, C. Zhou, J. Li, Learning graph attention-aware knowledge graph embedding, Neurocomputing 461 (2021) 516–529. doi:10.1016/j.neucom.2021.01.139.
[16] A. v. d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748 (2018).
[17] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[18] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 3015–3024.
[19] M. Ju, T. Zhao, Q. Wen, W. Yu, N. Shah, Y. Ye, C. Zhang, Multi-task self-supervised graph neural networks enable stronger task generalization, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=1tHAZRqftM.
[20] L. Torrey, J. Shavlik, Transfer Learning, IGI Global, 2010, pp. 242–264.
[21] H. Zhang, Q. Wu, J. Yan, D. Wipf, P. S. Yu, From canonical correlation analysis to self-supervised graph neural networks, CoRR abs/2106.12484 (2021). arXiv:2106.12484.
[22] Z. Hou, X. Liu, Y. Cen, Y. Dong, H. Yang, C. Wang, J. Tang, GraphMAE: Self-supervised masked graph autoencoders, 2022. arXiv:2205.10803.
[23] P. Covington, J. Adams, E. Sargin, Deep neural networks for YouTube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.
[24] T. Koh, G. Wu, M. Mi, Manas HNSW realtime: Powering realtime embedding-based retrieval, 2021.
[25] M. E. J. Newman, Clustering and preferential attachment in growing networks, Physical Review E 64 (2001). doi:10.1103/PhysRevE.64.025102.
[26] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, M. Sun, Graph neural networks: A review of methods and applications, CoRR abs/1812.08434 (2018). arXiv:1812.08434.
[27] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, 2018. arXiv:1710.10903.
[28] Z. Liu, Y. Ma, Y. Ouyang, Z. Xiong, Contrastive learning for recommender system, CoRR abs/2101.01317 (2021). arXiv:2101.01317.
[29] Y. Lu, R. Dong, B. Smyth, Why I like it: multi-task learning for recommendation and explanation, 2018, pp. 4–12. doi:10.1145/3240323.3240365.
[30] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, E. H. Chi, Modeling task relationships in multi-task learning with multi-gate mixture-of-experts, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1930–1939. doi:10.1145/3219819.3220007.
[31] X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu, K. Gai, Entire space multi-task model: An effective approach for estimating post-click conversion rate, 2018. arXiv:1804.07931.
[32] H. Tang, J. Liu, M. Zhao, X. Gong, Progressive layered extraction (PLE): A novel multi-task learning (MTL) model for personalized recommendations, in: Proceedings of the 14th ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 269–278. doi:10.1145/3383313.3412236.
[33] A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in: B. Schölkopf, J. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems, volume 19, MIT Press, 2006. URL: https://proceedings.neurips.cc/paper_files/paper/2006/file/0afa92fc0f8a9cf051bf2961b06ac56b-Paper.pdf.
[34] R. Caruana, Multitask learning, Machine Learning 28 (1997) 41–75.
[35] P. Guo, C.-Y. Lee, D. Ulbricht, Learning to branch for multi-task learning, 2020. arXiv:2006.01895.
[36] X. Sun, R. Panda, R. S. Feris, AdaShare: Learning what to share for efficient deep multi-task learning, CoRR abs/1911.12423 (2019). arXiv:1911.12423.
[37] S. Vandenhende, S. Georgoulis, B. D. Brabandere, L. V. Gool, Branched multi-task networks: Deciding what layers to share, 2020. arXiv:1904.02920.