Robust Training Objectives Improve Embedding-Based Retrieval in Industrial Recommendation Systems

Matthew Kolodner1,*, Mingxuan Ju1, Zihao Fan1, Tong Zhao1, Elham Ghazizadeh1, Yan Wu1, Neil Shah1 and Yozen Liu1
1 Snap, Inc., 2772 Donald Douglas Loop N, Santa Monica, CA 90405, USA

RobustRecSys: Design, Evaluation, and Deployment of Robust Recommender Systems Workshop @ RecSys 2024, 18 October 2024, Bari, Italy.
* Corresponding author.
Emails: mkolodner@snap.com (M. Kolodner); mju@snap.com (M. Ju); zfan3@snap.com (Z. Fan); tong@snap.com (T. Zhao); eghazizadeh@snap.com (E. Ghazizadeh); ywu@snap.com (Y. Wu); nshah@snap.com (N. Shah); yliu2@snap.com (Y. Liu)
Β© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract
Improving recommendation systems (RS) can greatly enhance the user experience across many domains, such as social media. Many RS utilize embedding-based retrieval (EBR) approaches to retrieve candidates for recommendation, and in an EBR system the embedding quality is key. According to recent literature, self-supervised multitask learning (SSMTL) has shown strong performance on academic embedding-learning benchmarks and has improved multiple downstream tasks, demonstrating greater resilience to the adverse interactions between downstream tasks and thereby increased robustness and task generalization ability through the training objective. However, whether the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., hundreds of millions of users and the interactions in-between) industrial RS still requires verification. Simply adopting academic setups in industrial RS might entail two issues. Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RS. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer. In light of these two challenges, we evaluate a robust training objective, specifically SSMTL, on a large-scale friend recommendation system at a social media platform, identifying whether this increase in robustness can work at scale to enhance retrieval in the production setting. Through online A/B testing with SSMTL-based EBR, we observe statistically significant increases in key friend recommendation metrics, with up to 5.45% improvement in new friends made and 1.91% improvement in new friends made with cold-start users. Moreover, through a dedicated case study, we demonstrate the benefits of robust training objectives via SSMTL on large-scale graphs, with gains in both retrieval and end-to-end friend recommendation.


1. Introduction

Recommendation systems (RS) have become a crucial component of the user experience [1, 2]. Most industrial RS adopt a two-stage process [3]. During the first stage (i.e., the retrieval phase), among hundreds of millions of candidate users/items, the RS usually utilizes several models optimized for recall to select a small set of candidates (e.g., 1,000). During the second stage (i.e., the ranking phase), within this candidate subset, the RS can employ complicated, expensive models that are optimized for precision to select the top K candidates for the final recommendation. Such a two-stage process enables recommendation over large quantities of possible users/items and allows for greater flexibility towards key recommendation metrics.

In this two-stage scheme, the retrieval stage is especially important, as it acts as the bottleneck for possible candidates provided to the ranker in the second stage. One common approach [4, 5] for the retrieval step is to leverage embedding-based retrieval (EBR). Specifically, EBR learns embeddings for all users and items as vectors in a low-dimensional latent space. These embeddings are learned such that the distance between them reflects their similarity, with more similar items being closer together in the latent space. As a result, candidates can be retrieved through a nearest-neighbor search across the latent space. In practice, this is done using approximate nearest-neighbor methods optimized for large-scale retrieval, such as FAISS [6] and HNSW [7].

Many methods [8, 9, 10, 11] have been proposed for generating high-quality embeddings for EBR, which lead to more relevant candidates and improved metrics in the end-to-end recommendation. In this work, we specifically focus on the friend recommendation EBR setting, where vast amounts of topological information relating users are readily available. Recent works [12, 13, 14] have shown that including this relational information can improve the embedding quality. The relational information is commonly modeled with graph neural networks (GNNs), producing embeddings that leverage neighbor information in graphs, such as co-friend relationships. For graph-aware EBR in particular, link prediction has seen success for generating high-quality embeddings [15], where we look to predict the presence of an edge between a query node and a set of candidate nodes.

While link prediction is effective in learning nuanced similarities and distinctions between candidates, there are several other self-supervised graph learning philosophies that can provide high-quality embeddings, such as mutual information maximization [16], generative reconstruction [17], or whitening decorrelation [18]. Based on these general philosophies, many graph-based approaches have been proposed and used to learn embeddings directly, achieving desirable embedding properties without requiring explicit labels. Recently, Ju et al. [19] evaluated combining these self-supervised learning approaches with link prediction in a multitask (MTL) setting, demonstrating greater resilience to the adverse interactions between downstream tasks and thereby increased robustness and generalization ability through the training objective.

However, whether the success of SSMTL in academia as a robust training objective translates to large-scale (i.e., over hundreds of millions of users and interactions in-between) industrial RS still requires verification. Simply adopting academic setups in industrial RS might result in several issues. Firstly, many self-supervised objectives require data augmentations (e.g., embedding masking/corruption) over a large portion of users and items, which is prohibitively expensive in industrial RS. Furthermore, some self-supervised objectives might not align with the recommendation task, which might lead to redundant computational overheads or negative transfer [20], a phenomenon where performance can worsen as a result of the complexity and potentially opposing nature of the various tasks.



Figure 1: In our proposed SSMTL framework, we combine the CCA and MAE SSL methods with the retrieval task in our embedding generation scheme for EBR. CCA looks to maximize the correlation between two augmented views of the input subgraph while decorrelating the features of each view. MAE seeks to reconstruct the query user nodes after they are propagated through the GNN encoder backbone. Finally, the retrieval task seeks to predict which candidates share a link with the query user using a categorical cross-entropy loss. The loss of each subtask is weighted and summed to form the final loss. Embeddings for EBR can then be generated through the GNN encoder.



In this work, we investigate whether robust SSMTL training objectives are able to improve link prediction retrieval performance on large-scale graphs with hundreds of millions of nodes and edges. Specifically, we look to find what combination of SSL approaches can improve overall robustness and thereby augment retrieval through complementary yet disjoint information. In our experiments, we find two SSL approaches, based on the philosophies of whitening decorrelation (e.g., Canonical Correlation Analysis [21]) and generative reconstruction (e.g., Masked Autoencoders [22]), that are able to augment the performance of link prediction without negative transfer. We deploy the proposed framework in an industrial large-scale friend recommendation system serving a community of hundreds of millions of users. In online A/B testing, we observe significant improvements in key metrics like new friends made, especially for cold-start users on the platform. Our contributions are summarized as follows:

β€’ We demonstrate the effectiveness of robust training objectives such as SSMTL in a large-scale industrial recommendation system.

β€’ We conduct an online study of SSMTL on a massive real-world recommendation system, and observe a statistically significant increase in key metrics, with up to 5.45% improvements in new friends made and 1.91% improvements in new friends made with cold-start users.

2. Background

2.1. Graph-Aware Embedding-Based Retrieval

In a two-stage recommendation system with a retrieval phase followed by a ranking phase, the retrieval phase plays an important role in filtering out the most relevant candidates to lighten the load of the ranker. Since the ranking result is largely dependent on the items retrieved in the retrieval phase, a good-quality retrieval model can drastically improve the final ranking. Embedding-based retrieval (EBR) is a method that has recently been adopted and deployed in many content, product, and friend recommendation systems [4, 23, 24, 12], and has proved to achieve superior results. EBR transforms users and items into embeddings, turning the retrieval problem into a nearest-neighbor search problem in a low-dimensional latent space. These embeddings can be computed in advance and indexed using an approximate nearest-neighbor search method such as FAISS [6] or HNSW [7] in order to retrieve the top-k most relevant items efficiently at serving time.
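To make the indexing step concrete, the sketch below builds an approximate nearest-neighbor index over precomputed embeddings with FAISS and queries it for the top-k candidates. The dimensionality, index parameters, and random embeddings are illustrative assumptions rather than details of any production system.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 64  # embedding dimension (illustrative)
candidate_emb = np.random.rand(100_000, d).astype("float32")  # stand-in for learned embeddings
query_emb = np.random.rand(1, d).astype("float32")            # stand-in for a query user

# Build an HNSW index over the candidate embeddings (inner-product similarity
# would pass faiss.METRIC_INNER_PRODUCT; L2 is shown for simplicity).
index = faiss.IndexHNSWFlat(d, 32)  # 32 = neighbors per node in the HNSW graph
index.add(candidate_emb)

# Retrieve the top-k nearest candidates for the query user.
k = 1000
distances, candidate_ids = index.search(query_emb, k)
```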
When applying EBR to RS problems, the quality of embeddings is of utmost importance. In this paper, we use a friend recommendation system as our subject. In scenarios like friend recommendation, where vast amounts of topological information relating users and items are readily available, these embeddings can be augmented with GNNs. Previous work showed that EBR for friend recommendation systems benefits from leveraging graph-aware embeddings [12]. In this setting, nodes contain individual user features while edges map to user-user interactions. This approach complements commonly used graph traversal approaches (e.g., friend-of-friend (FoF) [25]), allowing for retrieval of candidates any number of hops away from the target.

Here we describe GNNs for generating graph-aware embeddings for EBR. GNNs have demonstrated state-of-the-art performance in many problems containing rich topological information within the graph data [26], such as recommendation and forecasting. Formally, we define 𝐺 = (𝒱, β„°, 𝑋),
where 𝒱 is the set of 𝑛 nodes (|𝒱| = 𝑛), β„° βŠ† 𝒱 Γ— 𝒱 is the set of edges, and 𝑋 ∈ ℝ^{𝑛×𝑑} is a feature matrix with feature dimension 𝑑. Many modern GNNs employ a message-passing structure, consisting of an aggregation (AGG) and an update (UPD) function. The goal of this paradigm is for nodes to receive information from their neighbors, collecting messages with the AGG function before updating their own representations with the UPD function, both of which are learnable and permutation-invariant. For some node 𝑒 at layer π‘˜, the next message-passing layer can be written as

\[
\mathbf{h}_u^{(k+1)} = \mathrm{UPD}^{(k)}\left(\mathbf{h}_u^{(k)},\; \mathrm{AGG}^{(k)}\left(\{\mathbf{h}_v^{(k)} : \forall v \in \mathcal{N}(u)\}\right)\right) \tag{1}
\]

where 𝒩(𝑒) is the set of neighbors of node 𝑒. Different message-passing GNN models use different combinations of AGG and UPD functions. As an example of a more complex GNN, graph attention networks (GATs) [27] use an attention mechanism for each pair of nodes 𝑖 and 𝑗:

\[
\alpha_{ij} = \mathrm{softmax}_j\left(f_{\mathrm{att}}(\mathbf{W}\mathbf{h}_i, \mathbf{W}\mathbf{h}_j)\right) \tag{2}
\]

where 𝐖 is a linear transformation applied to every node and 𝑓att is the attention function parameterized by a weight vector and a non-linearity. The AGG function is then an attention-weighted sum of a node's neighbors' features, while the UPD function is implicitly defined by 𝐖 and the non-linearity. Typically, to generate graph-aware embeddings from GNNs, a margin-based ranking loss [13, 12] or a contrastive loss [28] can be used to encourage items that are closer in the graph to be closer in the embedding space.
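For concreteness, the following is a minimal single-head, dense-adjacency GAT-style layer in PyTorch that instantiates Equations 1 and 2; it is a simplified sketch, not the production encoder, and the class and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGATLayer(nn.Module):
    """Single-head GAT-style layer: AGG is an attention-weighted sum over
    neighbors (Eq. 2); UPD is implicit in the shared linear map W."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared linear transform
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention function f_att

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: [n, in_dim] node features; adj: [n, n] binary adjacency (with self-loops)
        z = self.W(h)                          # [n, out_dim]
        n = z.size(0)
        # Pairwise attention logits e_ij = f_att(W h_i, W h_j)
        zi = z.unsqueeze(1).expand(n, n, -1)   # [n, n, out_dim]
        zj = z.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.a(torch.cat([zi, zj], dim=-1))).squeeze(-1)  # [n, n]
        # Mask non-edges, then softmax over each node's neighborhood (softmax_j)
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=-1)       # [n, n] attention weights
        # Attention-weighted sum of neighbor features
        return alpha @ z                       # [n, out_dim]
```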
2.2. Multitask Learning

Multitask learning (MTL) is an approach in machine learning where a model is trained simultaneously on several tasks. MTL has been extensively explored in recommendation as a way to improve key metrics [29, 30, 31, 32]. The core idea behind multitask learning is to improve the robustness of the model by leveraging the domain-specific information contained in the training signals of related tasks [33, 34]. Hard parameter sharing, one of the most fundamental forms of MTL, uses a shared representation which then branches into multiple heads capable of learning task-specific information [35, 36, 37], as sketched below.
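A minimal sketch of hard parameter sharing (the encoder and task names here are placeholders, not our production model): one shared encoder feeds a lightweight head per task.

```python
import torch
import torch.nn as nn

class HardSharingModel(nn.Module):
    """Hard parameter sharing: one shared encoder, one small head per task."""
    def __init__(self, encoder: nn.Module, hidden_dim: int, task_dims: dict[str, int]):
        super().__init__()
        self.encoder = encoder                    # parameters shared across all tasks
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, out_dim)  # task-specific parameters
            for task, out_dim in task_dims.items()
        })

    def forward(self, x: torch.Tensor) -> dict[str, torch.Tensor]:
        shared = self.encoder(x)                  # shared representation
        return {task: head(shared) for task, head in self.heads.items()}
```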
For graph-aware EBR in particular, self-supervised multitask learning (SSMTL) has been proposed as a new approach to MTL, optimizing the embeddings directly to achieve desirable embedding properties without the use of positive or negative labels. In this setting, we combine several self-supervised learning (SSL) methods with a downstream retrieval task to learn both direct and indirect embedding features. Recent work [19] has shown that SSMTL can lead to improved task generalization and embedding quality on several academic benchmarks through the increasingly robust training objective. However, many of the SSL approaches used rely on the assumption that global graph information can be inferred from the graph structure. This does not hold in the large-scale recommendation setting, where graphs are constrained to some 𝐾-hop neighborhood around a query user in order to fit in memory. As a result, many of these SSL methods may lead to negative transfer due to conflict between the SSL task and the target link prediction task, and there remains work to be done to investigate which methods perform best in this large-scale setting.

3. Self-Supervised Multitask Learning for EBR

In the following sections, we describe the details of the SSL methods used in our SSMTL approach and our experimental setup and results, highlighting the benefits and impact of including SSMTL-based embeddings in EBR for large-scale industrial recommendation systems.

3.1. Self-Supervised Learning Methods

We identify two self-supervised learning approaches that are scalable and lead to improvements in the large-scale recommendation setting through a more robust training objective.

Canonical Correlation Analysis. Based on work from [21], Canonical Correlation Analysis (CCA) deploys a non-contrastive, non-discriminative SSL method to train the GNN. The self-supervised training objective is described in Equation 3. First, given a subgraph with 𝑛 nodes, two augmented views of the subgraph are created and fed through the GNN, producing Z_A and Z_B, where Z_A, Z_B ∈ ℝ^{𝑛×π‘˜}. Each of these embeddings is fed through a task-specific head and then normalized so that each feature has mean 0 and standard deviation 1/βˆšπ‘›, resulting in ZΜƒ_A and ZΜƒ_B. The loss is then computed from Equation 3: the first term seeks to minimize the distance between the same nodes in the two views, while the second term enforces that the feature-wise covariance of all nodes equals the identity matrix.

\[
\mathcal{L}_{\mathrm{CCA}} = \left\|\tilde{\mathbf{Z}}_A - \tilde{\mathbf{Z}}_B\right\|_F^2 + \lambda\left(\left\|\tilde{\mathbf{Z}}_A^{\top}\tilde{\mathbf{Z}}_A - \mathbf{I}\right\|_F^2 + \left\|\tilde{\mathbf{Z}}_B^{\top}\tilde{\mathbf{Z}}_B - \mathbf{I}\right\|_F^2\right) \tag{3}
\]
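A minimal PyTorch rendering of Equation 3, assuming the two views have already been encoded and passed through the task-specific head; the default Ξ» is illustrative.

```python
import torch

def cca_ssg_loss(z_a: torch.Tensor, z_b: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    """Equation 3: invariance term plus feature-decorrelation terms.
    z_a, z_b: [n, k] embeddings of the two augmented views from the task head."""
    n = z_a.size(0)
    # Standardize each feature to mean 0 and standard deviation 1/sqrt(n)
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) * n ** 0.5)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) * n ** 0.5)
    eye = torch.eye(z_a.size(1), device=z_a.device)
    invariance = (z_a - z_b).pow(2).sum()                 # ||Z_A - Z_B||_F^2
    decorrelation = ((z_a.T @ z_a - eye).pow(2).sum()
                     + (z_b.T @ z_b - eye).pow(2).sum())  # ||Z^T Z - I||_F^2 terms
    return invariance + lam * decorrelation
```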
Masked Autoencoders. Based on work from [22], this approach leverages a graph masked autoencoder (MAE) that focuses on feature reconstruction. First, an augmented view of the subgraph is created and the features of the query users are masked out. This augmented graph is then fed through the GNN and a task-specific head. The features of the query users are then re-masked and passed through a graph convolution layer. As described in Equation 4, over the set of masked nodes 𝒱, the final loss is the average of the scaled cosine error between the original features X and the generated features Z. This approach relies only on the local neighborhood surrounding the query node, making it a good option for large-scale SSMTL.

\[
\mathcal{L}_{\mathrm{MAE}} = \frac{1}{|\mathcal{V}|}\sum_{v_i \in \mathcal{V}}\left(1 - \frac{\mathbf{x}_i^{\top}\mathbf{z}_i}{\|\mathbf{x}_i\| \cdot \|\mathbf{z}_i\|}\right)^{y}, \quad y \ge 1 \tag{4}
\]
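Equation 4 reduces to a few lines of PyTorch; here x and z are assumed to hold the original and reconstructed features of the masked query nodes, and the scaling exponent y = 2 is illustrative.

```python
import torch
import torch.nn.functional as F

def scaled_cosine_error(x: torch.Tensor, z: torch.Tensor, y: float = 2.0) -> torch.Tensor:
    """Equation 4: average of (1 - cos(x_i, z_i))^y over the masked nodes.
    x, z: [m, d] original and reconstructed features of the m masked nodes."""
    cos = F.cosine_similarity(x, z, dim=-1)  # [m] per-node cosine similarity
    return ((1.0 - cos) ** y).mean()
```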
We note that both of these approaches utilize non-contrastive methods. While experimenting with different SSL tasks, we found that contrastive SSL approaches do not perform well in the production setting due to their assumption that global information is readily available in the original and augmented graphs. This is not necessarily true for large-scale recommendation, where subgraphs are constrained to the K-hop neighborhood surrounding each query node.

3.2. Experimental Setup

3.2.1. Problem Breakdown

We evaluate SSMTL as a robust training objective on an industrial friend recommendation system with hundreds of millions of users and connections. To handle this scale of training, we sample subgraphs containing the π‘˜-hop neighborhood around each query user, as sketched below. Following training, the embeddings for EBR can be generated via propagation through the encoder backbone.
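As a rough sketch of this sampling step (the production sampler is not described here, so this pure-Python BFS is only illustrative):

```python
from collections import deque

def k_hop_neighborhood(adj: dict[int, list[int]], query: int, k: int) -> set[int]:
    """Collect all nodes within k hops of a query user via breadth-first search.
    adj: adjacency list of the (possibly down-sampled) friendship graph."""
    visited = {query}
    frontier = deque([(query, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # stop expanding beyond k hops
        for nbr in adj.get(node, []):
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, depth + 1))
    return visited
```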
                                                                   in the number of friendships made for both high and low
3.2.2. Retrieval Baseline                                          degree users.

The baseline model uses a supervised single-task setup for
embedding-based retrieval. We use a GAT as the GNN en-             References
coder backbone to obtain embeddings for the query user and
each candidate, producing a candidate embedding matrix z.           [1] Y. Li, K. Liu, R. Satapathy, S. Wang, E. Cambria, Re-
We can then compute the dot product between the query                   cent developments in recommender systems: A survey,
user and each candidate and apply Softmax to generate the               2023. arXiv:2306.12680.
logits. We then calculate the Categorical Cross Entropy             [2] A. Sun, Y. Peng, A survey on modern recommendation
Loss with the true labels y across the 𝑁 = 2 classes and 𝑀              system based on big data, 2024. arXiv:2206.02631.
candidates, outlined in Equation 5.                                 [3] P. Covington, J. Adams, E. Sargin, Deep neural
                                                                        networks for youtube recommendations, in: Pro-
                                                                        ceedings of the 10th ACM Conference on Recom-
                        𝑁 βˆ‘οΈπ‘€
                                      (οΈƒ           )οΈƒ
                       βˆ‘οΈ                 𝑒𝑧𝑖𝑗
        β„’retrieval = βˆ’         𝑦𝑖𝑗 log βˆ‘οΈ€π‘€               (5)            mender Systems, RecSys ’16, Association for Comput-
                                               π‘§π‘–π‘˜
                       𝑖=1 𝑗=1           π‘˜=1 𝑒                          ing Machinery, New York, NY, USA, 2016, p. 191–198.
                                                                        URL: https://doi.org/10.1145/2959100.2959190. doi:10.
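A sketch of the baseline objective in Equation 5, assuming the logits are dot products between each query embedding and its M candidate embeddings, with one true candidate per query.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(query_emb: torch.Tensor, cand_emb: torch.Tensor,
                   labels: torch.Tensor) -> torch.Tensor:
    """Equation 5: categorical cross entropy over each query's M candidates.
    query_emb: [N, d]; cand_emb: [N, M, d]; labels: [N] index of the true candidate."""
    logits = torch.einsum("nd,nmd->nm", query_emb, cand_emb)  # dot-product scores
    return F.cross_entropy(logits, labels)                    # softmax + NLL in one step
```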
3.2.3. SSMTL Implementation Details

In our SSMTL approach, we use both CCA and MAE in combination with the retrieval baseline as the training objectives. All three methods share the same GAT GNN backbone. The augmented views for CCA and MAE are created separately, with CCA performing edge- and feature-drop augmentations while MAE performs edge drop and query node masking. The task-specific head for CCA is a Linear-ReLU-Linear block, while the task-specific head for MAE is a single linear layer. The final loss with SSMTL is a weighted sum of the losses:

\[
\mathcal{L}_{\mathrm{combined}} = \alpha\,\mathcal{L}_{\mathrm{retrieval}} + \beta\,\mathcal{L}_{\mathrm{CCA}} + \gamma\,\mathcal{L}_{\mathrm{MAE}} \tag{6}
\]

where 𝛼 is the weight of the retrieval loss, 𝛽 is the weight of the CCA loss, and 𝛾 is the weight of the MAE loss. In practice, we observed the best performance when the retrieval weight was several orders of magnitude larger than the other loss weights.
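Using the loss sketches above, the combined objective of Equation 6 is a straightforward weighted sum; the default weights shown are placeholders, not the values used in production.

```python
import torch

def combined_loss(l_retrieval: torch.Tensor, l_cca: torch.Tensor, l_mae: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1e-3, gamma: float = 1e-3) -> torch.Tensor:
    """Equation 6: weighted sum of the retrieval, CCA, and MAE losses.
    Default weights are placeholders; in practice the retrieval weight
    dominated the SSL weights by several orders of magnitude."""
    return alpha * l_retrieval + beta * l_cca + gamma * l_mae
```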
3.3. Results

We evaluated the effectiveness of SSMTL for end-to-end friend recommendation with online A/B testing. The control group used candidates retrieved from the production model trained with the retrieval baseline, while the treatment group used candidates retrieved with the new robust training objective in the SSMTL setting, specifically combining the previous retrieval loss with the whitening decorrelation and generative reconstruction objectives.

In the A/B experimental results, we saw statistically significant improvements across several friend recommendation metrics. Specifically, we observed up to 5.45% improvements in new friends made and +1.91% in new friends made with low-degree users across various markets. Overall, these results show that SSMTL is able to provide improved recommendation compared with the single-task setting, in particular helping with candidate generation for low-degree users.

4. Conclusion

In this paper, we evaluate the effectiveness of a robust self-supervised multitask learning objective in embedding-based retrieval. Through online evaluation, we demonstrate that self-supervised methods used in a multitask setting are able to augment the performance of the underlying retrieval task at the scale of over 800 million nodes and edges, providing complementary yet disjoint information that enhances embedding quality. We observe statistically significant gains in the number of friendships made for both high- and low-degree users.

References

[1] Y. Li, K. Liu, R. Satapathy, S. Wang, E. Cambria, Recent developments in recommender systems: A survey, 2023. arXiv:2306.12680.
[2] A. Sun, Y. Peng, A survey on modern recommendation system based on big data, 2024. arXiv:2206.02631.
[3] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 191–198. URL: https://doi.org/10.1145/2959100.2959190. doi:10.1145/2959100.2959190.
[4] J. Huang, A. Sharma, S. Sun, L. Xia, D. Zhang, P. Pronin, J. Padmanabhan, G. Ottaviano, L. Yang, Embedding-based retrieval in facebook search, CoRR abs/2006.11632 (2020). URL: https://arxiv.org/abs/2006.11632. arXiv:2006.11632.
[5] Y. Gan, Y. Ge, C. Zhou, S. Su, Z. Xu, X. Xu, Q. Hui, X. Chen, Y. Wang, Y. Shan, Binary embedding-based retrieval at tencent, 2023. arXiv:2302.08714.
[6] J. Johnson, M. Douze, H. JΓ©gou, Billion-scale similarity search with gpus, CoRR abs/1702.08734 (2017). URL: http://arxiv.org/abs/1702.08734. arXiv:1702.08734.
[7] Y. A. Malkov, D. A. Yashunin, Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs, CoRR abs/1603.09320 (2016). URL: http://arxiv.org/abs/1603.09320. arXiv:1603.09320.
[8] Y. Zhang, X. Dong, W. Ding, B. Li, P. Jiang, K. Gai, Divide and conquer: Towards better embedding-based retrieval for recommender systems from a multi-task perspective, 2023. arXiv:2302.02657.
[9] G. Linden, B. Smith, J. York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Computing 7 (2003) 76–80. doi:10.1109/MIC.2003.1167344.
[10] R. Jha, S. Subramaniyam, E. Benjamin, T. Taula, Unified embedding based personalized retrieval in etsy search, 2023. arXiv:2306.04833.
[11] R. Peng, K. Liu, P. Yang, Z. Yuan, S. Li, Embedding-based retrieval with llm for effective agriculture information extracting from unstructured data, 2023. arXiv:2308.03107.
[12] J. Shi, V. Chaurasiya, Y. Liu, S. Vij, Y. Wu, S. Kanduri, N. Shah, P. Yu, N. Srivastava, L. Shi, G. Venkataraman, J. Yu, Embedding based retrieval in friend recommendation, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 3330–3334. URL: https://doi.org/10.1145/3539618.3591848. doi:10.1145/3539618.3591848.
[13] R. Ying, R. He, K. Chen, P. Eksombatchai, W. L. Hamilton, J. Leskovec, Graph convolutional neural networks for web-scale recommender systems, CoRR abs/1806.01973 (2018). URL: http://arxiv.org/abs/1806.01973. arXiv:1806.01973.
[14] P. P.-H. Kung, Z. Fan, T. Zhao, Y. Liu, Z. Lai, J. Shi, Y. Wu, J. Yu, N. Shah, G. Venkataraman, Improving embedding-based retrieval in friend recommendation with ann query expansion, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2930–2934.
[15] C. Li, X. Peng, Y. Niu, S. Zhang, H. Peng, C. Zhou, J. Li, Learning graph attention-aware knowledge graph embedding, Neurocomputing 461 (2021) 516–529. URL: https://www.sciencedirect.com/science/article/pii/S0925231221010961. doi:10.1016/j.neucom.2021.01.139.
[16] A. v. d. Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, arXiv preprint arXiv:1807.03748 (2018).
[17] K. He, X. Chen, S. Xie, Y. Li, P. DollΓ‘r, R. Girshick, Masked autoencoders are scalable vision learners, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
[18] A. Ermolov, A. Siarohin, E. Sangineto, N. Sebe, Whitening for self-supervised representation learning, in: International Conference on Machine Learning, PMLR, 2021, pp. 3015–3024.
[19] M. Ju, T. Zhao, Q. Wen, W. Yu, N. Shah, Y. Ye, C. Zhang, Multi-task self-supervised graph neural networks enable stronger task generalization, in: The Eleventh International Conference on Learning Representations, 2023. URL: https://openreview.net/forum?id=1tHAZRqftM.
[20] L. Torrey, J. Shavlik, Transfer Learning, IGI Global, 2010, pp. 242–264.
[21] H. Zhang, Q. Wu, J. Yan, D. Wipf, P. S. Yu, From canonical correlation analysis to self-supervised graph neural networks, CoRR abs/2106.12484 (2021). URL: https://arxiv.org/abs/2106.12484. arXiv:2106.12484.
[22] Z. Hou, X. Liu, Y. Cen, Y. Dong, H. Yang, C. Wang, J. Tang, Graphmae: Self-supervised masked graph autoencoders, 2022. arXiv:2205.10803.
[23] P. Covington, J. Adams, E. Sargin, Deep neural networks for youtube recommendations, in: Proceedings of the 10th ACM Conference on Recommender Systems, New York, NY, USA, 2016.
[24] T. Koh, G. Wu, M. Mi, Manas hnsw realtime: Powering realtime embedding-based retrieval, 2021.
[25] M. E. J. Newman, Clustering and preferential attachment in growing networks, Physical Review E 64 (2001). URL: http://dx.doi.org/10.1103/PhysRevE.64.025102. doi:10.1103/physreve.64.025102.
[26] J. Zhou, G. Cui, Z. Zhang, C. Yang, Z. Liu, M. Sun, Graph neural networks: A review of methods and applications, CoRR abs/1812.08434 (2018). URL: http://arxiv.org/abs/1812.08434. arXiv:1812.08434.
[27] P. VeličkoviΔ‡, G. Cucurull, A. Casanova, A. Romero, P. LiΓ², Y. Bengio, Graph attention networks, 2018. arXiv:1710.10903.
[28] Z. Liu, Y. Ma, Y. Ouyang, Z. Xiong, Contrastive learning for recommender system, CoRR abs/2101.01317 (2021). URL: https://arxiv.org/abs/2101.01317. arXiv:2101.01317.
[29] Y. Lu, R. Dong, B. Smyth, Why i like it: multi-task learning for recommendation and explanation, 2018, pp. 4–12. doi:10.1145/3240323.3240365.
[30] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, E. H. Chi, Modeling task relationships in multi-task learning with multi-gate mixture-of-experts, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1930–1939. URL: https://doi.org/10.1145/3219819.3220007. doi:10.1145/3219819.3220007.
[31] X. Ma, L. Zhao, G. Huang, Z. Wang, Z. Hu, X. Zhu, K. Gai, Entire space multi-task model: An effective approach for estimating post-click conversion rate, 2018. arXiv:1804.07931.
[32] H. Tang, J. Liu, M. Zhao, X. Gong, Progressive layered extraction (ple): A novel multi-task learning (mtl) model for personalized recommendations, in: Proceedings of the 14th ACM Conference on Recommender Systems, RecSys '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 269–278. URL: https://doi.org/10.1145/3383313.3412236. doi:10.1145/3383313.3412236.
[33] A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in: B. SchΓΆlkopf, J. Platt, T. Hoffman (Eds.), Advances in Neural Information Processing Systems, volume 19, MIT Press, 2006. URL: https://proceedings.neurips.cc/paper_files/paper/2006/file/0afa92fc0f8a9cf051bf2961b06ac56b-Paper.pdf.
[34] R. Caruana, Multitask learning, Machine Learning 28 (1997) 41–75.
[35] P. Guo, C.-Y. Lee, D. Ulbricht, Learning to branch for multi-task learning, 2020. arXiv:2006.01895.
[36] X. Sun, R. Panda, R. S. Feris, Adashare: Learning what to share for efficient deep multi-task learning, CoRR abs/1911.12423 (2019). URL: http://arxiv.org/abs/1911.12423. arXiv:1911.12423.
[37] S. Vandenhende, S. Georgoulis, B. D. Brabandere, L. V. Gool, Branched multi-task networks: Deciding what layers to share, 2020. arXiv:1904.02920.