Partial Relaxed Optimal Transport for Denoised Recommendation Yanchao Tan1,2 , Carl Yang3,* , Xiangyu Wei4 , Ziyue Wu5 , Weiming Liu4,† and Xiaolin Zheng4 1 The College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China 2 Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, China 3 The Department of Computer Science, Emory University, Atlanta 30322, United State 4 The College of Computer Science, Zhejiang University, Hangzhou 310027, China 5 The School of Management, Zhejiang University, Hangzhou 310058, China Abstract The interaction data used by recommender systems (RSs) inevitably include noises resulting from mistaken or exploratory clicks, especially under implicit feedbacks. Without proper denoising, RS models cannot effectively capture users’ intrinsic preferences and the true interactions between users and items. To address such noises, existing methods mostly rely on auxiliary data which are not always available. In this work, we ground on Optimal Transport (OT) to globally match a user embedding space and an item embedding space, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions without supervision. Specifically, we firstly leverage the OT framework via Sinkhorn distance to compute the continuous many-to-many user-item matching scores. Then, we relax the regularization in Sinkhorn distance to achieve a closed-form solution with a reduced time complexity. Finally, to consider individual user behaviors for denoising, we develop a partial OT framework to adaptively relabel user-item interactions through a personalized thresholding mechanism. Extensive experiments show that our framework can significantly boost the performances of existing RS models. 1. Introduction interactions for more accurate recommendations? In this work, we approach the denoised recommen- With the rapid growth of various activities on the Web, dation problem in two steps: 1. distinguishing intrinsic recommender systems (RSs) become fundamental in help- and noisy interactions; 2. stressing intrinsic preferences ing users alleviate the problem of information overload. and reducing noisy ones for recommendation. Inspired However, users can click some items by mistake or out by a least modeling effort principle in an unsupervised of curiosity, and many RSs will also recommend some fashion [4, 5], we refer to noises as minor abnormal data less relevant items for exploration every now and then. that needs higher modeling effort than the intrinsic ones. Take movie watching for example. Since a user cannot Based on the above rationale, we propose to distin- always distinguish horrible movies from romantic ones guish noises by ranking the global matching cost between by movie names, he/she can easily click a horrible movie the user and item embedding spaces, and flexibly inte- by mistake among the many romantic movies he/she usu- grate it with both non-deep and deep RS methods. In ally watches. In this case, a traditional method like CF this way, the key to distinguishing noises boils down cannot effectively reduce the probability of recommend- to finding a global matching matrix with the minimum ing a horrible movie unless the noisy clicks are extremely matching cost. Subsequently, we advocate a principled rare, which may further seduce even more noisy clicks. denoised RS framework and calculate the matching ma- Recently, to get out of such mistakes, existing research trix grounding on Optimal Transport (OT), which has discovers users’ intrinsic preferences with the help of been introduced to address the unsupervised matching external user behaviors [1], auxiliary item features [2] or problem between two distributions or spaces in many extra feedbacks [3]. A key limitation is their reliance on fields [6, 7]. Through minimizing the overall cost of the auxiliary data, which costs significant effort and is not mismatched user-item pairs, OT finds a global optimal always attainable in RSs. Without supervision, can we matching matrix, whose values represent the matching develop a framework to automatically denoise user-item cost (i.e., modeling effort). Specifically, the low user-item DL4SR’22: Workshop on Deep Learning for Search and Recommen- matching cost indicates a user’s intrinsic preferences while dation, co-located with the 31st ACM International Conference on the high value indicates noisy ones. Information and Knowledge Management (CIKM), October 17-21, 2022, However, to apply OT to achieve denoised recommen- Atlanta, USA dation, three problems still remain: 1. Unlike the one-hot * Corresponding author. constraint in previous applications of OT, the nature of $ yctan@fzu.edu.cn (Y. Tan); j.carlyang@emory.edu (C. Yang); weixy@zju.edu.cn (X. Wei); wuziyue@zju.edu.cn (Z. Wu); RSs requires a many-to-many matching; 2. The large 21831010@zju.edu.cn (W. Liu); xlzheng@zju.edu.cn (X. Zheng) number of users and items makes the matching process © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). time-consuming; 3. A mechanism is needed to properly CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) leverage the learned matching matrix and finally elimi- Optimal transport (OT) theory with guaranteed optimal- nate the effect of noises for denoised recommendation. ity in the global matching and can be flexibly integrated To address these challenges, we first leverage the OT with both deep and non-deep RS methods. framework via Sinkhorn distance [8]. Unlike a one-hot OT aims to find a matching between two distributions matching matrix that satisfies the requirements of the or spaces, which has been most widely used in transfer many widely studied tasks (e.g., transfer learning [6, 9]), learning. It can match one instance in the source domain we compute a continuous many-to-many approxima- to another in the target domain with the least cost. Ex- tion to the discrete OT, so as to meet the nature of RSs. isting works mainly study discrete OT for one-to-one Furthermore, we propose to relax the regularization in matching. For example, in computer vision, OT has been Sinkhorn to achieve a closed-form solution, which can applied to domain adaption [4, 6] and style transfer [16]. reduce the time complexity for large-scale data from OT has also been successfully applied in many natural 𝒪(max(𝑀, 𝑁 )3 ) to 𝒪(max(𝑀, 𝑁 )2 ) . Then, we rank language processing tasks, such as topic modeling [7, 17], the learned continuous matching cost to differentiate in- text generation [18], and sequence-to-sequence learning trinsic and noisy interactions for each user. Finally, we [19]. Considering the effect of non-matched pairs, [9] take into account that individual users’ behaviors can proposed a partial one-to-one matching that leverages vary in RSs, which should be handled differently during one shared threshold to address domain shift for domain denoising. To this end, we design a personalized thresh- adaption instead of taking all pairs into account. olding mechanism for relabeling user-item interactions, However, the above discrete one-to-one matching can- which efficiently finds a splitting point for each individual not satisfy the nature of many-to-many user-item re- user according to the ranked matching cost. lationships in RSs, where each user can interact with Experiment results on one real-world dataset with syn- many items and each item can interact with many users. thesized noises demonstrate our proposed ProRec’s abil- Though OT is applied into cold-start recommendation ity of denoising under various levels of noises. Further problem in [20], they directly measure user-item similar- experiments on three original real-world datasets also ity without taking the effect of noises into consideration. verify the advantages of ProRec, which demonstrate its Moreover, partial matching with a shared threshold can- significant performance boosts upon multiple popular RS not reflect the individual user behaviors, and thus leads models, in terms of effectiveness and efficiency. to suboptimal recommendation performances. 2. Related work 3. Methodology Many research studies have pointed out that there exist In this section, we detail the proposed Partial relaxed a large proportion of noisy interactions in the implicit optimal Transport for the denoised Recommendation feedbacks (i.e., the clicks) [10, 1, 11]. These feedbacks are (ProRec) framework, as shown in Figure 1. Specifically, easily affected by different factors, such as the position we first formulate the denoised recommendation as a two- bias [12], caption bias [13] and exposure bias [10, 14]. step process. Then, we introduce the Optimal Transport Therefore, there exist large gaps between the implicit (OT) framework for distinguishing noises through global feedbacks and the actual user preferences due to various user-item matching. Furthermore, we elaborate on our reasons, which are detrimental to the users’ continuous proposed ProRec framework, which can be applied to behaviors and overall satisfaction [2, 5]. achieve denoised recommendations. To alleviate the above issues, many researchers have considered incorporating auxiliary feedbacks to the iden- 3.1. Problem Statement tification of noises in the implicit feedbacks, such as dwell time [1] and scroll intervals [13]. These works usually Following the setting of recommendation, we denote design features manually and label items with external the user-item interaction matrix as 𝑋 ∈ {0, 1}𝑀 ×𝑁 , help (e.g., from domain experts) [13, 2], which require where 𝑀 denotes the number of users and 𝑁 denotes the extensive effort and cost, which are not always available number of items. We construct user-item relationships across RSs. To this end, sheer observations on training across a user embedding space 𝒟𝑢 and an item embedding losses have recently been explored to denoise the implicit space 𝒟𝑣 . The problem is that user-item pairs contain feedbacks during training in an unsupervised fashion [5]. some noisy interactions which need to be relabeled. The method is purely heuristic and can be integrated In this work, we approach denoised recommendation with deep learning methods only. Based on a principle through two steps. Firstly, we distinguish intrinsic and of the least modeling effort in different fields [4, 15], our noisy interactions across 𝒟𝑢 and 𝒟𝑣 by a global match- proposed framework aims to denoise the user-item in- ing matrix 𝜋 * . Then, to stress users’ intrinsic preferences teractions without supervision, which is supported by and reduce noises, we rank the learned 𝜋 * and keep only 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 A 1 1 1 0 1 0 0.077 0.077 0.087 0.070 0.054 0.034 3 1 2 4 5 6 1 1 1 0 0 0 B 1 1 0 1 1 1 0.070 0.070 0.050 0.077 0.066 0.069 4 1 2 6 5 3 1 1 0 1 1 1 C 1 1 0 0 0 0 0.046 0.046 0.026 0.046 0.017 0.018 1 2 4 3 6 5 1 1 0 0 0 0 ∗ Interaction Matrix X Matching Matrix " Personalized threshold ! Relabeled Matrix R Intrinsic preference Relabeled data ProRec Figure 1: A toy example illustrating the denoised recommendation based on our ProRec framework. the top 𝜅 interactions of each user via a personalized equal to 1. 𝑀 is the number of users and 𝑁 is the number thresholding mechanism. In this way, we learn a rela- of items. 𝜋 * denote the optimal transport plan. beled matrix 𝑅 ∈ 𝑅𝑀 ×𝑁 for denoised recommendation. Specifically, the matching cost matrix ℳ directly re- flects the modeling effort, and thus the low user-item 3.2. Global Matching for Distinguishing matching cost can indicate users’ intrinsic preferences while the high values can indicate noises. Based on the Noises definition of ℳ, we can integrate the denoising process As motivated in Section 1, a framework that can denoise with both non-deep and deep RS models as follows: user-item interactions for more accurate recommenda- ℒ(𝑈 , 𝑉 ) = ||𝑋 − 𝑈 𝑉 𝑇 ||2𝐹 + 𝜁(||𝑈 ||22 + ||𝑉 ||22 ), (2) tion is in great need. However, without supervision, it is ℒ(𝑈 , 𝑉 ) = 𝒞(𝑋, 𝐺𝑦 (𝑈 , 𝑉 )) + 𝜁(||𝑈 ||22 + ||𝑉 ||22 ), hard to define noisy interactions and then stress intrinsic (3) preferences together while reducing noisy ones. where 𝑈 ∈ R𝑀 ×𝑑 and 𝑉 ∈ R𝑁 ×𝑑 represent the user To distinguish noises in an unsupervised fashion, sev- and item embeddings. 𝐺𝑦 denotes a classification net- eral works [4, 15, 21, 5] have pointed out a principled work. The hyperparameter 𝜁 is to balance the loss and the way to locate noises by finding the data that needs the regularization term. In non-deep learning models, Eq. 2 most modeling effort or cost. Inspired by this principle can be solved effectively through many methods (e.g., Al- of least modeling effort, in this work, we propose to refer ternative Least Square). Since the prediction scores rep- to the noises as minor abnormal data. Compared with resent the user-item similarities where ℳ has negative the interactions from intrinsic preferences, noises take values, we calculate the cost matrix as ℳ = −𝑈 𝑉 𝑇 much effort or cost to model. by using inner product. In deep learning models, we Based on the above least effort rationale, we propose can optimize Eq. 3 by adopting the widely used binary a novel denoising framework to distinguish noises by cross-entropy loss (i.e., 𝒞(·, ·)). Then, similar calcula- ranking the global matching between user and item em- tions can be made for those deep learning ones, where bedding spaces. The framework is served as a plug-in, so ℳ = −𝐺𝑦 (𝑈 , 𝑉 ). as to be flexibly integrated to both non-deep and deep However, there are three technical challenges when RS methods. In this scenario, the key to distinguishing applying OT to denoised recommendation: noises boils down to finding a global user-item matching Challenge 1. In RSs, each user can interact with many across the user space 𝒟𝑢 and the item space 𝒟𝑣 with items and each item interact with many users, which cor- minimal matching cost. responds to a many-to-many interaction matrix. Unlike To address the unsupervised matching problem be- previous applications of OT in CV or NLP, directly apply- tween two spaces, we advocate a principled denoised RS ing the one-hot constraint specified in Eq. 1 is counter- framework and ground on Optimal Transport (OT), which intuitive for recommendation. has been successfully applied in many fields such as CV Challenge 2. In RSs, the number of users and items in [6] and NLP [7]. As studied theoretically in [22], the the real world tends to be large, making the matching OT formulation derived from the Earth Movers Distance process time-consuming. (EMD) can be expressed as Challenge 3. Besides distinguishing noises, it is nec- 𝑀 ∑︁ ∑︁ 𝑁 essary to design a mechanism to eliminate the effect min ℳ𝑖𝑗 𝜋𝑖𝑗 , of noisy interactions. A naïve way is to set one global 𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 ) 𝑖=1 𝑗=1 (1) thresholding hyperparameter 𝜅 to keep the top 𝜅 items 1 1 with the least matching cost for each user. However, the 𝑠.𝑡. 1𝑀 𝜋 = 1𝑁 , 1𝑁 𝜋 𝑇 = 1𝑀 , 𝑁 𝑀 global thresholding hyperparameter cannot accommo- where 𝒳 (𝒟𝑢 , 𝒟𝑣 ) denotes the joint probability distribu- date individual user behaviors. tion of user embedding space 𝒟𝑢 and item embedding space 𝒟𝑣 . 1𝑀 ∈ R1×𝑀 , 1𝑁 ∈ R1×𝑁 with all elements 3.3. The ProRec Framework Proposition 1. The constraint of problem in Eq. 4 is re- −ℳ In this section, we propose a Partial relaxed optimal laxed by Eq.5 and Eq. 6. By defining ℳ̃ = 𝑒 𝛾 , the Transport for denoised Recommendation (ProRec) frame- closed-form solution for the global optimal matching ma- work to address the above three challenges. trix is: * 𝑇 Many-to-many matching to meet the nature of RSs. 𝜋 = max(ℳ̃ diag(𝑝⊘1𝑀 ℳ̃), diag(𝑞⊘1𝑁 ℳ̃ )ℳ̃), To address Challenge 1, we look for a smoother version of (7) OT by lowering the sparsity and increasing the entropy of where ⊘ corresponds to the element-wise division. the matching matrix [4]. Among the many OT algorithms Proof. Based on Eq. 6 and using the Lagrange multiplier (e.g., ADMM and Proximal Point Method [23, 24]), we method we have: choose the Sinkhorm algorithm [8] due to its simplicity 𝑀 ∑︁𝑁 and efficiency. We achieve OT with a many-to-many ∑︁ ℒ= ℳ𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋) − (1𝑁 𝜋 𝑇 − 𝑞)𝑓 𝑇 , (8) matching through the Sinkhorn algorithm as follows: 𝑖=1 𝑗=1 𝑀 ∑︁𝑁 ∑︁ where 𝑓 𝑇 is Lagrangian multipliers. min ℳ𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋) Then, in order to find the optimal solution, we let the 𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 ) 𝑖=1 𝑗=1 (4) gradient 𝜕𝜋 𝜕ℒ 𝑖𝑗 = 0, which means: 𝑠.𝑡., 1𝑀 𝜋 = 𝑝, 1𝑁 𝜋 𝑇 = 𝑞, 𝜕ℒ = ℳ𝑖𝑗 + 𝛾 log 𝜋𝑖𝑗 − 𝑓𝑖 = 0, (9) where 𝐻(𝜋) = 𝑀 ∑︀ ∑︀𝑁 𝑖=1 𝑗=1 𝜋𝑖𝑗 (log(𝜋𝑖𝑗 ) − 1). 1𝑀 ∈ 𝜕𝜋𝑖𝑗 R1×𝑀 , 1𝑁 ∈ R1×𝑁 are with all elements equal to 1. 𝛾 is In this case, we can obtain the solution of 𝜋𝑖𝑗 as a hyperparameter to balance the entropy regularization 𝑓𝑖 −ℳ𝑖𝑗 𝜋𝑖𝑗 = 𝑒 𝛾 . (10) and OT. Compared with most studies of discrete OT in CV, the The users and items that have different interactions, major advantages of the Sinkhorn algorithm above are which means both users and items are based on differ- as follows: i) By relaxing the discrete constraint, we can ent degrees of popularity. To take such popularity into obtain a continuous matrix instead of a one-hot one. In consideration, we count and norm the number of items this way, the obtained matching matrix can capture the that have interacted with user 𝑖. Based on the popularity many-to-many relationships between users and items. of user 𝑖 ( i.e., ∑︀𝑞𝑁𝑖 ), we add constraints of the matching ii) By adding an entropy regularization, we address the matrix 𝜋 by 𝑗=1 𝜋𝑖𝑗 = 𝑞𝑖 . Then, we have sparsity problem of 𝜋 * . The dense cost matrix can be 𝑁 ∑︁ 𝑓𝑖 −ℳ𝑖𝑗 used for subsequent ranking. iii) By modifying the distri- 𝑒 𝛾 = 𝑞𝑖 . (11) bution constraint of 𝜋 from uniform (i.e., 1𝑀 𝜋 = 𝑁1 1𝑁 ) 𝑗=1 to popularity-based (i.e., 1𝑀 𝜋 = 𝑝), we model non- Since the Lagrange multiplier 𝑓𝑖 is not related to 𝑗, we uniform distributions where users and items have differ- can extract this part and rewrite the formula as: ent number of interactions, and successfully adapt the 𝑁 1 ∑︁ − ℳ𝛾𝑖𝑗 𝑓 − 𝑖 optimization to the recommendation scenario. Specifi- 𝑒 =𝑒 𝛾 . (12) 𝑞𝑖 𝑗=1 cally, we count the interactions and then normalize them to obtain 𝑝𝑖 and 𝑞𝑗 , so 𝑁 Take the logarithm of both sides of the above equation 𝑖=1 𝑝𝑖 = 1 and ∑︀ ∑︀𝑀 𝑗=1 𝑞𝑗 = 1. Relaxed OT to reduce the time complexity. Note and move 𝛾 to the other side: that the optimal 𝜋 is dense and needs to be com- * 1 ∑︁ − ℳ𝛾𝑖𝑗 puted through multiple iteractions. Therefore, it is time- 𝑓𝑖 = −𝛾 log( 𝑒 ). (13) 𝑞𝑖 𝑗 consuming in RSs with large numbers of users and items. Inspired by the Relaxed OT for EMD [17], we extend the To keep the formula simple and clear, we define ℳ̃ = original Sinkhorn algorithm with a relaxed regularization. − ℳ 𝑒 𝛾 , so the formula above can be written as: Specifically, we use two auxiliary distances, each essen- tially is the Sinkhorn with only one of the constraints in 𝑇 𝑓 = 𝛾 log 𝑞 − 𝛾 log(1𝑁 ℳ̃ ). (14) Eq. 4: 𝑀 ∑︁ ∑︁ 𝑁 According to Eq. 10 and Eq. 12, we have ℳ𝑖𝑗 min ℳ 𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋) 𝑠.𝑡. 1𝑀 𝜋 = 𝑝, 𝑓𝑖 −ℳ𝑖𝑗 − 𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 ) 𝑒 𝛾 𝑖=1 𝑗=1 𝜋𝑖𝑗 = 𝑒 𝛾 = 𝑞𝑖 . (15) ∑︀ − ℳ𝑖𝑗 (5) 𝑒 𝛾 𝑀 ∑︁𝑁 𝑗 Finally, based on Eq. 15 and the definition of ℳ̃ = ∑︁ 𝑇 min ℳ 𝜋 𝑖𝑗 𝑖𝑗 +𝛾𝐻(𝜋) 𝑠.𝑡. 1𝑁 𝜋 = 𝑞. 𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 ) −ℳ 𝑖=1 𝑗=1 𝑒 𝛾 , we obtain the closed-form solution via a matrix (6) form as follow 𝑇 𝜋𝑉 = diag(𝑞 ⊘ 1𝑁 ℳ̃ )ℳ̃, (16) where ⊘ corresponds to element-wise division. mean value of top 𝜂 items while 𝑐2𝜂 = 𝑁 1−𝜂 𝑁 ∑︀ 𝑗=𝜂+1 𝜌𝑖𝑗 Following the same derivation, we can obtain the represents mean of the rest 𝑁 − 𝜂 items. After obtaining closed-form solution of Eq. 5 as follow the index of splitting points 𝜅, we have the correspond- 𝜋𝑈 = ℳ̃ diag(𝑝 ⊘ 1𝑀 ℳ̃). (17) ing threshold values 𝜌𝑖,𝜅𝑖 . Based on the personalized The original Sinkhorn minimizes 𝜋 under two con- threshold 𝜌𝑖,𝜅𝑖 , we can relabel by 𝑟𝑖𝑗 ∈ R as follows: straints. By relaxing one constraint, we can balance the 1 𝑟𝑖𝑗 = , (21) computation efficiency and improved preformance. To 1 + exp(−𝛽(𝜌𝑖𝑗 − 𝜌𝑖,𝜅𝑖 )) approximate the solution under the original constraints, where 𝑟𝑖𝑗 is a monotonically increasing function accord- we adopt the max operation, which is consistent with ing to the value of 𝜌𝑖𝑗 − 𝜌𝑖,𝜅𝑖 . 𝛽 is a hyperparameter to the results in previous work [16]. So we have control the inclination of the function. 𝜋 * = max(𝜋𝑈 , 𝜋𝑉 ) (18) Finally, to make the relabeling process more flexible, we propose to use a hyperparameter 𝜆 to control the This is equivalent to: 𝑇 degree of relabeling and obtain a new interaction matrix 𝜋 * = max(ℳ̃ diag(𝑝⊘1𝑀 ℳ̃), diag(𝑞⊘1𝑁 ℳ̃ )ℳ̃). as follows: (19) 𝑋 = 𝜆𝑋 + (1 − 𝜆)𝑅 ⊙ 𝑋, (22) where ⊙ corresponds to the element-wise product. Con- Note that, the closed-form solution yields a time com- sidering that the noises in sparse positive interactions plexity of 𝑂(max(𝑀, 𝑁 )2 ). This shows that by relaxing would induce worse effects in recommendation [5], in the OT formulation with auxiliary limitation from only this work, we only target the part of the original dataset one side, our model can achieve a lower time complexity where 𝑋𝑖𝑗 = 1. In this way, the newly generated 𝑋 can than the 𝑂(max(𝑀, 𝑁 )3 ) reported in [6], which uses OT keep users’ intrinsic preferences and reduce the noise for one-to-one matching. ones. We summarize the learning procedure of our pro- Personalized thresholding mechanism for denoised posed ProRec in Algorithm 1, which is proved to guaran- recommendation. To flexibly discriminate the in- tee the local convergence for denoised recommendation. trinsic and noisy user-item interactions for individual users, we propose a non-parametric personalized thresh- 4. Experiments olding mechanism. We first normalize the matching 𝜋* cost by each user as 𝜋𝑖𝑗 *′ = ∑︀ 𝑖𝑗𝜋* , and rank them In this section, we first conduct controlled experiments with synthetic noises on one dataset, so as to investi- 𝑗 𝑖𝑗 as 𝜌𝑖 = sort(𝜋𝑖*′ ). According to the definition of the gate ProRec’s performance to different levels of noisy matching matrix, each row represents users’ preferences interactions. Then, we evaluate ProRec’s performance towards the items. We define 𝜅 = {𝜅1 , . . . , 𝜅𝑀 } as the on three original real-world datasets of recommendation index of the threshold which can filter out users’ noisy with implicit feedback. preferences. 𝜅𝑖 denotes user 𝑖’s threshold. As shown in Figure 1, some users with consistent (narrow) inter- ests (i.e., user 𝐴) would benefit from a reasonably low 4.1. Experimental Settings threshold since a few items can already represent their 4.1.1. Dataset and evaluation protocols. preferences and more items can introduce more noises. Meanwhile, some users with diverse (wide) interests (i.e., We use ML-100k for the noise-controlled experiments user 𝐵) would require the threshold to be relatively high and do a 4:1 random splitting for train/test data follow- to model a broader range of preferences and satisfy the ing [26]. To simulate the noise, we randomly select need for exploration. To find a personalized splitting 5%/10%/15%/20% of ground-truth records and flip labels point 𝜅𝑖 for each user according to the learned matrix, in the train set. The selected records can be regarded the model requires an automatic splitting mechanism. as noises during training. For real data experiments, we Inspired by the threshold selection mechanism used in use Amazon music (AMusic)1 , Amazon toys (AToy)1 , and CART [25], we efficiently learn the personalized thresh- Yahoo2 . These datasets have been widely adopted in olds 𝜅𝑖 for different users by minimizing the following previous literature [27, 5, 28], and their statistics are sum- sum of square errors: marized in Table 1. We do a 5:2:3 splitting for them fol- lowing [28]. We evaluate the ability to distinguish noise [︃ 𝜂 𝑁 ]︃ ∑︁ (︀ )︀2 ∑︁ )︀2 𝜌𝑖𝑗 − 𝑐1𝜂 + 𝜌𝑖𝑗 − 𝑐2𝜂 , by Hit Ratio, and recommendation performances by nor- (︀ 𝜅𝑖 = arg min 𝜂 𝑗=1 𝑗=𝜂+1 malized discounted cumulative, mean average precision, (20) and recall. where 𝜌𝑖𝑗 is the 𝑗-th value of the∑︀ sorted 𝜌𝑖 and the index 𝜂 ∈ {1, 2, · · · , 𝑁 −1}. 𝑐1𝜂 = 𝜂1 𝜂𝑗=1 𝜌𝑖𝑗 represents the 1 https://github.com/familyld/DeepCF/tree/master/Data 2 https://webscope.sandbox.yahoo.com/catalog.php?datatype=r Algorithm 1: Partial Relaxed Optimal Transport trix Factorization, which is a deep MF framework; VAE- for Denoised Recommendation (ProRec) CF [31] – Variational Autoencoder for Collaborative Fil- Data: The noisy user-item interactions 𝑋, user tering, which is one of the state-of-the-art deep learning popularity 𝑝 and item popularity 𝑞 based recommendation algorithm [28]; LightGCN [32] – Result: Denoised user embeddings 𝑈 and item LightGCN devises a light graph convolution for training embeddings 𝑉 , a relabeled matrix 𝑅. efficiency and generation ability on recommendation sce- 1 while not converged do nario; EGLN [33] – EGLN enhances graph learning and 2 1. Update 𝑈 and 𝑉 ; node embedding modules iteratively based on mutual 3 if Learning 𝑈 and 𝑉 via non-deep methods information maximization. then Update 𝑈 and 𝑉 via Besides the above methods, T-CE [5] and SGDL [34] ||𝑋 − 𝑈 𝑉 𝑇 ||2𝐹 + 𝜁(||𝑈 ||22 + ||𝑉 ||22 in are the existing frameworks for denoised recommenda- Eq. 2 and calculate ℳ = −𝑈 𝑉 𝑇 ; tions. The first one discards the large-loss samples with 4 else Update 𝑈 and 𝑉 via a dynamic threshold in each iteration and the second one 𝒞(𝑋, 𝐺𝑦 (𝑈 , 𝑉 )) + 𝜁(||𝑈 ||22 + ||𝑉 ||22 ) in collects memorized interactions at the early stage of the Eq. 3, where 𝒞(·, ·) is binary cross-entropy training and leverages those data as denoising signals to loss; Then calculate ℳ = −𝐺𝑦 (𝑈 , 𝑉 ), guide the following training. Both of them are designed where 𝐺𝑦 is a classification network; as plug-ins for deep learning methods. We add CDAE 5 2. Update 𝑋; + T-CE and LightGCN + SGDL as baselines to compare 6 Model denoised recommendation problem by the ability of denoising with the proposed ProRec, which are shown to win the best performance in their original minimizing 𝑀 ∑︀ ∑︀𝑁 𝑖=1 𝑗=1 ℳ𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋), works [5, 34]. where 1𝑀 𝜋 = 𝑝, 1𝑁 𝜋 𝑇 = 𝑞, in Eq. 4; For all the algorithms (except for CDAE + T-CE and 7 Compute a global optimal matching matrix LightGCN + SGDL), we add the proposed ProRec on top 𝜋 * by the closed-form solution of them, allowing both non-deep and deep RS models to 𝜋 * = max(ℳ̃ diag(𝑝 ⊘ 1𝑀 ℳ̃), diag(𝑞 ⊘ discriminate intrinsic and noisy interactions in an unsu- 𝑇 1𝑁 ℳ̃ )ℳ̃) in Eq. 7; pervised fashion. Specifically, we have SVD + ProRec, 8 Compute personalized thresholds 𝜅 by NCE + ProRec, CDAE + ProRec, WRMF + ProRec, VAE- automatic splitting mechanism in Eq. 20; CF + ProRec, LightGCN + ProRec, and EGLN + ProRec, 9 Relabel user-item interaction by sorting 𝜅 in respectively. Eq. 21; 10 Update 𝑋 by 𝑋 = 𝜆𝑋 + (1 − 𝜆)𝑅 ⊙ 𝑋 in 4.1.3. Implementation details Eq. 22; 11 end Implementations of the compared baselines are from open-source projects (SVD3 , NCE3 , CDAE3 , WRMF3 , VAE-CF3 , LightGCN4 , EGLN5 , T-CE6 and SGDL7 ). We Table 1 optimize ProRec with standard Adam. We tune all hyper- Statistics of the datasets used in our experiments. parameters through grid search. In particular, learning Dataset # User # Item # Interaction Sparsity rate in {0.0005, 0.001, 0.005, 0.01}, 𝜆 in {0.25, 0.5, 0.75, ML-100k 942 1,447 55,375 0.9594 1.0}, 𝛾 in {0.05, 0.075, 0.1, 0.125, 0.15, 0.175}, 𝛽 in {1, 5, AMusic 1,776 12,929 46,087 0.9980 10, 20, 50}, 𝜁 in {0.0005, 0.001, 0.005, 0.01}, and the batch AToy 3,137 33,953 84,642 0.9992 size in {100, 250, 500, 1000} for different datasets. We Yahoo 1,948,882 46,110 48,817,561 0.9995 also carefully tuned the hyperparameters of all baselines through cross-validation to achieve their best perfor- 4.1.2. Methods for comparison mances. We compare two types of algorithms for recommenda- tion. The first type is non-deep algorithms, which include: 4.2. Controlled Experiments (Synthetic SVD [29] – A similarity-based method that constructs Noises) a similarity matrix through SVD decomposition of the implicit matrix 𝑋; NCE [28] – Noise contrastive esti- Before looking at the performance of recommendation, mation item embeddings combined with a personalized we first inspect ProRec’s ability to distinguish noises. user model based on PLRec [30]. The second type is 3 deep learning methods, including CDAE [10] – Collabo- 4 https://github.com/wuga214/NCE_Projected_LRec rative Denoising Autoencoder is a deep learning method 5 https://github.com/gusye1234/LightGCN-PyTorch https://github.com/yimutianyang/SIGIR2021-EGLN specifically optimized for implicit feedback recommen- 6 https://github.com/WenjieWWJ/DenoisingRec dation tasks; WRMF [10] – Weighted Regularized Ma- https://github.com/zealscott/SGDL 7 0.910 90.730%90.665% 0.040 0.040 0.162 0.905 0.038 0.038 90.195% 0.161 0.036 0.036 NCE+PreRec(NDCG) NCE+PreRec(NDCG) Recall 0.900 89.983% 0.034 NCE+PreRec(Recall) 0.034 NCE+PreRec(Recall) HR 0.160 NCE(NDCG) NCE(NDCG) 0.032 NCE(Recall) 0.032 NCE(Recall) 0.895 0.159 0.030 0.030 NCE NCE + ProRec 0.158 0.028 0.028 0.890 0% 5% 10% 15% 20% 0.026 0 0.026 0.25 0.5 0.75 1.0 0.050 0.075 0.100 0.125 0.150 0.175 Noise level (a) HR/Recall, controlled dataset (b) Varying 𝜆 on AMusic (c) Varying 𝛾 on AMusic Figure 2: (a) Testing HR/Recall with different noise levels on the noise-controlled dataset. (b)-(c) Hyperparameter effect on the proposed ProRec. Table 2 Comparison between the original baselines and the ones plus ProRec on three benchmark datasets. The best performances of non-deep and deep learning methods are in boldface. Statistically tested significant improvements with 𝑝-value < 0.01 brought by ProRec compared with the original methods are indicated with *. model N@5 M@5 R@5 N@5 M@5 R@5 N@5 M@5 R@5 AMusic AToy Yahoo SVD [29] 0.0363 0.0667 0.0264 0.0116 0.0212 0.0084 0.1614 0.2316 0.1430 NCE [28] 0.0375 0.0693 0.0276 0.0143 0.0262 0.0104 0.2498 0.3440 0.2249 CDAE [10] 0.0074 0.0124 0.0060 0.0046 0.0080 0.0033 0.0716 0.1022 0.0652 WRMF [10] 0.0236 0.0435 0.0171 0.0086 0.0157 0.0062 0.2503 0.3244 0.2321 VAE-CF [31] 0.0146 0.0258 0.0115 0.0117 0.0212 0.0084 0.2622 0.3423 0.2380 LightGCN [32] 0.0406 0.0830 0.0368 0.0162 0.0296 0.0117 0.2750 0.3729 0.2494 EGLN [33] 0.0422 0.0866 0.0386 0.0166 0.0303 0.0120 0.2806 0.3809 0.2543 CDAE + T-CE [5] 0.0080 0.0143 0.0062 0.0049 0.0080 0.0038 0.0714 0.1032 0.0648 LightGCN + SGDL [34] 0.0417 0.0852 0.0378 0.0166 0.0304 0.0120 0.2825 0.3830 0.2561 SVD + ProRec 0.0374* 0.0677 0.0275* 0.0123* 0.0220* 0.0093* 0.1987* 0.2820* 0.1763* NCE + ProRec 0.0396* 0.0723* 0.0298* 0.0149* 0.0267* 0.0108* 0.2619* 0.3562* 0.2364* CDAE + ProRec 0.0080* 0.0148* 0.0065* 0.0049 0.0084 0.0038* 0.0718 0.1028 0.0660* WRMF + ProRec 0.0241* 0.0442* 0.0179* 0.0096* 0.0165* 0.0077* 0.2699* 0.3358* 0.2403* VAE-CF + ProRec 0.0162* 0.0266* 0.0120 0.0129* 0.0231* 0.0102* 0.2743* 0.3535* 0.2472* LightGCN + ProRec 0.0424* 0.0866* 0.0379* 0.0168* 0.0302* 0.0120* 0.2924* 0.3918* 0.2636* EGLN + ProRec 0.0445* 0.0915* 0.0406* 0.0173* 0.0317* 0.0124* 0.2973* 0.4012* 0.2705* Since we know the index of the randomly added noises in have the following observations. the dataset, we evaluate the level of distinguished noises In three different domains, by integrating ProRec with by Hit Ratio (HR) of the real added noises in Figure 2(a). both non-deep and deep methods, we can consistently im- Specifically, we observe a consistent high HR of NCE + prove strong baseline models and achieve start-of-the-art ProRec around 90% when the level of noises increases. performance. Since the graph-based model can leverage Furthermore, we test Recall of NCE and NCE + ProRec high-order relations among users and items, EGLN out- under different levels of noises. It can be clearly observed performs all baselines on three datasets. Furthermore, our that NCE + ProRec always performs better than NCE, as proposed EGLN + ProRec outperforms all deep-learning the former can accurately relabel some noises to elimi- baselines on three datasets, ranging from 3.34% (achieved nate the effect of noises. For example, when the noise on AToy on Recall@5) to 6.38% (achieved on Yahoo on level is at 20%, NCE + ProRec significantly improves NCE Recall@5), while NCE + ProRec outperforms all non deep- from 0.1579 to 0.1596. learning baselines on three datasets, ranging from 1.91% (achieved on AToy on MAP@5) to 7.97% (achieved on 4.3. Real Data Experiments AMusic on Recall@5). All of these show that ProRec is ca- pable of effective recommendation on different datasets. 4.3.1. Overall Performance Comparison One step further, we observe that the methods based on the proposed ProRec framework all outperform the We compare the recommendation results of the proposed original models, which indicates the advantage of the ProRec to those of the baseline models. Table 2 shows denoising process. Compared with existing denoising the NDCG, MAP, and Recall scores on three datasets. We 0 357 3565 1659 2304 1890 1418 357 3565 1659 2304 2630 1418 1 1099 1876 3462 3161 1197 640 1099 1876 3462 1352 1197 640 2 1361 640 1170 1182 721 476 Titanic 1361 2186 1170 1182 721 3436 [Drama, 3 1171 2599 1073 1169 1015 2274 Romance] 1171 1194 1933 1169 1015 2274 4 2514 2241 1672 1462 2177 2245 2514 2241 1672 3560 1799 2245 5 3395 379 896 1347 1077 1513 3395 379 896 1347 1077 2847 Friday the 13th Part 3: 3D Shakespeare in Love Runaway Bride [Horror] [Comedy, Romance] [Comedy, Romance] Figure 3: The denoising process of ProRec. Table 3 Sinkhorn, it is ineffective to directly adopt one-to-one Impact of different model components of ProRec on AMusic. matching (i.e., NCE + EMD) due to the removal of many matched interactions. For example, NCE + Sinkhorn Metric N@5 R@5 Time N@5 R@5 Time significantly improves NCE from 0.0276 to 0.0281 (on Method NCE EGLN Recall@5 on AMusic) while both metrics of NCE + EMD Base 0.0375 0.0276 64s 0.0422 0.0386 82s +EMD 0.0360 0.0259 98s 0.0406 0.0371 124s are lower than those of the original NCE. +Sinkhorn 0.0384 0.0281 132s 0.0431 0.0398 174s The efficiency of relaxed regularization.By balancing +ROT 0.0383 0.0281 76s 0.0430 0.0392 109s the performance and the runtime via relaxing the opti- +ProRec 0.0396 0.0298 82s 0.0445 0.0406 118s mized constraints, we can observe that NCE + ROT can reduce around half of the runtime while keeping almost frameworks (i.e., T-CE and SGDL), ProRec can not only the same NDCG and Recall on AMusic and Yahoo. be flexibly integrated with both deep and non-deep RS The effect of personalized thresholding mechanism. methods but also gain more improvements. We first use one global threshold for ROT to study the Moreover, on top of existing RS methods, our proposed effectiveness of the personalized thresholding mecha- ProRec in Eq. 4 does not introduce significant compu- nism. In the experiment, we search the hyperparameter tations. In the third and sixth columns in Table 3, we of global thresholds 𝜎 in {5, 10, 15, 20, 25} and find that compare the model’s training time under the same exper- NCE + ROT achieves the best performance when 𝜎 = 10. imental setting (e.g., embedding dimension). As can be Compared with NCE + ROT, NCE + ProRec can individu- observed, the training time of the proposed NCE + ProRec ally discriminate intrinsic preferences and noises without and LightGCN + ProRec are within the same scale as NCE tuning the hyperparameter of the threshold, and then and LightGCN on AMusic dataset. improve the performance of the partial OT framework. Note that the improvements brought by the denois- For example, the performance of NCE + ProRec achieves ing process on the denser datasets (e.g., the performance 0.0298, which outperforms 0.0281 of NCE + ROT on Re- gains between 0% noise level and 5% noise level on ML- call@5 on AMusic dataset. 100k shown in Figure 2(a)) are less than those on the The effect of hyperparameters. Our proposed ProRec sparse datasets (e.g., the performance gains between framework introduces four hyperparameters: 𝛾, 𝛽, 𝜁 and EGLN + ProRecn and EGLN on Yahoo in Table 2). Al- 𝜆, where we empirically set 𝛽 to 20 and 𝜁 to 0.001. 𝜆 though we can explicitly eliminate some effects of noisy controls the degree of the relabeling and 𝛾 controls the interactions (as shown in Figure 2(a)), they only have a sparsity of the matching matrix. Take NCE + ProRec for minor impact on the learned embeddings when interac- example (shown in Figure 2(b)-2(c)), the optimal 𝜆 and 𝛾 tions are enough in the observed data, which is consistent on AMusic are found to be 0.5 and 0.1, respectively. In with the results in recent studies [35]. general, we can set 𝜆 to 0.5 for denser datasets and 0.25 for the sparser datasets and always set 𝛾 to 0.1. 4.3.2. Model Ablation and Hyperparameter Study 4.3.3. Denoising Case Study In this section, three groups of ablation studies are con- ducted to evaluate the impacts of our proposed many-to- To demonstrate the advantages of the denoising process, many matching, relaxed regularization, and personalized we visualize the interaction matrix given by ML-100k on thresholding mechanisms, as well as the effects of hyper- the left and the learned matrix of the proposed NCE + parameters. ProRec on the right (as shown in Figure 3). The num- The effect of continuous many-to-many matching. bers in the boxes are actual movie IDs. The colors in Compared with the many-to-many matching of NCE + the original interaction matrix (left) are the same, indi- cating uniform weights, whereas those in the relabeled [6] B. Bhushan Damodaran, B. Kellenberger, R. Fla- matrix (right) are different. The different depths of blue mary, D. Tuia, N. Courty, Deepjdot: Deep joint represent the matching cost (the deeper, the lower) and distribution optimal transport for unsupervised do- white boxes denote the intrinsically preferred items. As main adaptation, in: ECCV, 2018. we can see from the different colors, our many-to-many [7] V. Huynh, H. Zhao, D. Phung, Otlda: A geometry- matching matrix effectively discriminates intrinsic and aware optimal transport approach for topic model- noisy interactions. Moreover, we can also observe the ing, NIPS (2020). personalized thresholding mechanism to work as differ- [8] M. Cuturi, Sinkhorn distances: Lightspeed compu- ent numbers of items is being replaced for different users. tation of optimal transport (2013). Furthermore, we show several movies relevant to user 0. [9] R. Xu, P. Liu, Y. Zhang, F. Cai, J. Wang, S. Liang, Since most interacted movies of user 0 are from the ro- H. Ying, J. Yin, Joint partial optimal transport for mance category, such as Titanic and Shakespeare in Love, open set domain adaptation, in: IJCAI, 2020. the horror movie is abnormal and is unlikely to reflect [10] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering her/his intrinsic preference. NCE + ProRec automatically for implicit feedback datasets, in: ICDM, 2008. learns to highlight the two romantic movies. After down- [11] Y. Liu, Q. Liu, Y. Tian, C. Wang, Y. Niu, Y. Song, weighing Friday the 13th Part 3: 3D, the scoring of the C. Li, Concept-aware denoising graph neural net- user’s preference changes, and thus the ranking of an- work for micro-video recommendation, in: Proceed- other romantic movie (Runaway Bride) rises to show up ings of the 30th ACM International Conference on in the top 𝐾 = 6 candidate list for user 0. Information & Knowledge Management, 2021, pp. 1099–1108. [12] R. Jagerman, H. Oosterhuis, M. de Rijke, To model 5. Conclusion or to intervene: A comparison of counterfactual and online learning to rank from user interactions, In this paper, we propose a novel ProRec framework for in: SIGIR, 2019. the recommendation with implicit feedback. ProRec can [13] H. Lu, M. Zhang, S. Ma, Between clicks and sat- effectively denoise the user-item interactions by flexibly isfaction: Study on multi-phase user preferences serving as a plug-in, so as to further improve the recom- and satisfaction for online news reading, in: SIGIR, mendation accuracy for both non-deep and deep learning 2018, pp. 435–444. RS methods. We demonstrate the superior performance [14] T. Schnabel, A. Swaminathan, A. Singh, N. Chan- of ProRec in recommendation through extensive experi- dak, T. Joachims, Recommendations as treatments: ments and showcase its effectiveness in denoising user- Debiasing learning and evaluation, JMLR (2016). item interactions through insightful case studies. Since [15] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, L. Fei-Fei, Men- we achieve denoising in a fully unsupervised fashion tornet: Learning data-driven curriculum for very without accessing any additional user or item data, we deep neural networks on corrupted labels, in: ICML, expect ProRec to incur no more negative societal impact 2018. than any basic recommender system. [16] N. Kolkin, J. Salavon, G. Shakhnarovich, Style trans- fer by relaxed optimal transport and self-similarity, References in: CVPR, 2019. [17] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From [1] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling word embeddings to document distances, in: ICML, dwell time to predict click-level satisfaction, in: 2015, pp. 957–966. WSDM, 2014. [18] L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, [2] H. Lu, M. Zhang, W. Ma, C. Wang, F. xia, Y. Liu, Y. Zhang, G. Wang, R. Zhang, L. Carin, Adversarial L. Lin, S. Ma, Effects of user negative experience in text generation via feature-mover’s distance, in: mobile news streaming, in: SIGIR, 2019. NIPS, 2018. [3] B. Yang, S. Lee, S. Park, S.-g. Lee, Exploiting vari- [19] L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan, ous implicit feedback for collaborative filtering, in: H. Zhang, B. Li, D. Shen, C. Chen, L. Carin, Im- WWW, 2012. proving sequence-to-sequence learning via optimal [4] N. Courty, R. Flamary, D. Tuia, A. Rakotomamonjy, transport, in: ICLR, 2019. Optimal transport for domain adaptation, TPAMI [20] Y. Meng, X. Yan, W. Liu, H. Wu, J. Cheng, Wasser- 39 (2017) 1853–1865. stein collaborative filtering for item cold-start rec- [5] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denois- ommendation, in: Proceedings of the 28th ACM ing implicit feedback for recommendation, WSDM Conference on User Modeling, Adaptation and Per- (2021). sonalization, 2020, pp. 318–322. [21] Q. Xu, J. Xiong, X. Cao, Q. Huang, Y. Yao, From rative filtering, in: SIGIR, 2019. social to individuals: a parsimonious path of multi- [29] P. Cremonesi, Y. Koren, R. Turrin, Performance of level models for crowdsourced preference aggrega- recommender algorithms on top-n recommenda- tion, TPAMI (2018). tion tasks, in: RecSys, 2010. [22] A. Gretton, D. Sejdinovic, H. Strathmann, S. Bal- [30] S. Sedhain, H. Bui, J. Kawale, N. Vlassis, B. Kveton, akrishnan, M. Pontil, K. Fukumizu, B. K. Sriperum- A. K. Menon, T. Bui, S. Sanner, Practical linear mod- budur, Optimal kernel choice for large-scale two- els for large-scale one-class collaborative filtering, sample tests, in: NIPS, 2012. in: IJCAI, 2016. [23] S. Boyd, N. Parikh, E. Chu, Distributed optimization [31] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, and statistical learning via the alternating direction Variational autoencoders for collaborative filtering, method of multipliers, 2011. in: WWW, 2018. [24] N. Parikh, S. Boyd, Proximal algorithms, Founda- [32] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, tions and Trends in optimization (2014). Lightgcn: Simplifying and powering graph convolu- [25] N. Angelopoulos, J. Cussens, Exploiting informa- tion network for recommendation, in: SIGIR, 2020, tive priors for bayesian classification and regression pp. 639–648. trees., in: IJCAI, 2005. [33] Y. Yang, L. Wu, R. Hong, K. Zhang, M. Wang, En- [26] J. Ding, Y. Quan, Q. Yao, Y. Li, D. Jin, Simplify and hanced graph learning for collaborative filtering robustify negative sampling for implicit collabora- via mutual information maximization, in: SIGIR, tive filtering (2020). 2021, pp. 71–80. [27] Z.-H. Deng, L. Huang, C.-D. Wang, J.-H. Lai, S. Y. [34] Y. Gao, Y. Du, Y. Hu, L. Chen, X. Zhu, Z. Fang, Philip, Deepcf: A unified framework of represen- B. Zheng, Self-guided learning to denoise for robust tation learning and matching function learning in recommendation, in: SIGIR, 2022. recommender system, in: AAAI, volume 33, 2019, [35] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, pp. 61–68. Y. Xu, M. Coates, Non-parametric graph learning [28] G. Wu, M. Volkovs, C. L. Soon, S. Sanner, H. Rai, for bayesian graph neural networks, PMLR (2020). Noise contrastive estimation for one-class collabo-