Partial Relaxed Optimal Transport for Denoised Recommendation

Yanchao Tan¹,², Carl Yang³,*, Xiangyu Wei⁴, Ziyue Wu⁵, Weiming Liu⁴,† and Xiaolin Zheng⁴

¹ The College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China
² Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, China
³ The Department of Computer Science, Emory University, Atlanta 30322, United States
⁴ The College of Computer Science, Zhejiang University, Hangzhou 310027, China
⁵ The School of Management, Zhejiang University, Hangzhou 310058, China

DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA
* Corresponding author.
Emails: yctan@fzu.edu.cn (Y. Tan); j.carlyang@emory.edu (C. Yang); weixy@zju.edu.cn (X. Wei); wuziyue@zju.edu.cn (Z. Wu); 21831010@zju.edu.cn (W. Liu); xlzheng@zju.edu.cn (X. Zheng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Abstract

The interaction data used by recommender systems (RSs) inevitably include noise resulting from mistaken or exploratory clicks, especially under implicit feedback. Without proper denoising, RS models cannot effectively capture users' intrinsic preferences or the true interactions between users and items. To address such noise, existing methods mostly rely on auxiliary data which are not always available. In this work, we ground on Optimal Transport (OT) to globally match a user embedding space and an item embedding space, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions without supervision. Specifically, we first leverage the OT framework via the Sinkhorn distance to compute continuous many-to-many user-item matching scores. Then, we relax the regularization in the Sinkhorn distance to achieve a closed-form solution with reduced time complexity. Finally, to account for individual user behaviors during denoising, we develop a partial OT framework that adaptively relabels user-item interactions through a personalized thresholding mechanism. Extensive experiments show that our framework can significantly boost the performance of existing RS models.


1. Introduction

With the rapid growth of various activities on the Web, recommender systems (RSs) have become fundamental in helping users alleviate the problem of information overload. However, users can click on some items by mistake or out of curiosity, and many RSs will also recommend some less relevant items for exploration every now and then. Take movie watching as an example. Since a user cannot always distinguish horror movies from romantic ones by their titles, he/she can easily click on a horror movie by mistake among the many romantic movies he/she usually watches. In this case, a traditional method like collaborative filtering (CF) cannot effectively reduce the probability of recommending horror movies unless such noisy clicks are extremely rare, which may in turn induce even more noisy clicks.

Recently, to get out of such mistakes, existing research discovers users' intrinsic preferences with the help of external user behaviors [1], auxiliary item features [2] or extra feedback [3]. A key limitation is the reliance on auxiliary data, which costs significant effort to collect and is not always attainable in RSs. Without supervision, can we develop a framework that automatically denoises user-item interactions for more accurate recommendations?

In this work, we approach the denoised recommendation problem in two steps: 1. distinguishing intrinsic and noisy interactions; 2. stressing intrinsic preferences and reducing noisy ones for recommendation. Inspired by the principle of least modeling effort in an unsupervised fashion [4, 5], we refer to noise as the minor abnormal data that need higher modeling effort than the intrinsic ones.

Based on the above rationale, we propose to distinguish noise by ranking the global matching cost between the user and item embedding spaces, and to flexibly integrate this process with both non-deep and deep RS methods. In this way, the key to distinguishing noise boils down to finding a global matching matrix with the minimum matching cost. Subsequently, we advocate a principled denoised RS framework and calculate the matching matrix grounded on Optimal Transport (OT), which has been introduced to address the unsupervised matching problem between two distributions or spaces in many fields [6, 7]. By minimizing the overall cost of the mismatched user-item pairs, OT finds a global optimal matching matrix whose values represent the matching cost (i.e., the modeling effort): a low user-item matching cost indicates a user's intrinsic preferences, while a high value indicates noise.

However, to apply OT to denoised recommendation, three problems remain: 1. unlike the one-hot constraint in previous applications of OT, the nature of RSs requires a many-to-many matching; 2. the large number of users and items makes the matching process time-consuming; 3. a mechanism is needed to properly leverage the learned matching matrix and finally eliminate the effect of noise for denoised recommendation.
To address these challenges, we first leverage the OT framework via the Sinkhorn distance [8]. Unlike the one-hot matching matrices that satisfy the requirements of many widely studied tasks (e.g., transfer learning [6, 9]), we compute a continuous many-to-many approximation to the discrete OT, so as to meet the nature of RSs. Furthermore, we propose to relax the regularization in Sinkhorn to achieve a closed-form solution, which reduces the time complexity for large-scale data from 𝒪(max(M, N)³) to 𝒪(max(M, N)²). Then, we rank the learned continuous matching scores to differentiate intrinsic and noisy interactions for each user. Finally, we take into account that individual users' behaviors can vary in RSs and should be handled differently during denoising. To this end, we design a personalized thresholding mechanism for relabeling user-item interactions, which efficiently finds a splitting point for each individual user according to the ranked matching scores.

Experimental results on one real-world dataset with synthesized noise demonstrate our proposed ProRec's ability to denoise under various noise levels. Further experiments on three original real-world datasets verify the advantages of ProRec, demonstrating significant performance boosts upon multiple popular RS models in terms of both effectiveness and efficiency.

2. Related work

Many research studies have pointed out that there exists a large proportion of noisy interactions in implicit feedback (i.e., clicks) [10, 1, 11]. Such feedback is easily affected by different factors, such as position bias [12], caption bias [13] and exposure bias [10, 14]. Therefore, there exist large gaps between the implicit feedback and the actual user preferences, which are detrimental to users' continuous behaviors and overall satisfaction [2, 5].

To alleviate the above issues, many researchers have considered incorporating auxiliary feedback into the identification of noise in implicit feedback, such as dwell time [1] and scroll intervals [13]. These works usually design features manually and label items with external help (e.g., from domain experts) [13, 2], which requires extensive effort and cost that are not always affordable across RSs. To this end, sheer observations on training losses have recently been explored to denoise implicit feedback during training in an unsupervised fashion [5]. However, this method is purely heuristic and can be integrated with deep learning methods only. Based on the principle of least modeling effort established in different fields [4, 15], our proposed framework aims to denoise the user-item interactions without supervision; it is supported by Optimal Transport (OT) theory with guaranteed optimality in the global matching and can be flexibly integrated with both deep and non-deep RS methods.

OT aims to find a matching between two distributions or spaces and has been most widely used in transfer learning, where it matches an instance in the source domain to another in the target domain with the least cost. Existing works mainly study discrete OT for one-to-one matching. For example, in computer vision, OT has been applied to domain adaptation [4, 6] and style transfer [16]. OT has also been successfully applied to many natural language processing tasks, such as topic modeling [7, 17], text generation [18], and sequence-to-sequence learning [19]. Considering the effect of non-matched pairs, [9] proposed a partial one-to-one matching that leverages one shared threshold to address domain shift for domain adaptation, instead of taking all pairs into account.

However, such discrete one-to-one matching cannot satisfy the nature of many-to-many user-item relationships in RSs, where each user can interact with many items and each item can interact with many users. Though OT has been applied to the cold-start recommendation problem [20], that work directly measures user-item similarity without taking the effect of noise into consideration. Moreover, partial matching with a shared threshold cannot reflect individual user behaviors, and thus leads to suboptimal recommendation performance.

3. Methodology

In this section, we detail the proposed Partial relaxed optimal Transport for the denoised Recommendation (ProRec) framework, as shown in Figure 1. Specifically, we first formulate denoised recommendation as a two-step process. Then, we introduce the Optimal Transport (OT) framework for distinguishing noise through global user-item matching. Furthermore, we elaborate on our proposed ProRec framework, which can be applied to achieve denoised recommendation.

3.1. Problem Statement

Following the standard recommendation setting, we denote the user-item interaction matrix as X ∈ {0, 1}^{M×N}, where M denotes the number of users and N denotes the number of items. We construct user-item relationships across a user embedding space 𝒟ᵤ and an item embedding space 𝒟ᵥ. The problem is that the observed user-item pairs contain some noisy interactions, which need to be relabeled.
[Figure 1: three users (A, B, C) and six items, shown as four panels: the interaction matrix X, the matching matrix π*, the personalized thresholds κ, and the relabeled matrix R; intrinsic preferences and relabeled entries are highlighted.]

Figure 1: A toy example illustrating the denoised recommendation based on our ProRec framework.

In this work, we approach denoised recommendation through two steps. First, we distinguish intrinsic and noisy interactions across 𝒟ᵤ and 𝒟ᵥ via a global matching matrix π*. Then, to stress users' intrinsic preferences and reduce noise, we rank the learned π* and keep only the top κ interactions of each user via a personalized thresholding mechanism. In this way, we learn a relabeled matrix R ∈ ℝ^{M×N} for denoised recommendation.

3.2. Global Matching for Distinguishing Noises

As motivated in Section 1, a framework that can denoise user-item interactions for more accurate recommendation is in great need. However, without supervision, it is hard to define noisy interactions, let alone stress intrinsic preferences while reducing noisy ones.

To distinguish noise in an unsupervised fashion, several works [4, 15, 21, 5] have pointed out a principled way to locate noise: find the data that need the most modeling effort or cost. Inspired by this principle of least modeling effort, we regard noise as the minor abnormal data; compared with the interactions from intrinsic preferences, noisy interactions take much more effort or cost to model.

Based on this least-effort rationale, we propose a novel denoising framework that distinguishes noise by ranking the global matching between the user and item embedding spaces. The framework serves as a plug-in, so it can be flexibly integrated into both non-deep and deep RS methods. In this scenario, the key to distinguishing noise boils down to finding a global user-item matching across the user space 𝒟ᵤ and the item space 𝒟ᵥ with minimal matching cost.

To address the unsupervised matching problem between two spaces, we advocate a principled denoised RS framework grounded on Optimal Transport (OT), which has been successfully applied in many fields such as CV [6] and NLP [7]. As studied theoretically in [22], the OT formulation derived from the Earth Mover's Distance (EMD) can be expressed as

\[
\min_{\pi\in\mathcal{X}(\mathcal{D}_u,\mathcal{D}_v)} \sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{M}_{ij}\pi_{ij},
\qquad \text{s.t. } \mathbf{1}_M\pi = \tfrac{1}{N}\mathbf{1}_N,\ \ \mathbf{1}_N\pi^{T} = \tfrac{1}{M}\mathbf{1}_M,
\tag{1}
\]

where 𝒳(𝒟ᵤ, 𝒟ᵥ) denotes the joint probability distribution over the user embedding space 𝒟ᵤ and the item embedding space 𝒟ᵥ; 1_M ∈ ℝ^{1×M} and 1_N ∈ ℝ^{1×N} are vectors with all elements equal to 1; M is the number of users and N is the number of items; and π* denotes the optimal transport plan.

Specifically, the matching cost matrix ℳ directly reflects the modeling effort: a low user-item matching cost indicates a user's intrinsic preferences, while high values indicate noise. Based on the definition of ℳ, we can integrate the denoising process with both non-deep and deep RS models as follows:

\[
\mathcal{L}(U, V) = \|X - UV^{T}\|_F^2 + \zeta(\|U\|_2^2 + \|V\|_2^2),
\tag{2}
\]
\[
\mathcal{L}(U, V) = \mathcal{C}(X, G_y(U, V)) + \zeta(\|U\|_2^2 + \|V\|_2^2),
\tag{3}
\]

where U ∈ ℝ^{M×d} and V ∈ ℝ^{N×d} represent the user and item embeddings, G_y denotes a classification network, and the hyperparameter ζ balances the loss against the regularization term. For non-deep models, Eq. 2 can be solved effectively through many methods (e.g., Alternating Least Squares). Since the prediction scores represent user-item similarities, which are negatively related to the matching cost, we calculate the cost matrix via the inner product as ℳ = −UVᵀ. For deep models, we can optimize Eq. 3 by adopting the widely used binary cross-entropy loss 𝒞(·,·), and the analogous calculation gives ℳ = −G_y(U, V).

However, there are three technical challenges when applying OT to denoised recommendation:

Challenge 1. In RSs, each user can interact with many items and each item can interact with many users, which corresponds to a many-to-many interaction matrix. Unlike previous applications of OT in CV or NLP, directly applying the one-hot constraint specified in Eq. 1 is counter-intuitive for recommendation.

Challenge 2. The numbers of users and items in real-world RSs tend to be large, making the matching process time-consuming.

Challenge 3. Besides distinguishing noise, a mechanism is needed to eliminate the effect of noisy interactions. A naïve way is to set one global thresholding hyperparameter κ and keep the top κ items with the least matching cost for each user. However, a global threshold cannot accommodate individual user behaviors.
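Before turning to these challenges, note that the cost-matrix construction in Eq. 2 is straightforward to implement for the non-deep case. The following is a minimal sketch of our own (function names, shapes, and the ALS details are assumptions, not the authors' released code):

```python
import numpy as np

def mf_cost_matrix(X, d=32, zeta=0.001, iters=20, seed=0):
    """Fit the regularized MF objective of Eq. 2 by Alternating Least
    Squares, then return the OT cost matrix M = -U V^T (Section 3.2)."""
    rng = np.random.default_rng(seed)
    num_users, num_items = X.shape
    U = rng.normal(scale=0.1, size=(num_users, d))
    V = rng.normal(scale=0.1, size=(num_items, d))
    I = np.eye(d)
    for _ in range(iters):
        # Ridge-regression updates that minimize Eq. 2 one factor at a time.
        U = X @ V @ np.linalg.inv(V.T @ V + zeta * I)
        V = X.T @ U @ np.linalg.inv(U.T @ U + zeta * I)
    return -U @ V.T  # low cost <=> high predicted user-item similarity

# Toy usage on a 3-user x 6-item matrix, the scale of Figure 1.
X = np.array([[1, 1, 1, 0, 1, 0],
              [1, 1, 0, 1, 1, 1],
              [1, 1, 0, 0, 0, 0]], dtype=float)
M_cost = mf_cost_matrix(X, d=2)
```

For a deep model, the same role would be played by ℳ = −G_y(U, V), i.e., the negated prediction scores of the trained network.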
3.3. The ProRec Framework

In this section, we propose the Partial relaxed optimal Transport for denoised Recommendation (ProRec) framework to address the above three challenges.

Many-to-many matching to meet the nature of RSs. To address Challenge 1, we look for a smoother version of OT by lowering the sparsity and increasing the entropy of the matching matrix [4]. Among the many OT algorithms (e.g., ADMM and the Proximal Point Method [23, 24]), we choose the Sinkhorn algorithm [8] for its simplicity and efficiency. We achieve OT with many-to-many matching through the Sinkhorn algorithm as follows:

\[
\min_{\pi\in\mathcal{X}(\mathcal{D}_u,\mathcal{D}_v)} \sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{M}_{ij}\pi_{ij} + \gamma H(\pi),
\qquad \text{s.t. } \mathbf{1}_M\pi = p,\ \ \mathbf{1}_N\pi^{T} = q,
\tag{4}
\]

where H(π) = Σ_{i=1}^{M} Σ_{j=1}^{N} π_ij (log(π_ij) − 1), 1_M ∈ ℝ^{1×M} and 1_N ∈ ℝ^{1×N} are vectors with all elements equal to 1, and γ is a hyperparameter that balances the entropy regularization against the OT objective.

Compared with most studies of discrete OT in CV, the major advantages of the Sinkhorn formulation above are as follows: i) By relaxing the discrete constraint, we obtain a continuous matching matrix instead of a one-hot one, which can capture the many-to-many relationships between users and items. ii) By adding an entropy regularization, we address the sparsity problem of π*; the resulting dense matrix can be used for the subsequent ranking. iii) By modifying the distribution constraint on π from uniform (i.e., 1_M π = (1/N)1_N) to popularity-based (i.e., 1_M π = p), we model non-uniform distributions where users and items have different numbers of interactions, and successfully adapt the optimization to the recommendation scenario. Specifically, we count the interactions and normalize the counts to obtain p and q, so that Σ_{j=1}^{N} p_j = 1 and Σ_{i=1}^{M} q_i = 1.

Relaxed OT to reduce the time complexity. Note that the optimal π* is dense and needs to be computed through multiple iterations, which is time-consuming in RSs with large numbers of users and items. Inspired by the Relaxed OT for EMD [17], we extend the original Sinkhorn algorithm with a relaxed regularization. Specifically, we use two auxiliary distances, each of which is essentially the Sinkhorn objective with only one of the constraints in Eq. 4:

\[
\min_{\pi\in\mathcal{X}(\mathcal{D}_u,\mathcal{D}_v)} \sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{M}_{ij}\pi_{ij} + \gamma H(\pi)
\qquad \text{s.t. } \mathbf{1}_M\pi = p,
\tag{5}
\]
\[
\min_{\pi\in\mathcal{X}(\mathcal{D}_u,\mathcal{D}_v)} \sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{M}_{ij}\pi_{ij} + \gamma H(\pi)
\qquad \text{s.t. } \mathbf{1}_N\pi^{T} = q.
\tag{6}
\]

Proposition 1. The constraints of the problem in Eq. 4 are relaxed by Eq. 5 and Eq. 6. By defining ℳ̃ = e^{−ℳ/γ} (element-wise), the closed-form solution for the global optimal matching matrix is:

\[
\pi^{*} = \max\big(\tilde{\mathcal{M}}\,\mathrm{diag}(p \oslash \mathbf{1}_M\tilde{\mathcal{M}}),\ \mathrm{diag}(q \oslash \mathbf{1}_N\tilde{\mathcal{M}}^{T})\,\tilde{\mathcal{M}}\big),
\tag{7}
\]

where ⊘ denotes element-wise division.

Proof. Based on Eq. 6 and using the Lagrange multiplier method, we have:

\[
\mathcal{L} = \sum_{i=1}^{M}\sum_{j=1}^{N}\mathcal{M}_{ij}\pi_{ij} + \gamma H(\pi) - (\mathbf{1}_N\pi^{T} - q)f^{T},
\tag{8}
\]

where f contains the Lagrange multipliers. Then, to find the optimal solution, we set the gradient ∂ℒ/∂π_ij to zero:

\[
\frac{\partial\mathcal{L}}{\partial\pi_{ij}} = \mathcal{M}_{ij} + \gamma\log\pi_{ij} - f_i = 0,
\tag{9}
\]

from which we obtain the solution for π_ij as

\[
\pi_{ij} = e^{(f_i - \mathcal{M}_{ij})/\gamma}.
\tag{10}
\]

Users and items have different numbers of interactions, i.e., different degrees of popularity. To take this popularity into consideration, we count and normalize the number of items each user i has interacted with. Based on the popularity of user i (i.e., q_i), we constrain the matching matrix π by Σ_{j=1}^{N} π_ij = q_i. Then we have

\[
\sum_{j=1}^{N} e^{(f_i - \mathcal{M}_{ij})/\gamma} = q_i.
\tag{11}
\]

Since the Lagrange multiplier f_i does not depend on j, we can extract it and rewrite the formula as:

\[
\frac{1}{q_i}\sum_{j=1}^{N} e^{-\mathcal{M}_{ij}/\gamma} = e^{-f_i/\gamma}.
\tag{12}
\]

Taking the logarithm of both sides and moving γ to the other side gives:

\[
f_i = -\gamma\log\Big(\frac{1}{q_i}\sum_{j} e^{-\mathcal{M}_{ij}/\gamma}\Big).
\tag{13}
\]

Using the definition ℳ̃ = e^{−ℳ/γ}, this can be written in matrix form as:

\[
f = \gamma\log q - \gamma\log(\mathbf{1}_N\tilde{\mathcal{M}}^{T}).
\tag{14}
\]

According to Eq. 10 and Eq. 12, we have

\[
\pi_{ij} = e^{(f_i - \mathcal{M}_{ij})/\gamma} = q_i\,\frac{e^{-\mathcal{M}_{ij}/\gamma}}{\sum_{j'} e^{-\mathcal{M}_{ij'}/\gamma}}.
\tag{15}
\]

Finally, based on Eq. 15 and the definition of ℳ̃, we obtain the closed-form solution in matrix form:

\[
\pi_V = \mathrm{diag}(q \oslash \mathbf{1}_N\tilde{\mathcal{M}}^{T})\,\tilde{\mathcal{M}}.
\tag{16}
\]

Following the same derivation, we can obtain the closed-form solution of Eq. 5:

\[
\pi_U = \tilde{\mathcal{M}}\,\mathrm{diag}(p \oslash \mathbf{1}_M\tilde{\mathcal{M}}).
\tag{17}
\]

The original Sinkhorn algorithm minimizes π under both constraints. By relaxing one constraint at a time, we can balance computational efficiency against performance. To approximate the solution under the original constraints, we adopt the max operation, which is consistent with the results in previous work [16]:

\[
\pi^{*} = \max(\pi_U, \pi_V),
\tag{18}
\]

which is equivalent to:

\[
\pi^{*} = \max\big(\tilde{\mathcal{M}}\,\mathrm{diag}(p \oslash \mathbf{1}_M\tilde{\mathcal{M}}),\ \mathrm{diag}(q \oslash \mathbf{1}_N\tilde{\mathcal{M}}^{T})\,\tilde{\mathcal{M}}\big).
\tag{19}
\]

Note that this closed-form solution yields a time complexity of 𝒪(max(M, N)²). By relaxing the OT formulation with the constraint on only one side at a time, our model achieves a lower time complexity than the 𝒪(max(M, N)³) reported in [6], which uses OT for one-to-one matching.
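Proposition 1 reduces the matching step to a few dense matrix operations. For comparison with the iterative Sinkhorn updates for Eq. 4, here is a minimal NumPy sketch of both (our illustration under assumed shapes and names; p is the item-popularity marginal and q the user-popularity marginal):

```python
import numpy as np

def sinkhorn(M_cost, p, q, gamma=0.1, iters=200):
    """Iterative Sinkhorn for the doubly constrained problem in Eq. 4."""
    K = np.exp(-M_cost / gamma)          # the kernel, M-tilde in the text
    u = np.ones(len(q))
    for _ in range(iters):
        v = p / (u @ K)                  # enforce column marginals 1_M pi = p
        u = q / (K @ v)                  # enforce row marginals 1_N pi^T = q
    return K * np.outer(u, v)

def relaxed_ot(M_cost, p, q, gamma=0.1):
    """Closed-form relaxed OT (Eqs. 16-19): solve each single-constraint
    problem exactly, then combine the two plans with an elementwise max."""
    M_tilde = np.exp(-M_cost / gamma)
    pi_U = M_tilde * (p / M_tilde.sum(axis=0))[None, :]   # Eq. 17 (columns)
    pi_V = M_tilde * (q / M_tilde.sum(axis=1))[:, None]   # Eq. 16 (rows)
    return np.maximum(pi_U, pi_V)                         # Eqs. 18-19
```

The closed form touches the M × N matrix only a constant number of times, which is where the 𝒪(max(M, N)²) complexity above comes from, whereas Sinkhorn repeats such passes until convergence.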
Personalized thresholding mechanism for denoised recommendation. To flexibly discriminate the intrinsic and noisy user-item interactions of individual users, we propose a non-parametric personalized thresholding mechanism. We first normalize the matching scores of each user as π*′_ij = π*_ij / Σ_j π*_ij and rank them in descending order as ρ_i = sort(π*′_i). According to the definition of the matching matrix, each row represents a user's preferences towards the items. We define κ = {κ₁, ..., κ_M} as the indices of the thresholds that filter out users' noisy preferences, where κ_i denotes user i's threshold. As shown in Figure 1, some users with consistent (narrow) interests (e.g., user A) benefit from a reasonably low threshold, since a few items can already represent their preferences and more items would introduce more noise. Meanwhile, some users with diverse (wide) interests (e.g., user B) require a relatively high threshold to model a broader range of preferences and satisfy the need for exploration. To find a personalized splitting point κ_i for each user according to the learned matrix, the model requires an automatic splitting mechanism. Inspired by the threshold selection mechanism used in CART [25], we efficiently learn the personalized thresholds κ_i for different users by minimizing the following sum of squared errors:

\[
\kappa_i = \arg\min_{\eta}\Big[\sum_{j=1}^{\eta}(\rho_{ij} - c_{1\eta})^2 + \sum_{j=\eta+1}^{N}(\rho_{ij} - c_{2\eta})^2\Big],
\tag{20}
\]

where ρ_ij is the j-th value of the sorted ρ_i, the index η ∈ {1, 2, ..., N−1}, c_{1η} = (1/η) Σ_{j=1}^{η} ρ_ij is the mean value of the top η items, and c_{2η} = (1/(N−η)) Σ_{j=η+1}^{N} ρ_ij is the mean of the remaining N−η items. After obtaining the indices of the splitting points κ, we have the corresponding threshold values ρ_{i,κ_i}. Based on the personalized threshold ρ_{i,κ_i}, we relabel via r_ij ∈ R as follows:

\[
r_{ij} = \frac{1}{1 + \exp(-\beta(\rho_{ij} - \rho_{i,\kappa_i}))},
\tag{21}
\]

where r_ij increases monotonically with ρ_ij − ρ_{i,κ_i}, and β is a hyperparameter that controls the steepness of the function.

Finally, to make the relabeling process more flexible, we use a hyperparameter λ to control the degree of relabeling and obtain a new interaction matrix as follows:

\[
X = \lambda X + (1 - \lambda) R \odot X,
\tag{22}
\]

where ⊙ denotes the element-wise product. Considering that noise in the sparse positive interactions induces worse effects in recommendation [5], in this work we only target the part of the original dataset where X_ij = 1. In this way, the newly generated X keeps users' intrinsic preferences and reduces the noisy ones. We summarize the learning procedure of our proposed ProRec in Algorithm 1, which is proved to guarantee local convergence for denoised recommendation.
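To illustrate Eqs. 20-21 concretely, the sketch below (ours; the names are hypothetical) selects one user's split point by the CART-style criterion and applies the sigmoid relabeling; the demo row reuses user A's matching scores from Figure 1:

```python
import numpy as np

def personalized_relabel(pi_row, beta=20.0):
    """Per-user split point (Eq. 20) and soft relabeling (Eq. 21)."""
    rho_unsorted = pi_row / pi_row.sum()        # normalize the user's row
    rho = np.sort(rho_unsorted)[::-1]           # rank in descending order
    N = len(rho)
    best_eta, best_sse = 1, np.inf
    for eta in range(1, N):                     # Eq. 20: scan all splits
        c1, c2 = rho[:eta].mean(), rho[eta:].mean()
        sse = ((rho[:eta] - c1) ** 2).sum() + ((rho[eta:] - c2) ** 2).sum()
        if sse < best_sse:
            best_eta, best_sse = eta, sse
    threshold = rho[best_eta - 1]               # rho_{i, kappa_i}
    # Eq. 21: sigmoid relabeling around the personalized threshold.
    return 1.0 / (1.0 + np.exp(-beta * (rho_unsorted - threshold)))

# User A's matching scores from Figure 1.
r_A = personalized_relabel(np.array([0.077, 0.077, 0.087, 0.070, 0.054, 0.034]))
```

On this row the criterion yields relabel weights above 0.5 for items 1-3 and below 0.5 for item 5; after masking by X in Eq. 22, this reproduces user A's relabeled row in Figure 1.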
Algorithm 1: Partial Relaxed Optimal Transport for Denoised Recommendation (ProRec)

Data: The noisy user-item interactions X, item popularity p, and user popularity q.
Result: Denoised user embeddings U and item embeddings V, and a relabeled matrix R.
1  while not converged do
2      1. Update U and V:
3      if learning U and V via a non-deep method then update U and V via ||X − UVᵀ||²_F + ζ(||U||²₂ + ||V||²₂) in Eq. 2 and calculate ℳ = −UVᵀ;
4      else update U and V via 𝒞(X, G_y(U, V)) + ζ(||U||²₂ + ||V||²₂) in Eq. 3, where 𝒞(·,·) is the binary cross-entropy loss, and calculate ℳ = −G_y(U, V), where G_y is a classification network;
5      2. Update X:
6      Model the denoising problem by minimizing Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_ij π_ij + γH(π) subject to 1_M π = p and 1_N πᵀ = q, as in Eq. 4;
7      Compute the global optimal matching matrix π* by the closed-form solution π* = max(ℳ̃ diag(p ⊘ 1_M ℳ̃), diag(q ⊘ 1_N ℳ̃ᵀ)ℳ̃) in Eq. 7;
8      Compute the personalized thresholds κ by the automatic splitting mechanism in Eq. 20;
9      Relabel the user-item interactions into R using the personalized thresholds in Eq. 21;
10     Update X by X = λX + (1 − λ)R ⊙ X in Eq. 22;
11 end

4. Experiments

In this section, we first conduct controlled experiments with synthetic noise on one dataset, so as to investigate ProRec's robustness to different levels of noisy interactions. Then, we evaluate ProRec's performance on three original real-world recommendation datasets with implicit feedback.

4.1. Experimental Settings

4.1.1. Datasets and evaluation protocols

We use ML-100k for the noise-controlled experiments and do a 4:1 random split into train/test data following [26]. To simulate noise, we randomly select 5%/10%/15%/20% of the ground-truth records in the train set and flip their labels; the selected records are regarded as noise during training. For the real data experiments, we use Amazon music (AMusic)¹, Amazon toys (AToy)¹, and Yahoo². These datasets have been widely adopted in previous literature [27, 5, 28], and their statistics are summarized in Table 1. We do a 5:2:3 split for them following [28]. We evaluate the ability to distinguish noise by the Hit Ratio (HR), and recommendation performance by normalized discounted cumulative gain (NDCG), mean average precision (MAP), and Recall.

Table 1
Statistics of the datasets used in our experiments.

  Dataset   # User      # Item   # Interaction   Sparsity
  ML-100k   942         1,447    55,375          0.9594
  AMusic    1,776       12,929   46,087          0.9980
  AToy      3,137       33,953   84,642          0.9992
  Yahoo     1,948,882   46,110   48,817,561      0.9995

4.1.2. Methods for comparison

We compare two types of recommendation algorithms. The first type is non-deep algorithms: SVD [29] – a similarity-based method that constructs a similarity matrix through an SVD decomposition of the implicit matrix X; NCE [28] – noise contrastive estimation item embeddings combined with a personalized user model based on PLRec [30]. The second type is deep learning methods: CDAE [10] – Collaborative Denoising Autoencoder, a deep learning method specifically optimized for implicit feedback recommendation; WRMF [10] – Weighted Regularized Matrix Factorization, a deep MF framework; VAE-CF [31] – Variational Autoencoder for Collaborative Filtering, one of the state-of-the-art deep learning based recommendation algorithms [28]; LightGCN [32] – a light graph convolution model designed for training efficiency and generalization in recommendation; EGLN [33] – a model that iteratively enhances the graph learning and node embedding modules based on mutual information maximization.

Besides the above methods, T-CE [5] and SGDL [34] are existing frameworks for denoised recommendation. The former discards large-loss samples with a dynamic threshold in each iteration, while the latter collects memorized interactions at the early stage of training and leverages them as denoising signals to guide the subsequent training. Both are designed as plug-ins for deep learning methods. We add CDAE + T-CE and LightGCN + SGDL as baselines to compare denoising ability with the proposed ProRec; these combinations achieved the best performance in their original works [5, 34].

For all the other algorithms (except CDAE + T-CE and LightGCN + SGDL), we add the proposed ProRec on top of them, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions in an unsupervised fashion. This yields SVD + ProRec, NCE + ProRec, CDAE + ProRec, WRMF + ProRec, VAE-CF + ProRec, LightGCN + ProRec, and EGLN + ProRec, respectively.

4.1.3. Implementation details

Implementations of the compared baselines are from open-source projects (SVD³, NCE³, CDAE³, WRMF³, VAE-CF³, LightGCN⁴, EGLN⁵, T-CE⁶ and SGDL⁷). We optimize ProRec with standard Adam and tune all hyperparameters through grid search: the learning rate in {0.0005, 0.001, 0.005, 0.01}, λ in {0.25, 0.5, 0.75, 1.0}, γ in {0.05, 0.075, 0.1, 0.125, 0.15, 0.175}, β in {1, 5, 10, 20, 50}, ζ in {0.0005, 0.001, 0.005, 0.01}, and the batch size in {100, 250, 500, 1000} for the different datasets. We also carefully tuned the hyperparameters of all baselines through cross-validation to achieve their best performances.

¹ https://github.com/familyld/DeepCF/tree/master/Data
² https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
³ https://github.com/wuga214/NCE_Projected_LRec
⁴ https://github.com/gusye1234/LightGCN-PyTorch
⁵ https://github.com/yimutianyang/SIGIR2021-EGLN
⁶ https://github.com/WenjieWWJ/DenoisingRec
⁷ https://github.com/zealscott/SGDL

4.2. Controlled Experiments (Synthetic Noises)

Before looking at the recommendation performance, we first inspect ProRec's ability to distinguish noise.
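As a sketch of this controlled protocol (our paraphrase of Section 4.1.1, not the authors' evaluation code; the uniform sampling of flipped records and the reading of the hit ratio are assumptions):

```python
import numpy as np

def inject_noise(X_train, ratio, seed=0):
    """Flip the labels of a fraction `ratio` of train-set records."""
    rng = np.random.default_rng(seed)
    X_noisy = X_train.astype(float).copy()
    flips = rng.choice(X_train.size, size=int(ratio * X_train.size),
                       replace=False)
    X_noisy.flat[flips] = 1.0 - X_noisy.flat[flips]
    return X_noisy, flips

def noise_hit_ratio(R, X_noisy, flips):
    """Fraction of injected flips whose relabeled value in R moves away
    from the (noisy) training label -- one plausible reading of HR."""
    return float(np.mean(np.abs(R.flat[flips] - X_noisy.flat[flips]) > 0.5))
```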
[Figure 2: three panels. (a) HR (around 90%) and Recall of NCE and NCE + ProRec under noise levels from 0% to 20%; (b) NDCG and Recall of NCE + ProRec when varying λ on AMusic; (c) the same when varying γ on AMusic.]

Figure 2: (a) Testing HR/Recall with different noise levels on the noise-controlled dataset. (b)-(c) Hyperparameter effects on the proposed ProRec.

Table 2
Comparison between the original baselines and the ones plus ProRec on three benchmark datasets. N@5, M@5 and R@5 denote NDCG@5, MAP@5 and Recall@5, respectively. The best performances of non-deep and deep learning methods are in boldface. Statistically significant improvements (p-value < 0.01) brought by ProRec over the original methods are indicated with *.

                           AMusic                       AToy                         Yahoo
  model                    N@5      M@5      R@5        N@5      M@5      R@5        N@5      M@5      R@5
  SVD [29]                 0.0363   0.0667   0.0264     0.0116   0.0212   0.0084     0.1614   0.2316   0.1430
  NCE [28]                 0.0375   0.0693   0.0276     0.0143   0.0262   0.0104     0.2498   0.3440   0.2249
  CDAE [10]                0.0074   0.0124   0.0060     0.0046   0.0080   0.0033     0.0716   0.1022   0.0652
  WRMF [10]                0.0236   0.0435   0.0171     0.0086   0.0157   0.0062     0.2503   0.3244   0.2321
  VAE-CF [31]              0.0146   0.0258   0.0115     0.0117   0.0212   0.0084     0.2622   0.3423   0.2380
  LightGCN [32]            0.0406   0.0830   0.0368     0.0162   0.0296   0.0117     0.2750   0.3729   0.2494
  EGLN [33]                0.0422   0.0866   0.0386     0.0166   0.0303   0.0120     0.2806   0.3809   0.2543
  CDAE + T-CE [5]          0.0080   0.0143   0.0062     0.0049   0.0080   0.0038     0.0714   0.1032   0.0648
  LightGCN + SGDL [34]     0.0417   0.0852   0.0378     0.0166   0.0304   0.0120     0.2825   0.3830   0.2561
  SVD + ProRec             0.0374*  0.0677   0.0275*    0.0123*  0.0220*  0.0093*    0.1987*  0.2820*  0.1763*
  NCE + ProRec             0.0396*  0.0723*  0.0298*    0.0149*  0.0267*  0.0108*    0.2619*  0.3562*  0.2364*
  CDAE + ProRec            0.0080*  0.0148*  0.0065*    0.0049   0.0084   0.0038*    0.0718   0.1028   0.0660*
  WRMF + ProRec            0.0241*  0.0442*  0.0179*    0.0096*  0.0165*  0.0077*    0.2699*  0.3358*  0.2403*
  VAE-CF + ProRec          0.0162*  0.0266*  0.0120     0.0129*  0.0231*  0.0102*    0.2743*  0.3535*  0.2472*
  LightGCN + ProRec        0.0424*  0.0866*  0.0379*    0.0168*  0.0302*  0.0120*    0.2924*  0.3918*  0.2636*
  EGLN + ProRec            0.0445*  0.0915*  0.0406*    0.0173*  0.0317*  0.0124*    0.2973*  0.4012*  0.2705*

Since we know the indices of the randomly added noise in the dataset, we evaluate how well noise is distinguished by the Hit Ratio (HR) of the truly added noise, shown in Figure 2(a). Specifically, we observe that the HR of NCE + ProRec stays consistently around 90% as the noise level increases. Furthermore, we test the Recall of NCE and NCE + ProRec under different noise levels. It can be clearly observed that NCE + ProRec always performs better than NCE, as the former can accurately relabel some noisy interactions and thereby eliminate their effect. For example, when the noise level is 20%, NCE + ProRec significantly improves NCE from 0.1579 to 0.1596.

4.3. Real Data Experiments

4.3.1. Overall Performance Comparison

We compare the recommendation results of the proposed ProRec to those of the baseline models. Table 2 shows the NDCG, MAP, and Recall scores on the three datasets. We have the following observations.

In three different domains, integrating ProRec with both non-deep and deep methods consistently improves strong baseline models and achieves state-of-the-art performance. Since graph-based models can leverage high-order relations among users and items, EGLN outperforms all baselines on the three datasets. Furthermore, our proposed EGLN + ProRec outperforms all deep learning baselines on the three datasets, with improvements ranging from 3.34% (on AToy in Recall@5) to 6.38% (on Yahoo in Recall@5), while NCE + ProRec outperforms all non-deep-learning baselines, with improvements ranging from 1.91% (on AToy in MAP@5) to 7.97% (on AMusic in Recall@5). All of these show that ProRec is capable of effective recommendation on different datasets.

One step further, we observe that the methods based on the proposed ProRec framework all outperform the original models, which indicates the advantage of the denoising process. Compared with existing denoising frameworks (i.e., T-CE and SGDL), ProRec can not only be flexibly integrated with both deep and non-deep RS methods but also gains larger improvements.
[Figure 3: top-6 ranked movie IDs for users 0-5 before (left) and after (right) denoising. Annotated titles include Titanic [Drama, Romance], Friday the 13th Part 3: 3D [Horror], Shakespeare in Love [Comedy, Romance], and Runaway Bride [Comedy, Romance].]

Figure 3: The denoising process of ProRec.

Table 3
Impact of different model components of ProRec on AMusic (N@5, R@5, and training time).

                  NCE                        EGLN
  Method          N@5      R@5      Time     N@5      R@5      Time
  Base            0.0375   0.0276   64s      0.0422   0.0386   82s
  +EMD            0.0360   0.0259   98s      0.0406   0.0371   124s
  +Sinkhorn       0.0384   0.0281   132s     0.0431   0.0398   174s
  +ROT            0.0383   0.0281   76s      0.0430   0.0392   109s
  +ProRec         0.0396   0.0298   82s      0.0445   0.0406   118s

Moreover, on top of existing RS methods, our proposed ProRec in Eq. 4 does not introduce significant computation. In the third and sixth columns of Table 3, we compare the models' training times under the same experimental setting (e.g., embedding dimension). As can be observed, the training times of the proposed NCE + ProRec and LightGCN + ProRec are on the same scale as those of NCE and LightGCN on the AMusic dataset.

Note that the improvements brought by the denoising process on the denser datasets (e.g., the performance gains between the 0% and 5% noise levels on ML-100k shown in Figure 2(a)) are smaller than those on the sparse datasets (e.g., the performance gains between EGLN + ProRec and EGLN on Yahoo in Table 2). Although we can explicitly eliminate some effects of noisy interactions (as shown in Figure 2(a)), they have only a minor impact on the learned embeddings when the observed interactions are abundant, which is consistent with the results in recent studies [35].

4.3.2. Model Ablation and Hyperparameter Study

In this section, three groups of ablation studies are conducted to evaluate the impacts of our proposed many-to-many matching, relaxed regularization, and personalized thresholding mechanisms, as well as the effects of the hyperparameters.

The effect of continuous many-to-many matching. Compared with the many-to-many matching of NCE + Sinkhorn, directly adopting one-to-one matching (i.e., NCE + EMD) is ineffective due to the removal of many matched interactions. For example, NCE + Sinkhorn significantly improves NCE from 0.0276 to 0.0281 (on Recall@5 on AMusic), while both metrics of NCE + EMD are lower than those of the original NCE.

The efficiency of relaxed regularization. By balancing performance and runtime via relaxing the optimization constraints, NCE + ROT reduces around half of the runtime while keeping almost the same NDCG and Recall on AMusic and Yahoo.

The effect of the personalized thresholding mechanism. We first use one global threshold for ROT to study the effectiveness of the personalized thresholding mechanism. In the experiment, we search the global threshold σ in {5, 10, 15, 20, 25} and find that NCE + ROT achieves its best performance when σ = 10. Compared with NCE + ROT, NCE + ProRec can individually discriminate intrinsic preferences and noise without tuning the threshold hyperparameter, and thereby improves the performance of the partial OT framework. For example, NCE + ProRec achieves 0.0298, which outperforms the 0.0281 of NCE + ROT on Recall@5 on the AMusic dataset.

The effect of hyperparameters. Our proposed ProRec framework introduces four hyperparameters: γ, β, ζ and λ, where we empirically set β to 20 and ζ to 0.001. λ controls the degree of relabeling and γ controls the sparsity of the matching matrix. Taking NCE + ProRec as an example (shown in Figures 2(b)-2(c)), the optimal λ and γ on AMusic are found to be 0.5 and 0.1, respectively. In general, we can set λ to 0.5 for denser datasets and 0.25 for sparser datasets, and always set γ to 0.1.
4.3.3. Denoising Case Study

To demonstrate the advantages of the denoising process, we visualize the interaction matrix from ML-100k on the left and the matrix learned by the proposed NCE + ProRec on the right, as shown in Figure 3. The numbers in the boxes are actual movie IDs. The colors in the original interaction matrix (left) are all the same, indicating uniform weights, whereas those in the relabeled matrix (right) differ: the different depths of blue represent the matching cost (the deeper, the lower), and white boxes denote the intrinsically preferred items. As we can see from the different colors, our many-to-many matching matrix effectively discriminates intrinsic and noisy interactions. Moreover, we can observe the personalized thresholding mechanism at work, as different numbers of items are relabeled for different users. Furthermore, we show several movies relevant to user 0. Since most of the movies user 0 interacted with are from the romance category, such as Titanic and Shakespeare in Love, the horror movie is abnormal and unlikely to reflect his/her intrinsic preference. NCE + ProRec automatically learns to highlight the two romantic movies. After down-weighing Friday the 13th Part 3: 3D, the scoring of the user's preferences changes, and thus another romantic movie (Runaway Bride) rises into the top K = 6 candidate list for user 0.

5. Conclusion

In this paper, we propose a novel ProRec framework for recommendation with implicit feedback. ProRec can effectively denoise user-item interactions by flexibly serving as a plug-in, so as to further improve the recommendation accuracy of both non-deep and deep learning RS methods. We demonstrate the superior performance of ProRec in recommendation through extensive experiments and showcase its effectiveness in denoising user-item interactions through insightful case studies. Since we achieve denoising in a fully unsupervised fashion without accessing any additional user or item data, we expect ProRec to incur no more negative societal impact than any basic recommender system.

References

[1] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling dwell time to predict click-level satisfaction, in: WSDM, 2014.
[2] H. Lu, M. Zhang, W. Ma, C. Wang, F. Xia, Y. Liu, L. Lin, S. Ma, Effects of user negative experience in mobile news streaming, in: SIGIR, 2019.
[3] B. Yang, S. Lee, S. Park, S.-g. Lee, Exploiting various implicit feedback for collaborative filtering, in: WWW, 2012.
[4] N. Courty, R. Flamary, D. Tuia, A. Rakotomamonjy, Optimal transport for domain adaptation, TPAMI 39 (2017) 1853-1865.
[5] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denoising implicit feedback for recommendation, in: WSDM, 2021.
[6] B. Bhushan Damodaran, B. Kellenberger, R. Flamary, D. Tuia, N. Courty, Deepjdot: Deep joint distribution optimal transport for unsupervised domain adaptation, in: ECCV, 2018.
[7] V. Huynh, H. Zhao, D. Phung, Otlda: A geometry-aware optimal transport approach for topic modeling, in: NeurIPS, 2020.
[8] M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport, in: NIPS, 2013.
[9] R. Xu, P. Liu, Y. Zhang, F. Cai, J. Wang, S. Liang, H. Ying, J. Yin, Joint partial optimal transport for open set domain adaptation, in: IJCAI, 2020.
[10] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: ICDM, 2008.
[11] Y. Liu, Q. Liu, Y. Tian, C. Wang, Y. Niu, Y. Song, C. Li, Concept-aware denoising graph neural network for micro-video recommendation, in: CIKM, 2021, pp. 1099-1108.
[12] R. Jagerman, H. Oosterhuis, M. de Rijke, To model or to intervene: A comparison of counterfactual and online learning to rank from user interactions, in: SIGIR, 2019.
[13] H. Lu, M. Zhang, S. Ma, Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading, in: SIGIR, 2018, pp. 435-444.
[14] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as treatments: Debiasing learning and evaluation, JMLR (2016).
[15] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, L. Fei-Fei, Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels, in: ICML, 2018.
[16] N. Kolkin, J. Salavon, G. Shakhnarovich, Style transfer by relaxed optimal transport and self-similarity, in: CVPR, 2019.
[17] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: ICML, 2015, pp. 957-966.
[18] L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, Y. Zhang, G. Wang, R. Zhang, L. Carin, Adversarial text generation via feature-mover's distance, in: NIPS, 2018.
[19] L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan, H. Zhang, B. Li, D. Shen, C. Chen, L. Carin, Improving sequence-to-sequence learning via optimal transport, in: ICLR, 2019.
[20] Y. Meng, X. Yan, W. Liu, H. Wu, J. Cheng, Wasserstein collaborative filtering for item cold-start recommendation, in: UMAP, 2020, pp. 318-322.
[21] Q. Xu, J. Xiong, X. Cao, Q. Huang, Y. Yao, From social to individuals: a parsimonious path of multi-level models for crowdsourced preference aggregation, TPAMI (2018).
[22] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, B. K. Sriperumbudur, Optimal kernel choice for large-scale two-sample tests, in: NIPS, 2012.
[23] S. Boyd, N. Parikh, E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers, Foundations and Trends in Machine Learning, 2011.
[24] N. Parikh, S. Boyd, Proximal algorithms, Foundations and Trends in Optimization (2014).
[25] N. Angelopoulos, J. Cussens, Exploiting informative priors for Bayesian classification and regression trees, in: IJCAI, 2005.
[26] J. Ding, Y. Quan, Q. Yao, Y. Li, D. Jin, Simplify and robustify negative sampling for implicit collaborative filtering, in: NeurIPS, 2020.
[27] Z.-H. Deng, L. Huang, C.-D. Wang, J.-H. Lai, S. Y. Philip, Deepcf: A unified framework of representation learning and matching function learning in recommender system, in: AAAI, volume 33, 2019, pp. 61-68.
[28] G. Wu, M. Volkovs, C. L. Soon, S. Sanner, H. Rai, Noise contrastive estimation for one-class collaborative filtering, in: SIGIR, 2019.
[29] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: RecSys, 2010.
[30] S. Sedhain, H. Bui, J. Kawale, N. Vlassis, B. Kveton, A. K. Menon, T. Bui, S. Sanner, Practical linear models for large-scale one-class collaborative filtering, in: IJCAI, 2016.
[31] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: WWW, 2018.
[32] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, Lightgcn: Simplifying and powering graph convolution network for recommendation, in: SIGIR, 2020, pp. 639-648.
[33] Y. Yang, L. Wu, R. Hong, K. Zhang, M. Wang, Enhanced graph learning for collaborative filtering via mutual information maximization, in: SIGIR, 2021, pp. 71-80.
[34] Y. Gao, Y. Du, Y. Hu, L. Chen, X. Zhu, Z. Fang, B. Zheng, Self-guided learning to denoise for robust recommendation, in: SIGIR, 2022.
[35] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, Y. Xu, M. Coates, Non-parametric graph learning for Bayesian graph neural networks, PMLR (2020).