Partial Relaxed Optimal Transport for Denoised
Recommendation
Yanchao Tan1,2 , Carl Yang3,* , Xiangyu Wei4 , Ziyue Wu5 , Weiming Liu4,† and Xiaolin Zheng4
1
  The College of Computer and Data Science, Fuzhou University, Fuzhou 350116, China
2
  Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University, Fuzhou 350116, China
3
  The Department of Computer Science, Emory University, Atlanta 30322, United State
4
  The College of Computer Science, Zhejiang University, Hangzhou 310027, China
5
  The School of Management, Zhejiang University, Hangzhou 310058, China


                                          Abstract
                                          The interaction data used by recommender systems (RSs) inevitably include noises resulting from mistaken or exploratory
                                          clicks, especially under implicit feedbacks. Without proper denoising, RS models cannot effectively capture users’ intrinsic
                                          preferences and the true interactions between users and items. To address such noises, existing methods mostly rely on
                                          auxiliary data which are not always available. In this work, we ground on Optimal Transport (OT) to globally match a user
                                          embedding space and an item embedding space, allowing both non-deep and deep RS models to discriminate intrinsic and
                                          noisy interactions without supervision. Specifically, we firstly leverage the OT framework via Sinkhorn distance to compute
                                          the continuous many-to-many user-item matching scores. Then, we relax the regularization in Sinkhorn distance to achieve
                                          a closed-form solution with a reduced time complexity. Finally, to consider individual user behaviors for denoising, we
                                          develop a partial OT framework to adaptively relabel user-item interactions through a personalized thresholding mechanism.
                                          Extensive experiments show that our framework can significantly boost the performances of existing RS models.


1. Introduction                                                                                        interactions for more accurate recommendations?
                                                                                                          In this work, we approach the denoised recommen-
With the rapid growth of various activities on the Web, dation problem in two steps: 1. distinguishing intrinsic
recommender systems (RSs) become fundamental in help- and noisy interactions; 2. stressing intrinsic preferences
ing users alleviate the problem of information overload. and reducing noisy ones for recommendation. Inspired
However, users can click some items by mistake or out by a least modeling effort principle in an unsupervised
of curiosity, and many RSs will also recommend some fashion [4, 5], we refer to noises as minor abnormal data
less relevant items for exploration every now and then. that needs higher modeling effort than the intrinsic ones.
Take movie watching for example. Since a user cannot                                                      Based on the above rationale, we propose to distin-
always distinguish horrible movies from romantic ones guish noises by ranking the global matching cost between
by movie names, he/she can easily click a horrible movie the user and item embedding spaces, and flexibly inte-
by mistake among the many romantic movies he/she usu- grate it with both non-deep and deep RS methods. In
ally watches. In this case, a traditional method like CF this way, the key to distinguishing noises boils down
cannot effectively reduce the probability of recommend- to finding a global matching matrix with the minimum
ing a horrible movie unless the noisy clicks are extremely matching cost. Subsequently, we advocate a principled
rare, which may further seduce even more noisy clicks. denoised RS framework and calculate the matching ma-
   Recently, to get out of such mistakes, existing research trix grounding on Optimal Transport (OT), which has
discovers users’ intrinsic preferences with the help of been introduced to address the unsupervised matching
external user behaviors [1], auxiliary item features [2] or problem between two distributions or spaces in many
extra feedbacks [3]. A key limitation is their reliance on fields [6, 7]. Through minimizing the overall cost of the
auxiliary data, which costs significant effort and is not mismatched user-item pairs, OT finds a global optimal
always attainable in RSs. Without supervision, can we matching matrix, whose values represent the matching
develop a framework to automatically denoise user-item cost (i.e., modeling effort). Specifically, the low user-item
DL4SR’22: Workshop on Deep Learning for Search and Recommen- matching cost indicates a user’s intrinsic preferences while
dation, co-located with the 31st ACM International Conference on the high value indicates noisy ones.
Information and Knowledge Management (CIKM), October 17-21, 2022,                                         However, to apply OT to achieve denoised recommen-
Atlanta, USA                                                                                           dation, three problems still remain: 1. Unlike the one-hot
*
  Corresponding author.
                                                                                                       constraint in previous applications of OT, the nature of
$ yctan@fzu.edu.cn (Y. Tan); j.carlyang@emory.edu (C. Yang);
weixy@zju.edu.cn (X. Wei); wuziyue@zju.edu.cn (Z. Wu);                                                 RSs requires a many-to-many matching; 2. The large
21831010@zju.edu.cn (W. Liu); xlzheng@zju.edu.cn (X. Zheng)                                            number of users and items makes the matching process
          © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License
          Attribution 4.0 International (CC BY 4.0).                                                   time-consuming; 3. A mechanism is needed to properly
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)
leverage the learned matching matrix and finally elimi-         Optimal transport (OT) theory with guaranteed optimal-
nate the effect of noises for denoised recommendation.          ity in the global matching and can be flexibly integrated
   To address these challenges, we first leverage the OT        with both deep and non-deep RS methods.
framework via Sinkhorn distance [8]. Unlike a one-hot              OT aims to find a matching between two distributions
matching matrix that satisfies the requirements of the          or spaces, which has been most widely used in transfer
many widely studied tasks (e.g., transfer learning [6, 9]),     learning. It can match one instance in the source domain
we compute a continuous many-to-many approxima-                 to another in the target domain with the least cost. Ex-
tion to the discrete OT, so as to meet the nature of RSs.       isting works mainly study discrete OT for one-to-one
Furthermore, we propose to relax the regularization in          matching. For example, in computer vision, OT has been
Sinkhorn to achieve a closed-form solution, which can           applied to domain adaption [4, 6] and style transfer [16].
reduce the time complexity for large-scale data from            OT has also been successfully applied in many natural
𝒪(max(𝑀, 𝑁 )3 ) to 𝒪(max(𝑀, 𝑁 )2 ) . Then, we rank              language processing tasks, such as topic modeling [7, 17],
the learned continuous matching cost to differentiate in-       text generation [18], and sequence-to-sequence learning
trinsic and noisy interactions for each user. Finally, we       [19]. Considering the effect of non-matched pairs, [9]
take into account that individual users’ behaviors can          proposed a partial one-to-one matching that leverages
vary in RSs, which should be handled differently during         one shared threshold to address domain shift for domain
denoising. To this end, we design a personalized thresh-        adaption instead of taking all pairs into account.
olding mechanism for relabeling user-item interactions,            However, the above discrete one-to-one matching can-
which efficiently finds a splitting point for each individual   not satisfy the nature of many-to-many user-item re-
user according to the ranked matching cost.                     lationships in RSs, where each user can interact with
   Experiment results on one real-world dataset with syn-       many items and each item can interact with many users.
thesized noises demonstrate our proposed ProRec’s abil-         Though OT is applied into cold-start recommendation
ity of denoising under various levels of noises. Further        problem in [20], they directly measure user-item similar-
experiments on three original real-world datasets also          ity without taking the effect of noises into consideration.
verify the advantages of ProRec, which demonstrate its          Moreover, partial matching with a shared threshold can-
significant performance boosts upon multiple popular RS         not reflect the individual user behaviors, and thus leads
models, in terms of effectiveness and efficiency.               to suboptimal recommendation performances.


2. Related work                                                 3. Methodology
Many research studies have pointed out that there exist         In this section, we detail the proposed Partial relaxed
a large proportion of noisy interactions in the implicit        optimal Transport for the denoised Recommendation
feedbacks (i.e., the clicks) [10, 1, 11]. These feedbacks are   (ProRec) framework, as shown in Figure 1. Specifically,
easily affected by different factors, such as the position      we first formulate the denoised recommendation as a two-
bias [12], caption bias [13] and exposure bias [10, 14].        step process. Then, we introduce the Optimal Transport
Therefore, there exist large gaps between the implicit          (OT) framework for distinguishing noises through global
feedbacks and the actual user preferences due to various        user-item matching. Furthermore, we elaborate on our
reasons, which are detrimental to the users’ continuous         proposed ProRec framework, which can be applied to
behaviors and overall satisfaction [2, 5].                      achieve denoised recommendations.
    To alleviate the above issues, many researchers have
considered incorporating auxiliary feedbacks to the iden-       3.1. Problem Statement
tification of noises in the implicit feedbacks, such as dwell
time [1] and scroll intervals [13]. These works usually         Following the setting of recommendation, we denote
design features manually and label items with external          the user-item interaction matrix as 𝑋 ∈ {0, 1}𝑀 ×𝑁 ,
help (e.g., from domain experts) [13, 2], which require         where 𝑀 denotes the number of users and 𝑁 denotes the
extensive effort and cost, which are not always available       number of items. We construct user-item relationships
across RSs. To this end, sheer observations on training         across a user embedding space 𝒟𝑢 and an item embedding
losses have recently been explored to denoise the implicit      space 𝒟𝑣 . The problem is that user-item pairs contain
feedbacks during training in an unsupervised fashion [5].       some noisy interactions which need to be relabeled.
The method is purely heuristic and can be integrated              In this work, we approach denoised recommendation
with deep learning methods only. Based on a principle           through two steps. Firstly, we distinguish intrinsic and
of the least modeling effort in different fields [4, 15], our   noisy interactions across 𝒟𝑢 and 𝒟𝑣 by a global match-
proposed framework aims to denoise the user-item in-            ing matrix 𝜋 * . Then, to stress users’ intrinsic preferences
teractions without supervision, which is supported by           and reduce noises, we rank the learned 𝜋 * and keep only
      1 2 3 4 5 6                    1     2     3     4     5     6                                 1 2 3 4 5 6
A     1    1    1   0    1    0     0.077 0.077 0.087 0.070 0.054 0.034     3 1 2 4 5 6              1     1   1   0   0      0
B     1    1    0   1    1    1     0.070 0.070 0.050 0.077 0.066 0.069     4 1 2 6 5 3              1     1   0   1   1      1
C     1    1    0   0    0    0     0.046 0.046 0.026 0.046 0.017 0.018     1 2 4 3 6 5              1     1   0   0   0      0
                                                             ∗
       Interaction Matrix X                Matching Matrix "              Personalized threshold !       Relabeled Matrix R
       Intrinsic preference
      Relabeled data                                             ProRec

Figure 1: A toy example illustrating the denoised recommendation based on our ProRec framework.

the top 𝜅 interactions of each user via a personalized equal to 1. 𝑀 is the number of users and 𝑁 is the number
thresholding mechanism. In this way, we learn a rela- of items. 𝜋 * denote the optimal transport plan.
beled matrix 𝑅 ∈ 𝑅𝑀 ×𝑁 for denoised recommendation.            Specifically, the matching cost matrix ℳ directly re-
                                                            flects the modeling effort, and thus the low user-item
3.2. Global Matching for Distinguishing                     matching cost can indicate users’ intrinsic preferences
                                                            while the high values can indicate noises. Based on the
       Noises                                               definition of ℳ, we can integrate the denoising process
As motivated in Section 1, a framework that can denoise with both non-deep and deep RS models as follows:
user-item interactions for more accurate recommenda- ℒ(𝑈 , 𝑉 ) = ||𝑋 − 𝑈 𝑉 𝑇 ||2𝐹 + 𝜁(||𝑈 ||22 + ||𝑉 ||22 ), (2)
tion is in great need. However, without supervision, it is    ℒ(𝑈 , 𝑉 ) = 𝒞(𝑋, 𝐺𝑦 (𝑈 , 𝑉 )) + 𝜁(||𝑈 ||22 + ||𝑉 ||22 ),
hard to define noisy interactions and then stress intrinsic                                                         (3)
preferences together while reducing noisy ones.             where 𝑈 ∈ R𝑀 ×𝑑 and 𝑉 ∈ R𝑁 ×𝑑 represent the user
   To distinguish noises in an unsupervised fashion, sev- and item embeddings. 𝐺𝑦 denotes a classification net-
eral works [4, 15, 21, 5] have pointed out a principled work. The hyperparameter 𝜁 is to balance the loss and the
way to locate noises by finding the data that needs the regularization term. In non-deep learning models, Eq. 2
most modeling effort or cost. Inspired by this principle can be solved effectively through many methods (e.g., Al-
of least modeling effort, in this work, we propose to refer ternative Least Square). Since the prediction scores rep-
to the noises as minor abnormal data. Compared with resent the user-item similarities where ℳ has negative
the interactions from intrinsic preferences, noises take values, we calculate the cost matrix as ℳ = −𝑈 𝑉 𝑇
much effort or cost to model.                               by using inner product. In deep learning models, we
   Based on the above least effort rationale, we propose can optimize Eq. 3 by adopting the widely used binary
a novel denoising framework to distinguish noises by cross-entropy loss (i.e., 𝒞(·, ·)). Then, similar calcula-
ranking the global matching between user and item em- tions can be made for those deep learning ones, where
bedding spaces. The framework is served as a plug-in, so ℳ = −𝐺𝑦 (𝑈 , 𝑉 ).
as to be flexibly integrated to both non-deep and deep         However, there are three technical challenges when
RS methods. In this scenario, the key to distinguishing applying OT to denoised recommendation:
noises boils down to finding a global user-item matching Challenge 1. In RSs, each user can interact with many
across the user space 𝒟𝑢 and the item space 𝒟𝑣 with items and each item interact with many users, which cor-
minimal matching cost.                                      responds to a many-to-many interaction matrix. Unlike
   To address the unsupervised matching problem be- previous applications of OT in CV or NLP, directly apply-
tween two spaces, we advocate a principled denoised RS ing the one-hot constraint specified in Eq. 1 is counter-
framework and ground on Optimal Transport (OT), which intuitive for recommendation.
has been successfully applied in many fields such as CV Challenge 2. In RSs, the number of users and items in
[6] and NLP [7]. As studied theoretically in [22], the the real world tends to be large, making the matching
OT formulation derived from the Earth Movers Distance process time-consuming.
(EMD) can be expressed as                                   Challenge 3. Besides distinguishing noises, it is nec-
                           𝑀 ∑︁
                          ∑︁   𝑁
                                                            essary to design a mechanism to eliminate the effect
               min                ℳ𝑖𝑗 𝜋𝑖𝑗 ,                 of noisy interactions. A naïve way is to set one global
          𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 )
                          𝑖=1 𝑗=1                       (1) thresholding hyperparameter 𝜅 to keep the top 𝜅 items
                           1                1               with the least matching cost for each user. However, the
          𝑠.𝑡. 1𝑀 𝜋 = 1𝑁 , 1𝑁 𝜋 𝑇 =           1𝑀 ,
                          𝑁                 𝑀               global thresholding hyperparameter cannot accommo-
where 𝒳 (𝒟𝑢 , 𝒟𝑣 ) denotes the joint probability distribu- date individual user behaviors.
tion of user embedding space 𝒟𝑢 and item embedding
space 𝒟𝑣 . 1𝑀 ∈ R1×𝑀 , 1𝑁 ∈ R1×𝑁 with all elements
3.3. The ProRec Framework                                   Proposition 1. The constraint of problem in Eq. 4 is re-
                                                                                                                   −ℳ
In this section, we propose a Partial relaxed optimal laxed by Eq.5 and Eq. 6. By defining ℳ̃ = 𝑒
                                                                                                                     𝛾 , the


Transport for denoised Recommendation (ProRec) frame-       closed-form    solution   for the global optimal matching   ma-
work to address the above three challenges.                 trix is:
                                                              *                                                      𝑇
Many-to-many matching to meet the nature of RSs. 𝜋 = max(ℳ̃ diag(𝑝⊘1𝑀 ℳ̃), diag(𝑞⊘1𝑁 ℳ̃ )ℳ̃),
To address Challenge 1, we look for a smoother version of                                                                 (7)
OT by lowering the sparsity and increasing the entropy of where ⊘ corresponds to the element-wise division.
the matching matrix [4]. Among the many OT algorithms
                                                            Proof. Based on Eq. 6 and using the Lagrange multiplier
(e.g., ADMM and Proximal Point Method [23, 24]), we
                                                            method we have:
choose the Sinkhorm algorithm [8] due to its simplicity               𝑀 ∑︁𝑁
and efficiency. We achieve OT with a many-to-many
                                                                     ∑︁
                                                              ℒ=              ℳ𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋) − (1𝑁 𝜋 𝑇 − 𝑞)𝑓 𝑇 , (8)
matching through the Sinkhorn algorithm as follows:                  𝑖=1 𝑗=1
                          𝑀 ∑︁𝑁
                         ∑︁                                 where 𝑓 𝑇 is Lagrangian multipliers.
               min               ℳ𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋)               Then, in order to find the optimal solution, we let the
         𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 )
                         𝑖=1 𝑗=1                        (4)
                                                            gradient 𝜕𝜋  𝜕ℒ
                                                                           𝑖𝑗
                                                                              = 0, which means:
         𝑠.𝑡., 1𝑀 𝜋 = 𝑝, 1𝑁 𝜋 𝑇 = 𝑞,                                     𝜕ℒ
                                                                                = ℳ𝑖𝑗 + 𝛾 log 𝜋𝑖𝑗 − 𝑓𝑖 = 0,               (9)
where 𝐻(𝜋) = 𝑀
                  ∑︀       ∑︀𝑁
                      𝑖=1    𝑗=1 𝜋𝑖𝑗 (log(𝜋𝑖𝑗 ) − 1). 1𝑀 ∈              𝜕𝜋𝑖𝑗
R1×𝑀 , 1𝑁 ∈ R1×𝑁 are with all elements equal to 1. 𝛾 is In this case, we can obtain the solution of 𝜋𝑖𝑗 as
a hyperparameter to balance the entropy regularization                                         𝑓𝑖 −ℳ𝑖𝑗
                                                                                     𝜋𝑖𝑗 = 𝑒 𝛾 .                        (10)
and OT.
    Compared with most studies of discrete OT in CV, the       The    users and   items   that  have different interactions,
major advantages of the Sinkhorn algorithm above are which means both users and items are based on differ-
as follows: i) By relaxing the discrete constraint, we can ent degrees of popularity. To take such popularity into
obtain a continuous matrix instead of a one-hot one. In consideration, we count and norm the number of items
this way, the obtained matching matrix can capture the that have interacted with user 𝑖. Based on the popularity
many-to-many relationships between users and items. of user 𝑖 ( i.e.,       ∑︀𝑞𝑁𝑖 ), we add constraints of the matching
ii) By adding an entropy regularization, we address the     matrix    𝜋 by     𝑗=1 𝜋𝑖𝑗 = 𝑞𝑖 . Then, we have
sparsity problem of 𝜋 * . The dense cost matrix can be                           𝑁
                                                                                ∑︁    𝑓𝑖 −ℳ𝑖𝑗

used for subsequent ranking. iii) By modifying the distri-                           𝑒 𝛾       = 𝑞𝑖 .                (11)
bution constraint of 𝜋 from uniform (i.e., 1𝑀 𝜋 = 𝑁1 1𝑁 )                       𝑗=1

to popularity-based (i.e., 1𝑀 𝜋 = 𝑝), we model non-             Since the Lagrange multiplier 𝑓𝑖 is not related to 𝑗, we
uniform distributions where users and items have differ- can extract this part and rewrite the formula as:
ent number of interactions, and successfully adapt the                            𝑁
                                                                               1 ∑︁ − ℳ𝛾𝑖𝑗           𝑓
                                                                                                   − 𝑖
optimization to the recommendation scenario. Specifi-                                 𝑒        =𝑒 𝛾 .                (12)
                                                                              𝑞𝑖 𝑗=1
cally, we count the interactions  and then normalize  them
to obtain 𝑝𝑖 and 𝑞𝑗 , so 𝑁                                     Take the logarithm of both sides of the above equation
                             𝑖=1 𝑝𝑖 = 1 and
                         ∑︀                  ∑︀𝑀
                                               𝑗=1 𝑞𝑗 = 1.
Relaxed OT to reduce the time complexity. Note               and  move 𝛾 to the other side:
that the optimal 𝜋 is dense and needs to be com-
                       *
                                                                                           1 ∑︁ − ℳ𝛾𝑖𝑗
puted through multiple iteractions. Therefore, it is time-                 𝑓𝑖 = −𝛾 log(          𝑒        ).         (13)
                                                                                          𝑞𝑖 𝑗
consuming in RSs with large numbers of users and items.
Inspired by the Relaxed OT for EMD [17], we extend the         To keep the formula simple and clear, we define ℳ̃ =
original Sinkhorn algorithm with a relaxed regularization. − ℳ
                                                             𝑒 𝛾 , so the formula above can be written as:
Specifically, we use two auxiliary distances, each essen-
tially is the Sinkhorn with only one of the constraints in                                               𝑇
                                                                          𝑓 = 𝛾 log 𝑞 − 𝛾 log(1𝑁 ℳ̃ ).               (14)
Eq. 4:
                 𝑀 ∑︁
                ∑︁   𝑁                                         According to Eq. 10 and Eq. 12, we have
                                                                                                       ℳ𝑖𝑗
     min                ℳ 𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋) 𝑠.𝑡. 1𝑀 𝜋 = 𝑝,                          𝑓𝑖 −ℳ𝑖𝑗            −
𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 )                                                                                   𝑒 𝛾
                𝑖=1 𝑗=1                                                  𝜋𝑖𝑗 = 𝑒 𝛾         = 𝑞𝑖              .       (15)
                                                                                                ∑︀ − ℳ𝑖𝑗
                                                         (5)                                         𝑒    𝛾
                 𝑀 ∑︁𝑁                                                                             𝑗
                                                                Finally, based on  Eq.  15 and the  definition of ℳ̃ =
                ∑︁                                  𝑇
     min                ℳ    𝜋
                          𝑖𝑗 𝑖𝑗 +𝛾𝐻(𝜋)    𝑠.𝑡. 1𝑁 𝜋    =  𝑞.
𝜋∈𝒳 (𝒟 𝑢 ,𝒟 𝑣 )                                               −ℳ
                𝑖=1 𝑗=1                                      𝑒 𝛾 , we obtain the closed-form solution via a matrix
                                                         (6) form as follow
                                                                                                     𝑇
                                                                           𝜋𝑉 = diag(𝑞 ⊘ 1𝑁 ℳ̃ )ℳ̃,                  (16)
where ⊘ corresponds to element-wise division.                      mean value of top 𝜂 items while 𝑐2𝜂 = 𝑁 1−𝜂 𝑁
                                                                                                                 ∑︀
                                                                                                                    𝑗=𝜂+1 𝜌𝑖𝑗
   Following the same derivation, we can obtain the                represents mean of the rest 𝑁 − 𝜂 items. After obtaining
closed-form solution of Eq. 5 as follow                            the index of splitting points 𝜅, we have the correspond-
               𝜋𝑈 = ℳ̃ diag(𝑝 ⊘ 1𝑀 ℳ̃).             (17)           ing threshold values 𝜌𝑖,𝜅𝑖 . Based on the personalized
   The original Sinkhorn minimizes 𝜋 under two con-                threshold 𝜌𝑖,𝜅𝑖 , we can relabel by 𝑟𝑖𝑗 ∈ R as follows:
straints. By relaxing one constraint, we can balance the                                            1
                                                                               𝑟𝑖𝑗 =                              ,      (21)
computation efficiency and improved preformance. To                                   1 + exp(−𝛽(𝜌𝑖𝑗 − 𝜌𝑖,𝜅𝑖 ))
approximate the solution under the original constraints,           where 𝑟𝑖𝑗 is a monotonically increasing function accord-
we adopt the max operation, which is consistent with               ing to the value of 𝜌𝑖𝑗 − 𝜌𝑖,𝜅𝑖 . 𝛽 is a hyperparameter to
the results in previous work [16]. So we have                      control the inclination of the function.
                  𝜋 * = max(𝜋𝑈 , 𝜋𝑉 )               (18)              Finally, to make the relabeling process more flexible,
                                                                   we propose to use a hyperparameter 𝜆 to control the
This is equivalent to:
                                                      𝑇
                                                                   degree of relabeling and obtain a new interaction matrix
𝜋 * = max(ℳ̃ diag(𝑝⊘1𝑀 ℳ̃), diag(𝑞⊘1𝑁 ℳ̃ )ℳ̃). as follows:
                                                            (19)                  𝑋 = 𝜆𝑋 + (1 − 𝜆)𝑅 ⊙ 𝑋,                 (22)
                                                                   where ⊙ corresponds to the element-wise product. Con-
   Note that, the closed-form solution yields a time com- sidering that the noises in sparse positive interactions
plexity of 𝑂(max(𝑀, 𝑁 )2 ). This shows that by relaxing would induce worse effects in recommendation [5], in
the OT formulation with auxiliary limitation from only this work, we only target the part of the original dataset
one side, our model can achieve a lower time complexity where 𝑋𝑖𝑗 = 1. In this way, the newly generated 𝑋 can
than the 𝑂(max(𝑀, 𝑁 )3 ) reported in [6], which uses OT keep users’ intrinsic preferences and reduce the noise
for one-to-one matching.                                           ones. We summarize the learning procedure of our pro-
Personalized thresholding mechanism for denoised posed ProRec in Algorithm 1, which is proved to guaran-
recommendation. To flexibly discriminate the in- tee the local convergence for denoised recommendation.
trinsic and noisy user-item interactions for individual
users, we propose a non-parametric personalized thresh- 4. Experiments
olding mechanism. We first normalize the matching
                                      𝜋*
cost by each user as 𝜋𝑖𝑗 *′ = ∑︀ 𝑖𝑗𝜋* , and rank them In this section, we first conduct controlled experiments
                                                                   with synthetic noises on one dataset, so as to investi-
                                      𝑗 𝑖𝑗
as 𝜌𝑖 = sort(𝜋𝑖*′ ). According to the definition of the
                                                                   gate ProRec’s performance to different levels of noisy
matching matrix, each row represents users’ preferences
                                                                   interactions. Then, we evaluate ProRec’s performance
towards the items. We define 𝜅 = {𝜅1 , . . . , 𝜅𝑀 } as the
                                                                   on three original real-world datasets of recommendation
index of the threshold which can filter out users’ noisy
                                                                   with implicit feedback.
preferences. 𝜅𝑖 denotes user 𝑖’s threshold. As shown
in Figure 1, some users with consistent (narrow) inter-
ests (i.e., user 𝐴) would benefit from a reasonably low 4.1. Experimental Settings
threshold since a few items can already represent their
                                                                   4.1.1. Dataset and evaluation protocols.
preferences and more items can introduce more noises.
Meanwhile, some users with diverse (wide) interests (i.e., We use ML-100k for the noise-controlled experiments
user 𝐵) would require the threshold to be relatively high and do a 4:1 random splitting for train/test data follow-
to model a broader range of preferences and satisfy the ing [26]. To simulate the noise, we randomly select
need for exploration. To find a personalized splitting 5%/10%/15%/20% of ground-truth records and flip labels
point 𝜅𝑖 for each user according to the learned matrix, in the train set. The selected records can be regarded
the model requires an automatic splitting mechanism. as noises during training. For real data experiments, we
Inspired by the threshold selection mechanism used in use Amazon music (AMusic)1 , Amazon toys (AToy)1 , and
CART [25], we efficiently learn the personalized thresh- Yahoo2 . These datasets have been widely adopted in
olds 𝜅𝑖 for different users by minimizing the following previous literature [27, 5, 28], and their statistics are sum-
sum of square errors:                                              marized in Table 1. We do a 5:2:3 splitting for them fol-
                                                                   lowing [28]. We evaluate the ability to distinguish noise
                 [︃ 𝜂                     𝑁
                                                              ]︃
                   ∑︁ (︀         )︀2     ∑︁               )︀2
                        𝜌𝑖𝑗 − 𝑐1𝜂 +             𝜌𝑖𝑗 − 𝑐2𝜂        , by Hit Ratio, and recommendation performances by nor-
                                              (︀
𝜅𝑖 = arg min
             𝜂
                   𝑗=1                 𝑗=𝜂+1                       malized discounted cumulative, mean average precision,
                                                            (20) and recall.
where 𝜌𝑖𝑗 is the 𝑗-th value of the∑︀ sorted 𝜌𝑖 and the index
𝜂 ∈ {1, 2, · · · , 𝑁 −1}. 𝑐1𝜂 = 𝜂1 𝜂𝑗=1 𝜌𝑖𝑗 represents the 1 https://github.com/familyld/DeepCF/tree/master/Data
                                                               2
                                                                   https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
  Algorithm 1: Partial Relaxed Optimal Transport                 trix Factorization, which is a deep MF framework; VAE-
  for Denoised Recommendation (ProRec)                           CF [31] – Variational Autoencoder for Collaborative Fil-
   Data: The noisy user-item interactions 𝑋, user                tering, which is one of the state-of-the-art deep learning
          popularity 𝑝 and item popularity 𝑞                     based recommendation algorithm [28]; LightGCN [32] –
   Result: Denoised user embeddings 𝑈 and item                   LightGCN devises a light graph convolution for training
            embeddings 𝑉 , a relabeled matrix 𝑅.                 efficiency and generation ability on recommendation sce-
 1 while not converged do
                                                                 nario; EGLN [33] – EGLN enhances graph learning and
 2     1. Update 𝑈 and 𝑉 ;                                       node embedding modules iteratively based on mutual
 3     if Learning 𝑈 and 𝑉 via non-deep methods                  information maximization.
        then Update 𝑈 and 𝑉 via                                     Besides the above methods, T-CE [5] and SGDL [34]
        ||𝑋 − 𝑈 𝑉 𝑇 ||2𝐹 + 𝜁(||𝑈 ||22 + ||𝑉 ||22 in              are the existing frameworks for denoised recommenda-
        Eq. 2 and calculate ℳ = −𝑈 𝑉 𝑇 ;                         tions. The first one discards the large-loss samples with
 4     else Update 𝑈 and 𝑉 via                                   a dynamic threshold in each iteration and the second one
        𝒞(𝑋, 𝐺𝑦 (𝑈 , 𝑉 )) + 𝜁(||𝑈 ||22 + ||𝑉 ||22 ) in           collects memorized interactions at the early stage of the
        Eq. 3, where 𝒞(·, ·) is binary cross-entropy             training and leverages those data as denoising signals to
        loss; Then calculate ℳ = −𝐺𝑦 (𝑈 , 𝑉 ),                   guide the following training. Both of them are designed
        where 𝐺𝑦 is a classification network;                    as plug-ins for deep learning methods. We add CDAE
 5     2. Update 𝑋;                                              + T-CE and LightGCN + SGDL as baselines to compare
 6     Model denoised recommendation problem by                  the ability of denoising with the proposed ProRec, which
                                                                 are shown to win the best performance in their original
        minimizing 𝑀
                     ∑︀     ∑︀𝑁
                        𝑖=1    𝑗=1 ℳ𝑖𝑗 𝜋𝑖𝑗 + 𝛾𝐻(𝜋),              works [5, 34].
        where 1𝑀 𝜋 = 𝑝, 1𝑁 𝜋 𝑇 = 𝑞, in Eq. 4;                       For all the algorithms (except for CDAE + T-CE and
 7     Compute a global optimal matching matrix                  LightGCN + SGDL), we add the proposed ProRec on top
        𝜋 * by the closed-form solution                          of them, allowing both non-deep and deep RS models to
        𝜋 * = max(ℳ̃ diag(𝑝 ⊘ 1𝑀 ℳ̃), diag(𝑞 ⊘                   discriminate intrinsic and noisy interactions in an unsu-
               𝑇
        1𝑁 ℳ̃ )ℳ̃) in Eq. 7;                                     pervised fashion. Specifically, we have SVD + ProRec,
 8     Compute personalized thresholds 𝜅 by                      NCE + ProRec, CDAE + ProRec, WRMF + ProRec, VAE-
        automatic splitting mechanism in Eq. 20;                 CF + ProRec, LightGCN + ProRec, and EGLN + ProRec,
 9     Relabel user-item interaction by sorting 𝜅 in             respectively.
        Eq. 21;
10     Update 𝑋 by 𝑋 = 𝜆𝑋 + (1 − 𝜆)𝑅 ⊙ 𝑋 in                      4.1.3. Implementation details
        Eq. 22;
11 end
                                                                 Implementations of the compared baselines are from
                                                                 open-source projects (SVD3 , NCE3 , CDAE3 , WRMF3 ,
                                                                 VAE-CF3 , LightGCN4 , EGLN5 , T-CE6 and SGDL7 ). We
Table 1                                                          optimize ProRec with standard Adam. We tune all hyper-
Statistics of the datasets used in our experiments.
                                                                 parameters through grid search. In particular, learning
  Dataset     # User      # Item    # Interaction     Sparsity   rate in {0.0005, 0.001, 0.005, 0.01}, 𝜆 in {0.25, 0.5, 0.75,
  ML-100k     942         1,447     55,375            0.9594     1.0}, 𝛾 in {0.05, 0.075, 0.1, 0.125, 0.15, 0.175}, 𝛽 in {1, 5,
  AMusic      1,776       12,929    46,087            0.9980     10, 20, 50}, 𝜁 in {0.0005, 0.001, 0.005, 0.01}, and the batch
   AToy       3,137       33,953    84,642            0.9992     size in {100, 250, 500, 1000} for different datasets. We
   Yahoo      1,948,882   46,110    48,817,561        0.9995     also carefully tuned the hyperparameters of all baselines
                                                                 through cross-validation to achieve their best perfor-
4.1.2. Methods for comparison                                    mances.
We compare two types of algorithms for recommenda-
tion. The first type is non-deep algorithms, which include: 4.2. Controlled Experiments (Synthetic
SVD [29] – A similarity-based method that constructs                 Noises)
a similarity matrix through SVD decomposition of the
implicit matrix 𝑋; NCE [28] – Noise contrastive esti- Before looking at the performance of recommendation,
mation item embeddings combined with a personalized we first inspect ProRec’s ability to distinguish noises.
user model based on PLRec [30]. The second type is
                                                            3
deep learning methods, including CDAE [10] – Collabo- 4 https://github.com/wuga214/NCE_Projected_LRec
rative Denoising Autoencoder is a deep learning method 5 https://github.com/gusye1234/LightGCN-PyTorch
                                                              https://github.com/yimutianyang/SIGIR2021-EGLN
specifically optimized for implicit feedback recommen- 6 https://github.com/WenjieWWJ/DenoisingRec
dation tasks; WRMF [10] – Weighted Regularized Ma- https://github.com/zealscott/SGDL
                                                            7
           0.910
                      90.730%90.665%                             0.040                                               0.040
                                                   0.162
           0.905                                                 0.038                                               0.038
                                     90.195%        0.161        0.036                                               0.036
                                                                                              NCE+PreRec(NDCG)                             NCE+PreRec(NDCG)


                                                        Recall
           0.900                            89.983%              0.034                        NCE+PreRec(Recall)     0.034                 NCE+PreRec(Recall)
      HR
                                                   0.160                                      NCE(NDCG)                                    NCE(NDCG)
                                                                 0.032                        NCE(Recall)            0.032                 NCE(Recall)
           0.895                                   0.159         0.030                                               0.030
                   NCE
                   NCE + ProRec                    0.158         0.028                                               0.028
           0.890 0% 5% 10% 15% 20%                               0.026 0                                             0.026
                                                                               0.25     0.5        0.75        1.0      0.050 0.075 0.100 0.125 0.150 0.175
                             Noise level
            (a) HR/Recall, controlled dataset                              (b) Varying 𝜆 on AMusic                         (c) Varying 𝛾 on AMusic

Figure 2: (a) Testing HR/Recall with different noise levels on the noise-controlled dataset. (b)-(c) Hyperparameter effect on
the proposed ProRec.

Table 2
Comparison between the original baselines and the ones plus ProRec on three benchmark datasets. The best performances of
non-deep and deep learning methods are in boldface. Statistically tested significant improvements with 𝑝-value < 0.01 brought
by ProRec compared with the original methods are indicated with *.

 model                                 N@5            M@5             R@5             N@5              M@5           R@5          N@5          M@5          R@5
                                                      AMusic                                            AToy                                   Yahoo
 SVD [29]                              0.0363         0.0667          0.0264          0.0116           0.0212        0.0084       0.1614       0.2316       0.1430
 NCE [28]                              0.0375         0.0693          0.0276          0.0143           0.0262        0.0104       0.2498       0.3440       0.2249
 CDAE [10]                             0.0074         0.0124          0.0060          0.0046           0.0080        0.0033       0.0716       0.1022       0.0652
 WRMF [10]                             0.0236         0.0435          0.0171          0.0086           0.0157        0.0062       0.2503       0.3244       0.2321
 VAE-CF [31]                           0.0146         0.0258          0.0115          0.0117           0.0212        0.0084       0.2622       0.3423       0.2380
 LightGCN [32]                         0.0406         0.0830          0.0368          0.0162           0.0296        0.0117       0.2750       0.3729       0.2494
 EGLN [33]                             0.0422         0.0866          0.0386          0.0166           0.0303        0.0120       0.2806       0.3809       0.2543
 CDAE + T-CE [5]                       0.0080         0.0143          0.0062          0.0049           0.0080        0.0038       0.0714       0.1032       0.0648
 LightGCN + SGDL [34]                  0.0417         0.0852          0.0378          0.0166           0.0304        0.0120       0.2825       0.3830       0.2561
 SVD + ProRec                          0.0374*        0.0677          0.0275*         0.0123*          0.0220*       0.0093*      0.1987*      0.2820*      0.1763*
 NCE + ProRec                          0.0396*        0.0723*         0.0298*         0.0149*          0.0267*       0.0108*      0.2619*      0.3562*      0.2364*
 CDAE + ProRec                         0.0080*        0.0148*         0.0065*         0.0049           0.0084        0.0038*      0.0718       0.1028       0.0660*
 WRMF + ProRec                         0.0241*        0.0442*         0.0179*         0.0096*          0.0165*       0.0077*      0.2699*      0.3358*      0.2403*
 VAE-CF + ProRec                       0.0162*        0.0266*         0.0120          0.0129*          0.0231*       0.0102*      0.2743*      0.3535*      0.2472*
 LightGCN + ProRec                     0.0424*        0.0866*         0.0379*         0.0168*          0.0302*       0.0120*      0.2924*      0.3918*      0.2636*
 EGLN + ProRec                         0.0445*        0.0915*         0.0406*         0.0173*          0.0317*       0.0124*      0.2973*      0.4012*      0.2705*

Since we know the index of the randomly added noises in                                have the following observations.
the dataset, we evaluate the level of distinguished noises                                In three different domains, by integrating ProRec with
by Hit Ratio (HR) of the real added noises in Figure 2(a).                             both non-deep and deep methods, we can consistently im-
Specifically, we observe a consistent high HR of NCE +                                 prove strong baseline models and achieve start-of-the-art
ProRec around 90% when the level of noises increases.                                  performance. Since the graph-based model can leverage
Furthermore, we test Recall of NCE and NCE + ProRec                                    high-order relations among users and items, EGLN out-
under different levels of noises. It can be clearly observed                           performs all baselines on three datasets. Furthermore, our
that NCE + ProRec always performs better than NCE, as                                  proposed EGLN + ProRec outperforms all deep-learning
the former can accurately relabel some noises to elimi-                                baselines on three datasets, ranging from 3.34% (achieved
nate the effect of noises. For example, when the noise                                 on AToy on Recall@5) to 6.38% (achieved on Yahoo on
level is at 20%, NCE + ProRec significantly improves NCE                               Recall@5), while NCE + ProRec outperforms all non deep-
from 0.1579 to 0.1596.                                                                 learning baselines on three datasets, ranging from 1.91%
                                                                                       (achieved on AToy on MAP@5) to 7.97% (achieved on
4.3. Real Data Experiments                                                             AMusic on Recall@5). All of these show that ProRec is ca-
                                                                                       pable of effective recommendation on different datasets.
4.3.1. Overall Performance Comparison                                                     One step further, we observe that the methods based
                                                                                       on the proposed ProRec framework all outperform the
We compare the recommendation results of the proposed
                                                                                       original models, which indicates the advantage of the
ProRec to those of the baseline models. Table 2 shows
                                                                                       denoising process. Compared with existing denoising
the NDCG, MAP, and Recall scores on three datasets. We
                               0    357   3565 1659 2304 1890 1418                   357    3565 1659 2304 2630 1418

                               1   1099 1876 3462 3161 1197       640                1099 1876 3462 1352 1197      640

                               2   1361   640   1170 1182   721   476      Titanic   1361 2186 1170 1182     721   3436
                                                                          [Drama,
                               3   1171 2599 1073 1169 1015 2274         Romance]    1171 1194 1933 1169 1015 2274

                               4   2514 2241 1672 1462 2177 2245                     2514 2241 1672 3560 1799 2245

                               5   3395   379   896   1347 1077 1513                 3395   379   896   1347 1077 2847

                     Friday the 13th Part 3: 3D                    Shakespeare in Love                     Runaway Bride
                              [Horror]                             [Comedy, Romance]                     [Comedy, Romance]

Figure 3: The denoising process of ProRec.

Table 3                                                                     Sinkhorn, it is ineffective to directly adopt one-to-one
Impact of different model components of ProRec on AMusic.                   matching (i.e., NCE + EMD) due to the removal of many
                                                                            matched interactions. For example, NCE + Sinkhorn
  Metric       N@5    R@5          Time N@5    R@5                Time
                                                                            significantly improves NCE from 0.0276 to 0.0281 (on
  Method             NCE                      EGLN
                                                                            Recall@5 on AMusic) while both metrics of NCE + EMD
  Base         0.0375 0.0276       64s 0.0422 0.0386              82s
  +EMD         0.0360 0.0259       98s 0.0406 0.0371              124s      are lower than those of the original NCE.
  +Sinkhorn    0.0384 0.0281       132s 0.0431 0.0398             174s      The efficiency of relaxed regularization.By balancing
  +ROT         0.0383 0.0281       76s 0.0430 0.0392              109s      the performance and the runtime via relaxing the opti-
  +ProRec      0.0396 0.0298       82s 0.0445 0.0406              118s      mized constraints, we can observe that NCE + ROT can
                                                                            reduce around half of the runtime while keeping almost
frameworks (i.e., T-CE and SGDL), ProRec can not only                       the same NDCG and Recall on AMusic and Yahoo.
be flexibly integrated with both deep and non-deep RS                       The effect of personalized thresholding mechanism.
methods but also gain more improvements.                                    We first use one global threshold for ROT to study the
   Moreover, on top of existing RS methods, our proposed                    effectiveness of the personalized thresholding mecha-
ProRec in Eq. 4 does not introduce significant compu-                       nism. In the experiment, we search the hyperparameter
tations. In the third and sixth columns in Table 3, we                      of global thresholds 𝜎 in {5, 10, 15, 20, 25} and find that
compare the model’s training time under the same exper-                     NCE + ROT achieves the best performance when 𝜎 = 10.
imental setting (e.g., embedding dimension). As can be                      Compared with NCE + ROT, NCE + ProRec can individu-
observed, the training time of the proposed NCE + ProRec                    ally discriminate intrinsic preferences and noises without
and LightGCN + ProRec are within the same scale as NCE                      tuning the hyperparameter of the threshold, and then
and LightGCN on AMusic dataset.                                             improve the performance of the partial OT framework.
   Note that the improvements brought by the denois-                        For example, the performance of NCE + ProRec achieves
ing process on the denser datasets (e.g., the performance                   0.0298, which outperforms 0.0281 of NCE + ROT on Re-
gains between 0% noise level and 5% noise level on ML-                      call@5 on AMusic dataset.
100k shown in Figure 2(a)) are less than those on the                       The effect of hyperparameters. Our proposed ProRec
sparse datasets (e.g., the performance gains between                        framework introduces four hyperparameters: 𝛾, 𝛽, 𝜁 and
EGLN + ProRecn and EGLN on Yahoo in Table 2). Al-                           𝜆, where we empirically set 𝛽 to 20 and 𝜁 to 0.001. 𝜆
though we can explicitly eliminate some effects of noisy                    controls the degree of the relabeling and 𝛾 controls the
interactions (as shown in Figure 2(a)), they only have a                    sparsity of the matching matrix. Take NCE + ProRec for
minor impact on the learned embeddings when interac-                        example (shown in Figure 2(b)-2(c)), the optimal 𝜆 and 𝛾
tions are enough in the observed data, which is consistent                  on AMusic are found to be 0.5 and 0.1, respectively. In
with the results in recent studies [35].                                    general, we can set 𝜆 to 0.5 for denser datasets and 0.25
                                                                            for the sparser datasets and always set 𝛾 to 0.1.
4.3.2. Model Ablation and Hyperparameter Study
                                                                            4.3.3. Denoising Case Study
In this section, three groups of ablation studies are con-
ducted to evaluate the impacts of our proposed many-to-                     To demonstrate the advantages of the denoising process,
many matching, relaxed regularization, and personalized                     we visualize the interaction matrix given by ML-100k on
thresholding mechanisms, as well as the effects of hyper-                   the left and the learned matrix of the proposed NCE +
parameters.                                                                 ProRec on the right (as shown in Figure 3). The num-
The effect of continuous many-to-many matching.                             bers in the boxes are actual movie IDs. The colors in
Compared with the many-to-many matching of NCE +                            the original interaction matrix (left) are the same, indi-
cating uniform weights, whereas those in the relabeled        [6] B. Bhushan Damodaran, B. Kellenberger, R. Fla-
matrix (right) are different. The different depths of blue        mary, D. Tuia, N. Courty, Deepjdot: Deep joint
represent the matching cost (the deeper, the lower) and           distribution optimal transport for unsupervised do-
white boxes denote the intrinsically preferred items. As          main adaptation, in: ECCV, 2018.
we can see from the different colors, our many-to-many        [7] V. Huynh, H. Zhao, D. Phung, Otlda: A geometry-
matching matrix effectively discriminates intrinsic and           aware optimal transport approach for topic model-
noisy interactions. Moreover, we can also observe the             ing, NIPS (2020).
personalized thresholding mechanism to work as differ- [8] M. Cuturi, Sinkhorn distances: Lightspeed compu-
ent numbers of items is being replaced for different users.       tation of optimal transport (2013).
Furthermore, we show several movies relevant to user 0. [9] R. Xu, P. Liu, Y. Zhang, F. Cai, J. Wang, S. Liang,
Since most interacted movies of user 0 are from the ro-           H. Ying, J. Yin, Joint partial optimal transport for
mance category, such as Titanic and Shakespeare in Love,          open set domain adaptation, in: IJCAI, 2020.
the horror movie is abnormal and is unlikely to reflect [10] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering
her/his intrinsic preference. NCE + ProRec automatically          for implicit feedback datasets, in: ICDM, 2008.
learns to highlight the two romantic movies. After down- [11] Y. Liu, Q. Liu, Y. Tian, C. Wang, Y. Niu, Y. Song,
weighing Friday the 13th Part 3: 3D, the scoring of the           C. Li, Concept-aware denoising graph neural net-
user’s preference changes, and thus the ranking of an-            work for micro-video recommendation, in: Proceed-
other romantic movie (Runaway Bride) rises to show up             ings of the 30th ACM International Conference on
in the top 𝐾 = 6 candidate list for user 0.                       Information & Knowledge Management, 2021, pp.
                                                                  1099–1108.
                                                             [12] R. Jagerman, H. Oosterhuis, M. de Rijke, To model
5. Conclusion                                                     or to intervene: A comparison of counterfactual
                                                                  and online learning to rank from user interactions,
In this paper, we propose a novel ProRec framework for
                                                                  in: SIGIR, 2019.
the recommendation with implicit feedback. ProRec can
                                                             [13] H. Lu, M. Zhang, S. Ma, Between clicks and sat-
effectively denoise the user-item interactions by flexibly
                                                                  isfaction: Study on multi-phase user preferences
serving as a plug-in, so as to further improve the recom-
                                                                  and satisfaction for online news reading, in: SIGIR,
mendation accuracy for both non-deep and deep learning
                                                                  2018, pp. 435–444.
RS methods. We demonstrate the superior performance
                                                             [14] T. Schnabel, A. Swaminathan, A. Singh, N. Chan-
of ProRec in recommendation through extensive experi-
                                                                  dak, T. Joachims, Recommendations as treatments:
ments and showcase its effectiveness in denoising user-
                                                                  Debiasing learning and evaluation, JMLR (2016).
item interactions through insightful case studies. Since
                                                             [15] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, L. Fei-Fei, Men-
we achieve denoising in a fully unsupervised fashion
                                                                  tornet: Learning data-driven curriculum for very
without accessing any additional user or item data, we
                                                                  deep neural networks on corrupted labels, in: ICML,
expect ProRec to incur no more negative societal impact
                                                                  2018.
than any basic recommender system.
                                                             [16] N. Kolkin, J. Salavon, G. Shakhnarovich, Style trans-
                                                                  fer by relaxed optimal transport and self-similarity,
References                                                        in: CVPR, 2019.
                                                             [17] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From
  [1] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling        word embeddings to document distances, in: ICML,
      dwell time to predict click-level satisfaction, in:         2015, pp. 957–966.
      WSDM, 2014.                                            [18] L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen,
  [2] H. Lu, M. Zhang, W. Ma, C. Wang, F. xia, Y. Liu,            Y. Zhang, G. Wang, R. Zhang, L. Carin, Adversarial
      L. Lin, S. Ma, Effects of user negative experience in       text generation via feature-mover’s distance, in:
      mobile news streaming, in: SIGIR, 2019.                     NIPS, 2018.
  [3] B. Yang, S. Lee, S. Park, S.-g. Lee, Exploiting vari- [19] L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan,
      ous implicit feedback for collaborative filtering, in:      H. Zhang, B. Li, D. Shen, C. Chen, L. Carin, Im-
      WWW, 2012.                                                  proving sequence-to-sequence learning via optimal
  [4] N. Courty, R. Flamary, D. Tuia, A. Rakotomamonjy,           transport, in: ICLR, 2019.
      Optimal transport for domain adaptation, TPAMI [20] Y. Meng, X. Yan, W. Liu, H. Wu, J. Cheng, Wasser-
      39 (2017) 1853–1865.                                        stein collaborative filtering for item cold-start rec-
  [5] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denois-        ommendation, in: Proceedings of the 28th ACM
      ing implicit feedback for recommendation, WSDM              Conference on User Modeling, Adaptation and Per-
      (2021).                                                     sonalization, 2020, pp. 318–322.
[21] Q. Xu, J. Xiong, X. Cao, Q. Huang, Y. Yao, From               rative filtering, in: SIGIR, 2019.
     social to individuals: a parsimonious path of multi-     [29] P. Cremonesi, Y. Koren, R. Turrin, Performance of
     level models for crowdsourced preference aggrega-             recommender algorithms on top-n recommenda-
     tion, TPAMI (2018).                                           tion tasks, in: RecSys, 2010.
[22] A. Gretton, D. Sejdinovic, H. Strathmann, S. Bal-        [30] S. Sedhain, H. Bui, J. Kawale, N. Vlassis, B. Kveton,
     akrishnan, M. Pontil, K. Fukumizu, B. K. Sriperum-            A. K. Menon, T. Bui, S. Sanner, Practical linear mod-
     budur, Optimal kernel choice for large-scale two-             els for large-scale one-class collaborative filtering,
     sample tests, in: NIPS, 2012.                                 in: IJCAI, 2016.
[23] S. Boyd, N. Parikh, E. Chu, Distributed optimization     [31] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara,
     and statistical learning via the alternating direction        Variational autoencoders for collaborative filtering,
     method of multipliers, 2011.                                  in: WWW, 2018.
[24] N. Parikh, S. Boyd, Proximal algorithms, Founda-         [32] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang,
     tions and Trends in optimization (2014).                      Lightgcn: Simplifying and powering graph convolu-
[25] N. Angelopoulos, J. Cussens, Exploiting informa-              tion network for recommendation, in: SIGIR, 2020,
     tive priors for bayesian classification and regression        pp. 639–648.
     trees., in: IJCAI, 2005.                                 [33] Y. Yang, L. Wu, R. Hong, K. Zhang, M. Wang, En-
[26] J. Ding, Y. Quan, Q. Yao, Y. Li, D. Jin, Simplify and         hanced graph learning for collaborative filtering
     robustify negative sampling for implicit collabora-           via mutual information maximization, in: SIGIR,
     tive filtering (2020).                                        2021, pp. 71–80.
[27] Z.-H. Deng, L. Huang, C.-D. Wang, J.-H. Lai, S. Y.       [34] Y. Gao, Y. Du, Y. Hu, L. Chen, X. Zhu, Z. Fang,
     Philip, Deepcf: A unified framework of represen-              B. Zheng, Self-guided learning to denoise for robust
     tation learning and matching function learning in             recommendation, in: SIGIR, 2022.
     recommender system, in: AAAI, volume 33, 2019,           [35] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang,
     pp. 61–68.                                                    Y. Xu, M. Coates, Non-parametric graph learning
[28] G. Wu, M. Volkovs, C. L. Soon, S. Sanner, H. Rai,             for bayesian graph neural networks, PMLR (2020).
     Noise contrastive estimation for one-class collabo-