<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Partial Relaxed Optimal Transport for Denoised Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yanchao Tan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carl Yang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Wei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziyue Wu</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiming Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaolin Zheng</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University</institution>
          ,
          <addr-line>Fuzhou 350116</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The College of Computer Science, Zhejiang University</institution>
          ,
          <addr-line>Hangzhou 310027</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The College of Computer and Data Science, Fuzhou University</institution>
          ,
          <addr-line>Fuzhou 350116</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The Department of Computer Science, Emory University</institution>
          ,
<addr-line>Atlanta 30322, United States</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>The School of Management, Zhejiang University</institution>
          ,
          <addr-line>Hangzhou 310058</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The interaction data used by recommender systems (RSs) inevitably include noise resulting from mistaken or exploratory clicks, especially under implicit feedback. Without proper denoising, RS models cannot effectively capture users' intrinsic preferences or the true interactions between users and items. To address such noise, existing methods mostly rely on auxiliary data, which are not always available. In this work, we ground on Optimal Transport (OT) to globally match a user embedding space and an item embedding space, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions without supervision. Specifically, we first leverage the OT framework via the Sinkhorn distance to compute continuous many-to-many user-item matching scores. Then, we relax the regularization in the Sinkhorn distance to achieve a closed-form solution with a reduced time complexity. Finally, to consider individual user behaviors during denoising, we develop a partial OT framework that adaptively relabels user-item interactions through a personalized thresholding mechanism. Extensive experiments show that our framework can significantly boost the performance of existing RS models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>With the rapid growth of various activities on the Web, recommender systems (RSs) have become fundamental in helping users alleviate the problem of information overload. However, users can click some items by mistake or out of curiosity, and many RSs will also recommend some less relevant items for exploration every now and then. Take movie watching for example. Since a user cannot always distinguish horrible movies from romantic ones by movie names, he/she can easily click a horrible movie by mistake among the many romantic movies he/she usually watches. In this case, a traditional method like CF cannot effectively reduce the probability of recommending the horrible movie unless the noisy clicks are extremely rare, which may further seduce even more noisy clicks.</p>
      <p>Recently, to get out of such mistakes, existing research discovers users' intrinsic preferences with the help of external user behaviors [1], auxiliary item features [2] or extra feedback [3]. A key limitation is their reliance on auxiliary data, which costs significant effort to obtain and is not always attainable in RSs. Without supervision, can we develop a framework to automatically denoise user-item interactions for more accurate recommendations?</p>
      <p>In this work, we approach the denoised recommendation problem in two steps: 1. distinguishing intrinsic and noisy interactions; 2. stressing intrinsic preferences and reducing noisy ones for recommendation. Inspired by the principle of least modeling effort in an unsupervised fashion [4, 5], we refer to noises as minor abnormal data that need higher modeling effort than the intrinsic ones. Based on the above rationale, we propose to distinguish noises by ranking the global matching cost between the user and item embedding spaces, and to flexibly integrate this with both non-deep and deep RS methods. In this way, the key to distinguishing noises boils down to finding a global matching matrix with the minimum matching cost. Subsequently, we advocate a principled denoised RS framework and calculate the matching matrix grounding on Optimal Transport (OT), which has been introduced to address the unsupervised matching problem between two distributions or spaces in many fields [6, 7]. By minimizing the overall cost of the mismatched user-item pairs, OT finds a global optimal matching matrix, whose values represent the matching cost (i.e., modeling effort): a low user-item matching cost indicates a user's intrinsic preferences, while a high value indicates noisy ones.</p>
      <p>However, to apply OT to achieve denoised recommendation, three problems still remain: 1. Unlike the one-hot constraint in previous applications of OT, the nature of RSs requires a many-to-many matching; 2. The large number of users and items makes the matching process time-consuming; 3. A mechanism is needed to properly leverage the learned matching matrix and finally eliminate the effect of noises for denoised recommendation.</p>
      <p>DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. * Corresponding author. 21831010@zju.edu.cn (W. Liu); xlzheng@zju.edu.cn (X. Zheng)</p>
<p>To address these challenges, we first leverage the OT framework via the Sinkhorn distance [8]. Unlike a one-hot matching matrix that satisfies the requirements of many widely studied tasks (e.g., transfer learning [6, 9]), we compute a continuous many-to-many approximation to the discrete OT, so as to meet the nature of RSs. Furthermore, we propose to relax the regularization in Sinkhorn to achieve a closed-form solution, which reduces the time complexity for large-scale data from O(max(M, N)^3) to O(max(M, N)^2). Then, we rank the learned continuous matching cost to differentiate intrinsic and noisy interactions for each user. Finally, we take into account that individual users' behaviors can vary in RSs, which should be handled differently during denoising. To this end, we design a personalized thresholding mechanism for relabeling user-item interactions, which efficiently finds a splitting point for each individual user according to the ranked matching cost.</p>
      <p>Experiment results on one real-world dataset with synthesized noises demonstrate our proposed ProRec's ability of denoising under various levels of noises. Further experiments on three original real-world datasets also verify the advantages of ProRec, demonstrating significant performance boosts upon multiple popular RS models in terms of effectiveness and efficiency.</p>
      <p>OT aims to find a matching between two distributions or spaces, and has been most widely used in transfer learning: it can match one instance in the source domain to another in the target domain with the least cost. Existing works mainly study discrete OT for one-to-one matching. For example, in computer vision, OT has been applied to domain adaptation [4, 6] and style transfer [16]. OT has also been successfully applied in many natural language processing tasks, such as topic modeling [7, 17], text generation [18], and sequence-to-sequence learning [19]. Considering the effect of non-matched pairs, [9] proposed a partial one-to-one matching that leverages one shared threshold to address domain shift for domain adaptation, instead of taking all pairs into account.</p>
      <p>However, such discrete one-to-one matching cannot satisfy the nature of many-to-many user-item relationships in RSs, where each user can interact with many items and each item can interact with many users. Though OT is applied to the cold-start recommendation problem in [20], they directly measure user-item similarity without taking the effect of noises into consideration. Moreover, partial matching with a shared threshold cannot reflect individual user behaviors, and thus leads to suboptimal recommendation performance.</p>
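<p>To make the continuous many-to-many matching concrete, the following is a minimal sketch (not the authors' implementation) of entropy-regularized OT via Sinkhorn iterations; the toy cost matrix, the uniform marginals, and the entropy weight <italic>eps</italic> are assumptions for illustration:</p>

```python
import numpy as np

def sinkhorn(M_cost, a, b, eps=0.5, n_iters=500):
    """Entropy-regularized OT: returns a dense (many-to-many) transport
    plan whose row sums match a and whose column sums approximate b."""
    K = np.exp(-M_cost / eps)          # kernel derived from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):           # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
M_cost = rng.random((4, 6))            # toy user-item matching costs
a = np.full(4, 1 / 4)                  # user-side marginal
b = np.full(6, 1 / 6)                  # item-side marginal
T = sinkhorn(M_cost, a, b)
```

<p>Unlike a one-hot assignment, every user spreads transport mass over several items, which is exactly the many-to-many behavior required by RSs.</p>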
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
<p>Many research studies have pointed out that there exists a large proportion of noisy interactions in implicit feedback (i.e., clicks) [10, 1, 11]. Such feedback is easily affected by different factors, such as position bias [12], caption bias [13] and exposure bias [10, 14]. Therefore, there exist large gaps between the implicit feedback and the actual user preferences, which are detrimental to users' continuous behaviors and overall satisfaction [2, 5].</p>
      <p>In this section, we detail the proposed Partial relaxed optimal Transport for the denoised Recommendation (ProRec) framework, as shown in Figure 1. Specifically, we first formulate denoised recommendation as a two-step process. Then, we introduce the Optimal Transport (OT) framework for distinguishing noises through global user-item matching. Furthermore, we elaborate on our proposed ProRec framework, which can be applied to achieve denoised recommendation.</p>
<p>To alleviate the above issues, many researchers have considered incorporating auxiliary feedback into the identification of noises in implicit feedback, such as dwell time [1] and scroll intervals [13]. These works usually design features manually and label items with external help (e.g., from domain experts) [13, 2], which requires extensive effort and cost and is not always available across RSs. To this end, sheer observations on training losses have recently been explored to denoise implicit feedback during training in an unsupervised fashion [5]; however, that method is purely heuristic and can be integrated with deep learning methods only. Based on the principle of least modeling effort in different fields [4, 15], our proposed framework aims to denoise the user-item interactions without supervision, which is supported by Optimal Transport (OT) theory with guaranteed optimality in the global matching and can be flexibly integrated with both deep and non-deep RS methods.</p>
      <p>3.1. Problem Statement</p>
      <p>Following the setting of recommendation, we denote the user-item interaction matrix as X ∈ {0, 1}^{M×N}, where M denotes the number of users and N denotes the number of items. We construct user-item relationships across a user embedding space U and an item embedding space V. The problem is that the user-item pairs contain some noisy interactions which need to be relabeled.</p>
      <p>In this work, we approach denoised recommendation through two steps. Firstly, we distinguish intrinsic and noisy interactions across U and V by a global matching matrix T*. Then, to stress users' intrinsic preferences and reduce noises, we rank the learned T* and keep only the top k interactions of each user via a personalized thresholding mechanism. In this way, we learn a relabeled matrix X̃ ∈ R^{M×N} for denoised recommendation. (Figure 1 illustrates the interaction matrix X, users' intrinsic preferences, and the relabeled data for example users.)</p>
<sec id="sec-3-2">
        <title>3.2. Global Matching for Distinguishing Noises</title>
<p>Here 1_M ∈ R^{1×M} and 1_N ∈ R^{1×N} are vectors with all elements equal to 1, M is the number of users, N is the number of items, and T* denotes the optimal transport plan of Eq. 1.</p>
<p>Specifically, the matching cost matrix ℳ directly reflects the modeling effort: a low user-item matching cost indicates users' intrinsic preferences, while high values indicate noises. Based on the definition of ℳ, we can integrate the denoising process with both non-deep and deep RS models, as in Eqs. 2 and 3.</p>
        <sec id="sec-3-2-1">
<p>As motivated in Section 1, a framework that can denoise user-item interactions for more accurate recommendation is in great need. However, without supervision, it is hard to define noisy interactions and then stress intrinsic preferences while reducing noisy ones.</p>
          <p>To distinguish noises in an unsupervised fashion, several works [4, 15, 21, 5] have pointed out a principled way to locate noises: finding the data that needs the most modeling effort or cost. Inspired by this principle of least modeling effort, in this work we refer to the noises as minor abnormal data. Compared with the interactions from intrinsic preferences, noises take much more effort or cost to model.</p>
          <p>ℒ(U, V) = ||X − UV^⊤||² + λ(||U||₂² + ||V||₂²), (2)</p>
          <p>ℒ(U, V) = BCE(X, f(U, V)) + λ(||U||₂² + ||V||₂²), (3)</p>
          <p>where U ∈ R^{M×D} and V ∈ R^{N×D} represent the user and item embeddings, f denotes a classification network, and the hyperparameter λ balances the loss and the regularization term. In non-deep models, Eq. 2 can be solved effectively through many methods (e.g., Alternating Least Squares). Since the prediction scores represent the user-item similarities, we calculate the cost matrix as ℳ = −UV^⊤ using the inner product, so that ℳ has negative values. In deep models, we can optimize Eq. 3 by adopting the widely used binary cross-entropy loss (i.e., BCE(·, ·)), and the analogous calculation gives ℳ = −f(U, V).</p>
<p>Based on the above least-effort rationale, we propose a novel denoising framework to distinguish noises by ranking the global matching between the user and item embedding spaces. The framework serves as a plug-in, so as to be flexibly integrated into both non-deep and deep RS methods. In this scenario, the key to distinguishing noises boils down to finding a global user-item matching across the user space U and the item space V with minimal matching cost.</p>
<p>To address the unsupervised matching problem between two spaces, we advocate a principled denoised RS framework grounded on Optimal Transport (OT), which has been successfully applied in many fields such as CV [6] and NLP [7]. As studied theoretically in [22], the OT formulation derived from the Earth Mover's Distance (EMD) can be expressed as</p>
          <p>min_{T ∈ Π(a, b)} Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij} T_{ij},  s.t.  T 1_N^⊤ = a^⊤,  1_M T = b,  (1)</p>
          <p>where Π(a, b) denotes the joint probability distribution of the user embedding space U and the item embedding space V, and the cost ℳ_{ij} = −f(u_i, v_j) measures the mismatch between user i and item j.</p>
          <p>However, there are three technical challenges when
applying OT to denoised recommendation:
Challenge 1. In RSs, each user can interact with many
items and each item can interact with many users, which
corresponds to a many-to-many interaction matrix. Unlike
previous applications of OT in CV or NLP, directly
applying the one-hot constraint specified in Eq. 1 is
counterintuitive for recommendation.</p>
          <p>Challenge 2. In RSs, the number of users and items in
the real world tends to be large, making the matching
process time-consuming.</p>
<p>Challenge 3. Besides distinguishing noises, it is necessary to design a mechanism to eliminate the effect of noisy interactions. A naïve way is to set one global thresholding hyperparameter k to keep, for each user, the top k items with the least matching cost. However, a global thresholding hyperparameter cannot accommodate individual user behaviors.</p>
          <p>To meet the many-to-many nature of RSs, we soften the discrete formulation in Eq. 1 with an entropy term and obtain Eq. 4:</p>
          <p>min_T Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij} T_{ij} + εH(T),  s.t.  T 1_N^⊤ = a^⊤,  1_M T = b,  (4)</p>
          <p>where H(T) = Σ_{i=1}^{M} Σ_{j=1}^{N} T_{ij}(log(T_{ij}) − 1), 1_M ∈ R^{1×M} and 1_N ∈ R^{1×N} are vectors with all elements equal to 1, and ε is a hyperparameter to balance the entropy regularization. Using the Lagrange multiplier method, the stationarity condition is ∂ℒ/∂T_{ij} = ℳ_{ij} + ε log T_{ij} − f_i − g_j = 0, where f and g are the Lagrangian multipliers. In this case, we can obtain the solution of T in closed form.</p>
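<p>The stationarity condition can be sanity-checked numerically: solving ℳ_ij + ε log T_ij − f_i − g_j = 0 for T gives T_ij = exp((f_i + g_j − ℳ_ij)/ε). A toy check with arbitrary multipliers (all values below are assumptions, not learned quantities):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
M_cost = rng.random((3, 4))              # toy cost matrix
f, g = rng.random(3), rng.random(4)      # toy Lagrangian multipliers

# Solve the first-order condition for T in closed form
T = np.exp((f[:, None] + g[None, :] - M_cost) / eps)

# The gradient of the Lagrangian w.r.t. each T_ij should vanish
grad = M_cost + eps * np.log(T) - f[:, None] - g[None, :]
```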
        </sec>
        <sec id="sec-3-2-2">
<p>Then, in order to find the optimal solution, we let the gradient ∂ℒ/∂T = 0. Based on Eq. 6 and using the Lagrange multiplier method, the Lagrangian ℒ augments the objective Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij}T_{ij} + εH(T) with the marginal-constraint terms, which leads to the closed-form solution in Eq. 7, where ⊘ corresponds to element-wise division.</p>
        </sec>
        <sec id="sec-3-2-3">
<title>3.3. The ProRec Framework</title>
          <p>In this section, we propose the Partial relaxed optimal Transport for denoised Recommendation (ProRec) framework to address the above three challenges, building on the closed-form solution for the global optimal matching matrix. Given each user's personalized threshold, we relabel the interactions through a monotonically increasing function of the gap between an item's matching score and the user's threshold value, controlled by a hyperparameter.</p>
        </sec>
        <sec id="sec-3-2-4">
<p>Finally, to make the relabeling process more flexible, we propose to use a hyperparameter α to control the degree of relabeling and obtain a new interaction matrix.</p>
          <p>Following the same derivation, μ₁ represents the mean value of the top s items, while μ₂ represents the mean of the rest n − s items. After obtaining the index of the splitting point for each user, we have the corresponding relabeled interactions in Eq. 20.</p>
          <p>For real data experiments, we use Amazon music (AMusic)1, Amazon toys (AToy)1, and Yahoo2. These datasets have been widely adopted in previous literature [27, 5, 28], and their statistics are summarized in Table 1. We do a 5:2:3 splitting for them following [28]. We evaluate the ability to distinguish noises by Hit Ratio, and the recommendation performance by normalized discounted cumulative gain (NDCG), mean average precision (MAP), and Recall.</p>
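<p>For reference, the ranking metrics used here can be sketched as follows (a minimal per-user version; tie-breaking and truncation conventions vary across papers):</p>

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal

ranked = ["i1", "i2", "i3"]        # toy ranking for one user
relevant = {"i1", "i3"}            # toy ground-truth items
r2 = recall_at_k(ranked, relevant, 2)
n2 = ndcg_at_k(ranked, relevant, 2)
```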
        </sec>
<sec id="sec-3-2-5">
          <p>1: https://github.com/familyld/DeepCF/tree/master/Data. 2: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r</p>
        </sec>
        <sec id="sec-3-2-6">
<p>Note that the closed-form solution yields a time complexity of O(max(M, N)²). This shows that, by relaxing the OT formulation to keep the auxiliary limitation from only one side, our model achieves a lower time complexity than the O(max(M, N)³) reported in [6], which uses OT for one-to-one matching.</p>
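<p>A sketch of the one-sided relaxation follows. The kernel transform K = exp(−ℳ/ε) and the exact max-combination are assumptions mirroring the closed form in Algorithm 1: each argument of the max enforces one marginal exactly, and only a few matrix products are needed, hence the O(max(M, N)²) cost:</p>

```python
import numpy as np

def relaxed_ot(M_cost, a, b, eps=0.5):
    """Closed-form relaxed plan (no Sinkhorn iterations):
    T* = max(K diag(b / colsum(K)), diag(a / rowsum(K)) K)."""
    K = np.exp(-M_cost / eps)                 # assumed kernel transform
    col = K * (b / K.sum(axis=0))[None, :]    # column sums equal b exactly
    row = K * (a / K.sum(axis=1))[:, None]    # row sums equal a exactly
    return np.maximum(col, row)

rng = np.random.default_rng(4)
M_cost = rng.random((5, 7))
a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)
T_star = relaxed_ot(M_cost, a, b)
```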
<p>Personalized thresholding mechanism for denoised recommendation. To flexibly discriminate the intrinsic and noisy user-item interactions for individual users, we propose a non-parametric personalized thresholding mechanism. We first normalize the matching cost for each user u as T*′_{u,i} = T*_{u,i} / Σ_j T*_{u,j}, and rank the scores as T̂_u = sort(T*′_u). According to the definition of the matching matrix, each row represents a user's preferences towards the items. We define k = {k_1, . . . , k_M} as the indices of the thresholds which can filter out users' noisy preferences, where k_u denotes user u's threshold. As shown in Figure 1, some users with consistent (narrow) interests would benefit from a reasonably low threshold, since a few items can already represent their preferences and more items can introduce more noises.</p>
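<p>The per-user normalization and ranking step can be sketched as follows (a toy plan is assumed in place of a learned T*):</p>

```python
import numpy as np

rng = np.random.default_rng(5)
T_star = rng.random((4, 6))                          # toy matching matrix
T_norm = T_star / T_star.sum(axis=1, keepdims=True)  # normalize per user
ranked = np.sort(T_norm, axis=1)[:, ::-1]            # scores sorted descending
```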
        </sec>
        <sec id="sec-3-2-7">
<p>Meanwhile, some users with diverse (wide) interests would require the threshold to be relatively high, to model a broader range of preferences and satisfy the need for exploration. To find a personalized splitting point k_u for each user according to the learned matrix, the model requires an automatic splitting mechanism.</p>
        </sec>
        <sec id="sec-3-2-8">
<p>Inspired by the threshold selection mechanism used in CART [25], we efficiently learn the personalized thresholds k_u for different users by minimizing the following sum of square errors:</p>
          <p>k_u = arg min_s Σ_{i=1}^{s} (T̂_{u,i} − μ₁)² + Σ_{i=s+1}^{n} (T̂_{u,i} − μ₂)², (20)</p>
          <p>where T̂_{u,i} is the i-th value of the sorted T̂_u and the index s ∈ {1, 2, · · · , n − 1}. μ₁ = (1/s) Σ_{i=1}^{s} T̂_{u,i} represents the mean value of the top s items, while μ₂ represents the mean of the rest n − s items. After obtaining the indices of the splitting points k, we relabel the user-item interactions and update the final matrix as X̃ = αX̂ + (1 − α) ⊙ X (Eq. 22), where ⊙ corresponds to the element-wise product. Considering that the noises in sparse positive interactions would induce worse effects in recommendation [5], in this work we only target the part of the original dataset where X = 1. In this way, the newly generated X̃ can keep users' intrinsic preferences and reduce the noisy ones. We summarize the learning procedure of our proposed ProRec in Algorithm 1, which is proved to guarantee local convergence for denoised recommendation.</p>
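<p>The SSE-based splitting point of Eq. 20 can be sketched with a brute-force search over candidate split indices (a per-user helper; the function name is an assumption):</p>

```python
import numpy as np

def split_point(sorted_scores):
    """Return the index minimizing the two-segment sum of squared errors,
    in the spirit of CART's threshold search (Eq. 20)."""
    t = np.asarray(sorted_scores, dtype=float)
    best_k, best_sse = 1, np.inf
    for k in range(1, len(t)):
        left, right = t[:k], t[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k

# Two clearly separated score groups: the split lands between them
k_u = split_point([0.9, 0.85, 0.8, 0.1, 0.05])
```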
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
<p>In this section, we first conduct controlled experiments with synthetic noises on one dataset, so as to investigate ProRec's performance under different levels of noisy interactions. Then, we evaluate ProRec's performance on three original real-world datasets of recommendation with implicit feedback.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Settings</title>
        <p>4.1.1. Dataset and evaluation protocols.</p>
<p>We use ML-100k for the noise-controlled experiments and do a 4:1 random splitting for train/test data following [26].</p>
        <sec id="sec-4-1-1">
          <p>To simulate the noise, we randomly select 5%/10%/15%/20% of ground-truth records and flip their labels in the train set. The selected records can be regarded as noises during training.</p>
        </sec>
        <sec id="sec-4-1-2">
<title>Algorithm 1: Partial Relaxed Optimal Transport for Denoised Recommendation (ProRec)</title>
          <p>Data: the noisy user-item interactions X, user popularity a and item popularity b. Result: denoised user embeddings U and item embeddings V, and a relabeled matrix X̃.</p>
          <p>1 while not converged do: 2 Step 1, update U and V: 3 if learning U and V via non-deep methods, update U and V via ||X − UV^⊤||² + λ(||U||₂² + ||V||₂²) in Eq. 2 and calculate ℳ = −UV^⊤; 4 else update U and V via BCE(X, f(U, V)) + λ(||U||₂² + ||V||₂²) in Eq. 3, where BCE(·, ·) is the binary cross-entropy loss and f is a classification network, then calculate ℳ = −f(U, V); 5 Step 2, update X̃: model the denoised recommendation problem by minimizing Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij}T_{ij} + εH(T), where T 1_N^⊤ = a^⊤ and 1_M T = b, in Eq. 4; compute the global optimal matching matrix T* by the closed-form solution T* = max(ℳ̃ diag(b ⊘ 1_M ℳ̃), diag(a ⊘ ℳ̃1_N^⊤) ℳ̃) in Eq. 7; compute personalized thresholds k by the automatic splitting mechanism in Eq. 20; relabel the user-item interactions by sorting T* in Eq. 21; update X̃ by X̃ = αX̂ + (1 − α) ⊙ X in Eq. 22.</p>
          <p>4.1.2. Methods for comparison</p>
          <p>Further deep learning baselines include: VAECF [31] – Variational Autoencoder for Collaborative Filtering, one of the state-of-the-art deep-learning-based recommendation algorithms [28]; LightGCN [32] – devises a light graph convolution for training efficiency and generation ability in recommendation scenarios; EGLN [33] – enhances the graph learning and node embedding modules iteratively based on mutual information maximization.</p>
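<p>One sweep of Algorithm 1 can be sketched end-to-end as follows. This is a toy stand-in, not the paper's exact procedure: the sizes, the smoothed popularity marginals, the Sinkhorn loop in place of the closed form, and the naive per-user threshold rule are all assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
X = (rng.random((6, 8)) < 0.4).astype(float)          # noisy interactions
U = 0.1 * rng.standard_normal((6, 3))
V = 0.1 * rng.standard_normal((8, 3))
lam, lr, eps, alpha = 0.01, 0.01, 0.5, 0.5

for _ in range(200):                                   # Step 1: update U, V (Eq. 2)
    E = U @ V.T - X
    gU = 2 * E @ V + 2 * lam * U
    gV = 2 * E.T @ U + 2 * lam * V
    U -= lr * gU
    V -= lr * gV

M_cost = -U @ V.T                                      # cost from embeddings
K = np.exp(-M_cost / eps)
a = (X.sum(axis=1) + 1) / (X.sum() + 6)                # smoothed user popularity
b = (X.sum(axis=0) + 1) / (X.sum() + 8)                # smoothed item popularity
u = np.ones(6)
for _ in range(300):                                   # Step 2: matching plan
    v = b / (K.T @ u)
    u = a / (K @ v)
T = u[:, None] * K * v[None, :]

X_hat = np.zeros_like(X)                               # keep top plan mass per user
for i in range(6):
    k = max(1, int(X[i].sum()))                        # naive per-user threshold
    X_hat[i, np.argsort(-T[i])[:k]] = 1.0

X_new = np.where(X == 1, alpha * X_hat + (1 - alpha) * X, X)   # Eq. 22-style relabel
```

<p>Only observed positives (X = 1) are touched, matching the paper's choice to relabel within the positive part of the data.</p>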
<p>Besides the above methods, T-CE [5] and SGDL [34] are existing frameworks for denoised recommendation. The first discards the large-loss samples with a dynamic threshold in each iteration, and the second collects memorized interactions at the early stage of training and leverages those data as denoising signals to guide the following training. Both are designed as plug-ins for deep learning methods. We add CDAE + T-CE and LightGCN + SGDL as baselines to compare the denoising ability with the proposed ProRec, as these combinations were shown to achieve the best performance in their original works [5, 34].</p>
<p>For all the algorithms (except for CDAE + T-CE and LightGCN + SGDL), we add the proposed ProRec on top of them, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions in an unsupervised fashion. Specifically, we have SVD + ProRec, NCE + ProRec, CDAE + ProRec, WRMF + ProRec, VAECF + ProRec, LightGCN + ProRec, and EGLN + ProRec, respectively.</p>
          <p>4.1.3. Implementation details</p>
        </sec>
        <sec id="sec-4-1-3">
<p>Implementations of the compared baselines are from open-source projects (SVD3, NCE3, CDAE3, WRMF3, VAE-CF3, LightGCN4, EGLN5, T-CE6 and SGDL7). We optimize ProRec with standard Adam and tune all hyperparameters through grid search: the learning rate in {0.0005, 0.001, 0.005, 0.01}, α in {0.25, 0.5, 0.75, 1.0}, ε in {0.05, 0.075, 0.1, 0.125, 0.15, 0.175}, k in {1, 5, 10, 20, 50}, λ in {0.0005, 0.001, 0.005, 0.01}, and the batch size in {100, 250, 500, 1000} for different datasets. We also carefully tuned the hyperparameters of all baselines through cross-validation to achieve their best performance.</p>
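<p>For scale, the full grid over the stated ranges (the symbols α, ε, k, λ are this section's labels for the four tuned hyperparameters) has a fixed size:</p>

```python
from itertools import product

learning_rates = [0.0005, 0.001, 0.005, 0.01]
alphas = [0.25, 0.5, 0.75, 1.0]
eps_values = [0.05, 0.075, 0.1, 0.125, 0.15, 0.175]
ks = [1, 5, 10, 20, 50]
lambdas = [0.0005, 0.001, 0.005, 0.01]
batch_sizes = [100, 250, 500, 1000]

# Exhaustive grid over all stated ranges
grid = list(product(learning_rates, alphas, eps_values, ks, lambdas, batch_sizes))
```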
        </sec>
        <sec id="sec-4-1-4">
<p>We compare two types of algorithms for recommendation. The first type is non-deep algorithms, which include: SVD [29] – a similarity-based method that constructs a similarity matrix through SVD decomposition of the implicit matrix X; NCE [28] – noise-contrastive-estimation item embeddings combined with a personalized user model based on PLRec [30]. The second type is deep learning methods, including: CDAE [10] – Collaborative Denoising Autoencoder, a deep learning method specifically optimized for implicit feedback recommendation tasks; WRMF [10] – Weighted Regularized Matrix Factorization, which is a deep MF framework.</p>
          <p>3: https://github.com/wuga214/NCE_Projected_LRec. 4: https://github.com/gusye1234/LightGCN-PyTorch. 5: https://github.com/yimutianyang/SIGIR2021-EGLN. 6: https://github.com/WenjieWWJ/DenoisingRec. 7: https://github.com/zealscott/SGDL</p>
        </sec>
      </sec>
<sec id="sec-4-3">
        <title>4.2. Controlled Experiments (Synthetic Noises)</title>
<sec id="sec-4-3-1">
          <p>Before looking at the performance of recommendation, we first inspect ProRec's ability to distinguish noises.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>We compare the recommendation results of the proposed</title>
          <p>ProRec to those of the baseline models. Table 2 shows
the NDCG, MAP, and Recall scores on three datasets. We
Since we know the index of the randomly added noises in have the following observations.
the dataset, we evaluate the level of distinguished noises In three diferent domains, by integrating ProRec with
by Hit Ratio (HR) of the real added noises in Figure 2(a). both non-deep and deep methods, we can consistently
imSpecifically, we observe a consistent high HR of NCE + prove strong baseline models and achieve start-of-the-art
ProRec around 90% when the level of noises increases. performance. Since the graph-based model can leverage
Furthermore, we test Recall of NCE and NCE + ProRec high-order relations among users and items, EGLN
outunder diferent levels of noises. It can be clearly observed performs all baselines on three datasets. Furthermore, our
that NCE + ProRec always performs better than NCE, as proposed EGLN + ProRec outperforms all deep-learning
the former can accurately relabel some noises to elimi- baselines on three datasets, ranging from 3.34% (achieved
nate the efect of noises. For example, when the noise on AToy on Recall@5) to 6.38% (achieved on Yahoo on
level is at 20%, NCE + ProRec significantly improves NCE Recall@5), while NCE + ProRec outperforms all non
deepfrom 0.1579 to 0.1596. learning baselines on three datasets, ranging from 1.91%
(achieved on AToy on MAP@5) to 7.97% (achieved on
4.3. Real Data Experiments AMusic on Recall@5). All of these show that ProRec is
capable of efective recommendation on diferent datasets.
4.3.1. Overall Performance Comparison One step further, we observe that the methods based
on the proposed ProRec framework all outperform the
original models, which indicates the advantage of the
denoising process. Compared with existing denoising</p>
<p>frameworks (i.e., T-CE and SGDL), ProRec can not only be flexibly integrated with both deep and non-deep RS methods but also gain more improvements.</p>
          <p>Moreover, on top of existing RS methods, our proposed ProRec in Eq. 4 does not introduce significant computations. In the third and sixth columns of Table 3, we compare the models' training time under the same experimental setting (e.g., embedding dimension). As can be observed, the training times of the proposed NCE + ProRec and LightGCN + ProRec are within the same scale as those of NCE and LightGCN on the AMusic dataset.</p>
          <p>Note that the improvements brought by the denoising process on the denser datasets (e.g., the performance gains between the 0% and 5% noise levels on ML-100k shown in Figure 2(a)) are smaller than those on the sparse datasets (e.g., the performance gains between EGLN + ProRec and EGLN on Yahoo in Table 2). Although we can explicitly eliminate some effects of noisy interactions (as shown in Figure 2(a)), they have only a minor impact on the learned embeddings when there are enough interactions in the observed data, which is consistent with the results in recent studies [35].</p>
          <p>Table 3: Impact of different model components of ProRec on AMusic.
Method     N@5     R@5     Time  |  Method     N@5     R@5     Time
NCE        0.0375  0.0276   64s  |  EGLN       0.0422  0.0386   82s
+EMD       0.0360  0.0259   98s  |  +EMD       0.0406  0.0371  124s
+Sinkhorn  0.0384  0.0281  132s  |  +Sinkhorn  0.0431  0.0398  174s
+ROT       0.0383  0.0281   76s  |  +ROT       0.0430  0.0392  109s
+ProRec    0.0396  0.0298   82s  |  +ProRec    0.0445  0.0406  118s</p>
          <p>4.3.2. Model Ablation and Hyperparameter Study</p>
          <p>In this section, three groups of ablation studies are conducted to evaluate the impacts of our proposed many-to-many matching, relaxed regularization, and personalized thresholding mechanisms, as well as the effects of the hyperparameters.</p>
          <p>The effect of continuous many-to-many matching. Compared with the many-to-many matching of NCE + Sinkhorn, it is ineffective to directly adopt one-to-one matching (i.e., NCE + EMD) due to the removal of many matched interactions. For example, NCE + Sinkhorn significantly improves NCE from 0.0276 to 0.0281 (on Recall@5 on AMusic), while both metrics of NCE + EMD are lower than those of the original NCE.</p>
          <p>The efficiency of relaxed regularization. By balancing performance and runtime via relaxing the optimized constraints, NCE + ROT reduces the runtime by around half while keeping almost the same NDCG and Recall on AMusic and Yahoo.</p>
          <p>The effect of the personalized thresholding mechanism. We first use one global threshold for ROT to study the effectiveness of the personalized thresholding mechanism. In the experiment, we search the global threshold over {5, 10, 15, 20, 25} and find that NCE + ROT achieves the best performance when the threshold is 10. Compared with NCE + ROT, NCE + ProRec can individually discriminate intrinsic preferences and noise without tuning the threshold hyperparameter, and thus further improves the performance of the partial OT framework. For example, NCE + ProRec achieves 0.0298 on Recall@5 on the AMusic dataset, which outperforms the 0.0281 of NCE + ROT.</p>
          <p>The effect of hyperparameters. Our proposed ProRec framework introduces four hyperparameters, two of which we empirically set to 20 and 0.001; of the remaining two, one controls the degree of the relabeling and the other controls the sparsity of the matching matrix. Taking NCE + ProRec as an example (shown in Figures 2(b)-2(c)), their optimal values on AMusic are found to be 0.5 and 0.1, respectively. In general, we can set the relabeling parameter to 0.5 for denser datasets and 0.25 for sparser datasets, and always set the sparsity parameter to 0.1.</p>
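The continuous many-to-many matching compared above builds on entropic optimal transport [8]. As a generic sketch only (not the paper's exact formulation; the cost matrix, marginals, and regularization strength here are toy assumptions), Sinkhorn scaling produces a dense transport plan in which every user spreads mass over several items, in contrast to the one-to-one assignments of EMD:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iters=500):
    """Entropic-regularized OT (Cuturi, 2013): returns a dense
    many-to-many plan P with row marginals a and column marginals b."""
    K = np.exp(-C / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # rescale columns toward b
        u = a / (K @ v)               # rescale rows toward a
    return u[:, None] * K * v[None, :]

# Toy example: 3 "users" matched to 4 "items" with random costs.
rng = np.random.default_rng(0)
C = rng.random((3, 4))
a = np.full(3, 1 / 3)                 # uniform user marginal
b = np.full(4, 1 / 4)                 # uniform item marginal
P = sinkhorn(C, a, b)

# Every entry of P is strictly positive: each user is softly matched
# to all items (many-to-many), unlike a one-to-one EMD assignment.
print(np.allclose(P.sum(axis=1), a), np.allclose(P.sum(axis=0), b))
```

Lowering `eps` sharpens the plan toward one-to-one matchings, which is the trade-off the NCE + EMD vs. NCE + Sinkhorn comparison above probes.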
        </sec>
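The contrast between a global threshold and the personalized thresholding mechanism can be made concrete with a toy sketch (the matrix values and the mean-based per-user cutoff are illustrative assumptions, not the paper's mechanism):

```python
import numpy as np

# Toy matching scores for 2 users over 4 items.
P = np.array([
    [0.50, 0.30, 0.15, 0.05],   # user 0: two strong matches, one clear outlier
    [0.28, 0.25, 0.24, 0.23],   # user 1: nearly uniform preferences
])

tau = 2                                          # global threshold: top-2 per user
global_keep = np.argsort(-P, axis=1)[:, :tau]    # same count for every user

# "Personalized" cutoff (illustrative): keep items scoring above
# that user's own mean score, so the kept count adapts per user.
personal_keep = [np.where(row > row.mean())[0] for row in P]

print(global_keep.tolist())                  # [[0, 1], [0, 1]]
print([k.tolist() for k in personal_keep])   # [[0, 1], [0]]
```

The global rule retains exactly `tau` items for everyone, while the per-user rule keeps two items for user 0 but only one for user 1, mirroring the observation above that different numbers of items are replaced for different users.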
<sec id="sec-4-3-3">
          <title>4.3.3. Denoising Case Study</title>
          <p>To demonstrate the advantages of the denoising process, we visualize the interaction matrix given by ML-100k on the left and the learned matrix of the proposed NCE + ProRec on the right (as shown in Figure 3). The numbers in the boxes are actual movie IDs. The colors in the original interaction matrix (left) are all the same, indicating uniform weights, whereas those in the relabeled matrix (right) are different. The different depths of blue represent the matching cost (the deeper, the lower) and white boxes denote the intrinsically preferred items. As we can see from the different colors, our many-to-many matching matrix effectively discriminates intrinsic and noisy interactions. Moreover, we can also observe the personalized thresholding mechanism at work, as different numbers of items are replaced for different users.</p>
          <p>Furthermore, we show several movies relevant to user 0. Since most of user 0's interacted movies are from the romance category, such as Titanic and Shakespeare in Love, the horror movie is abnormal and is unlikely to reflect her/his intrinsic preference. NCE + ProRec automatically learns to highlight the two romantic movies. After down-weighing Friday the 13th Part 3: 3D, the scoring of the user's preference changes, and thus the ranking of another romantic movie (Runaway Bride) rises to show up in the top-6 candidate list for user 0.</p>
        </sec>
        <sec id="sec-5">
          <title>5. Conclusion</title>
          <p>In this paper, we propose a novel ProRec framework for recommendation with implicit feedback. ProRec can effectively denoise the user-item interactions by flexibly serving as a plug-in, so as to further improve the recommendation accuracy for both non-deep and deep learning RS methods. We demonstrate the superior performance of ProRec in recommendation through extensive experiments and showcase its effectiveness in denoising user-item interactions through insightful case studies. Since we achieve denoising in a fully unsupervised fashion without accessing any additional user or item data, we expect ProRec to incur no more negative societal impact than any basic recommender system.</p>
        </sec>
        <sec id="sec-references">
          <title>References</title>
          <p>[1] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling dwell time to predict click-level satisfaction, in: WSDM, 2014.
[2] H. Lu, M. Zhang, W. Ma, C. Wang, F. Xia, Y. Liu, L. Lin, S. Ma, Effects of user negative experience in mobile news streaming, in: SIGIR, 2019.
[3] B. Yang, S. Lee, S. Park, S.-g. Lee, Exploiting various implicit feedback for collaborative filtering, in: WWW, 2012.
[4] N. Courty, R. Flamary, D. Tuia, A. Rakotomamonjy, Optimal transport for domain adaptation, TPAMI 39 (2017) 1853–1865.
[5] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denoising implicit feedback for recommendation, WSDM (2021).
[6] B. Bhushan Damodaran, B. Kellenberger, R. Flamary, D. Tuia, N. Courty, DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation, in: ECCV, 2018.
[7] V. Huynh, H. Zhao, D. Phung, OTLDA: A geometry-aware optimal transport approach for topic modeling, NIPS (2020).
[8] M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport (2013).
[9] R. Xu, P. Liu, Y. Zhang, F. Cai, J. Wang, S. Liang, H. Ying, J. Yin, Joint partial optimal transport for open set domain adaptation, in: IJCAI, 2020.
[10] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: ICDM, 2008.
[11] Y. Liu, Q. Liu, Y. Tian, C. Wang, Y. Niu, Y. Song, C. Li, Concept-aware denoising graph neural network for micro-video recommendation, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, 2021, pp. 1099–1108.
[12] R. Jagerman, H. Oosterhuis, M. de Rijke, To model or to intervene: A comparison of counterfactual and online learning to rank from user interactions, in: SIGIR, 2019.
[13] H. Lu, M. Zhang, S. Ma, Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading, in: SIGIR, 2018, pp. 435–444.
[14] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as treatments: Debiasing learning and evaluation, JMLR (2016).
[15] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, L. Fei-Fei, MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels, in: ICML, 2018.
[16] N. Kolkin, J. Salavon, G. Shakhnarovich, Style transfer by relaxed optimal transport and self-similarity, in: CVPR, 2019.
[17] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: ICML, 2015, pp. 957–966.
[18] L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, Y. Zhang, G. Wang, R. Zhang, L. Carin, Adversarial text generation via feature-mover's distance, in: NIPS, 2018.
[19] L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan, H. Zhang, B. Li, D. Shen, C. Chen, L. Carin, Improving sequence-to-sequence learning via optimal transport, in: ICLR, 2019.
[20] Y. Meng, X. Yan, W. Liu, H. Wu, J. Cheng, Wasserstein collaborative filtering for item cold-start recommendation, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 2020, pp. 318–322.
[21] Q. Xu, J. Xiong, X. Cao, Q. Huang, Y. Yao, From social to individuals: A parsimonious path of multi-level models for crowdsourced preference aggregation, TPAMI (2018).
[22] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, B. K. Sriperumbudur, Optimal kernel choice for large-scale two-sample tests, in: NIPS, 2012.
[23] S. Boyd, N. Parikh, E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers, 2011.
[24] N. Parikh, S. Boyd, Proximal algorithms, Foundations and Trends in Optimization (2014).
[25] N. Angelopoulos, J. Cussens, Exploiting informative priors for Bayesian classification and regression trees, in: IJCAI, 2005.
[26] J. Ding, Y. Quan, Q. Yao, Y. Li, D. Jin, Simplify and robustify negative sampling for implicit collaborative filtering (2020).
[27] Z.-H. Deng, L. Huang, C.-D. Wang, J.-H. Lai, S. Y. Philip, DeepCF: A unified framework of representation learning and matching function learning in recommender system, in: AAAI, volume 33, 2019, pp. 61–68.
[28] G. Wu, M. Volkovs, C. L. Soon, S. Sanner, H. Rai, Noise contrastive estimation for one-class collaborative filtering, in: SIGIR, 2019.
[29] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: RecSys, 2010.
[30] S. Sedhain, H. Bui, J. Kawale, N. Vlassis, B. Kveton, A. K. Menon, T. Bui, S. Sanner, Practical linear models for large-scale one-class collaborative filtering, in: IJCAI, 2016.
[31] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: WWW, 2018.
[32] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: SIGIR, 2020, pp. 639–648.
[33] Y. Yang, L. Wu, R. Hong, K. Zhang, M. Wang, Enhanced graph learning for collaborative filtering via mutual information maximization, in: SIGIR, 2021, pp. 71–80.
[34] Y. Gao, Y. Du, Y. Hu, L. Chen, X. Zhu, Z. Fang, B. Zheng, Self-guided learning to denoise for robust recommendation, in: SIGIR, 2022.
[35] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, Y. Xu, M. Coates, Non-parametric graph learning for Bayesian graph neural networks, PMLR (2020).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>