<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Partial Relaxed Optimal Transport for Denoised Recommendation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yanchao Tan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carl Yang</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiangyu Wei</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziyue Wu</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Weiming Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaolin Zheng</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fujian Key Laboratory of Network Computing and Intelligent Information Processing, Fuzhou University</institution>
          ,
          <addr-line>Fuzhou 350116</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>The College of Computer Science, Zhejiang University</institution>
          ,
          <addr-line>Hangzhou 310027</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The College of Computer and Data Science, Fuzhou University</institution>
          ,
          <addr-line>Fuzhou 350116</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>The Department of Computer Science, Emory University</institution>
          ,
<addr-line>Atlanta 30322, United States</addr-line>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>The School of Management, Zhejiang University</institution>
          ,
          <addr-line>Hangzhou 310058</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The interaction data used by recommender systems (RSs) inevitably include noise resulting from mistaken or exploratory clicks, especially under implicit feedback. Without proper denoising, RS models cannot effectively capture users' intrinsic preferences or the true interactions between users and items. To address such noise, existing methods mostly rely on auxiliary data, which are not always available. In this work, we ground on Optimal Transport (OT) to globally match a user embedding space and an item embedding space, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions without supervision. Specifically, we first leverage the OT framework via the Sinkhorn distance to compute continuous many-to-many user-item matching scores. Then, we relax the regularization in the Sinkhorn distance to achieve a closed-form solution with a reduced time complexity. Finally, to consider individual user behaviors during denoising, we develop a partial OT framework that adaptively relabels user-item interactions through a personalized thresholding mechanism. Extensive experiments show that our framework can significantly boost the performance of existing RS models.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>With the rapid growth of various activities on the Web, recommender systems (RSs) have become fundamental in helping users alleviate the problem of information overload. However, users can click some items by mistake or out of curiosity, and many RSs will also recommend some less relevant items for exploration every now and then. Take movie watching for example. Since a user cannot always distinguish horrible movies from romantic ones by movie names, he/she can easily click a horrible movie by mistake among the many romantic movies he/she usually watches. In this case, a traditional method like CF cannot effectively reduce the probability of recommending the horrible movie unless the noisy clicks are extremely rare, which may further seduce even more noisy clicks.</p>
      <p>Recently, to get out of such mistakes, existing research discovers users' intrinsic preferences with the help of external user behaviors [1], auxiliary item features [2] or extra feedback [3]. A key limitation is their reliance on auxiliary data, which costs significant effort to obtain and is not always attainable in RSs. Without supervision, can we develop a framework to automatically denoise user-item interactions for more accurate recommendations?</p>
      <p>In this work, we approach the denoised recommendation problem in two steps: 1. distinguishing intrinsic and noisy interactions; 2. stressing intrinsic preferences and reducing noisy ones for recommendation. Inspired by the principle of least modeling effort in an unsupervised fashion [4, 5], we refer to noises as minor abnormal data that need higher modeling effort than the intrinsic ones. Based on the above rationale, we propose to distinguish noises by ranking the global matching cost between the user and item embedding spaces, and to flexibly integrate this with both non-deep and deep RS methods. In this way, the key to distinguishing noises boils down to finding a global matching matrix with the minimum matching cost. Subsequently, we advocate a principled denoised RS framework and calculate the matching matrix grounding on Optimal Transport (OT), which has been introduced to address the unsupervised matching problem between two distributions or spaces in many fields [6, 7]. By minimizing the overall cost of the mismatched user-item pairs, OT finds a global optimal matching matrix, whose values represent the matching cost (i.e., modeling effort): a low user-item matching cost indicates a user's intrinsic preferences, while a high value indicates noisy ones.</p>
      <p>However, to apply OT to achieve denoised recommendation, three problems still remain: 1. Unlike the one-hot constraint in previous applications of OT, the nature of RSs requires a many-to-many matching; 2. The large number of users and items makes the matching process time-consuming; 3. A mechanism is needed to properly leverage the learned matching matrix and finally eliminate the effect of noises for denoised recommendation.</p>
      <p>DL4SR'22: Workshop on Deep Learning for Search and Recommendation, co-located with the 31st ACM International Conference on Information and Knowledge Management (CIKM), October 17-21, 2022, Atlanta, USA. * Corresponding author. 21831010@zju.edu.cn (W. Liu); xlzheng@zju.edu.cn (X. Zheng)</p>
<p>To address these challenges, we first leverage the OT framework via the Sinkhorn distance [8]. Unlike a one-hot matching matrix that satisfies the requirements of many widely studied tasks (e.g., transfer learning [6, 9]), we compute a continuous many-to-many approximation to the discrete OT, so as to meet the nature of RSs. Furthermore, we propose to relax the regularization in Sinkhorn to achieve a closed-form solution, which reduces the time complexity for large-scale data from O(max(M, N)^3) to O(max(M, N)^2). Then, we rank the learned continuous matching cost to differentiate intrinsic and noisy interactions for each user. Finally, we take into account that individual users' behaviors can vary in RSs, which should be handled differently during denoising. To this end, we design a personalized thresholding mechanism for relabeling user-item interactions, which efficiently finds a splitting point for each individual user according to the ranked matching cost.</p>
      <p>Experiment results on one real-world dataset with synthesized noises demonstrate our proposed ProRec's ability of denoising under various levels of noises. Further experiments on three original real-world datasets also verify the advantages of ProRec, demonstrating significant performance boosts upon multiple popular RS models in terms of effectiveness and efficiency.</p>
      <p>OT aims to find a matching between two distributions or spaces, and has been most widely used in transfer learning: it can match one instance in the source domain to another in the target domain with the least cost. Existing works mainly study discrete OT for one-to-one matching. For example, in computer vision, OT has been applied to domain adaptation [4, 6] and style transfer [16]. OT has also been successfully applied in many natural language processing tasks, such as topic modeling [7, 17], text generation [18], and sequence-to-sequence learning [19]. Considering the effect of non-matched pairs, [9] proposed a partial one-to-one matching that leverages one shared threshold to address domain shift for domain adaptation, instead of taking all pairs into account.</p>
      <p>However, such discrete one-to-one matching cannot satisfy the nature of many-to-many user-item relationships in RSs, where each user can interact with many items and each item can interact with many users. Though OT is applied to the cold-start recommendation problem in [20], they directly measure user-item similarity without taking the effect of noises into consideration. Moreover, partial matching with a shared threshold cannot reflect individual user behaviors, and thus leads to suboptimal recommendation performance.</p>
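<p>To make the continuous many-to-many matching concrete, the following is a minimal sketch (not the authors' implementation) of entropy-regularized OT via Sinkhorn iterations; the toy cost matrix, the uniform marginals, and the entropy weight <italic>eps</italic> are assumptions for illustration:</p>

```python
import numpy as np

def sinkhorn(M_cost, a, b, eps=0.5, n_iters=500):
    """Entropy-regularized OT: returns a dense (many-to-many) transport
    plan whose row sums match a and whose column sums approximate b."""
    K = np.exp(-M_cost / eps)          # kernel derived from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):           # alternating marginal scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
M_cost = rng.random((4, 6))            # toy user-item matching costs
a = np.full(4, 1 / 4)                  # user-side marginal
b = np.full(6, 1 / 6)                  # item-side marginal
T = sinkhorn(M_cost, a, b)
```

<p>Unlike a one-hot assignment, every user spreads transport mass over several items, which is exactly the many-to-many behavior required by RSs.</p>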
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
<p>Many research studies have pointed out that there exists a large proportion of noisy interactions in implicit feedback (i.e., clicks) [10, 1, 11]. Such feedback is easily affected by different factors, such as position bias [12], caption bias [13] and exposure bias [10, 14]. Therefore, there exist large gaps between the implicit feedback and the actual user preferences, which are detrimental to users' continuous behaviors and overall satisfaction [2, 5].</p>
      <p>In this section, we detail the proposed Partial relaxed optimal Transport for the denoised Recommendation (ProRec) framework, as shown in Figure 1. Specifically, we first formulate denoised recommendation as a two-step process. Then, we introduce the Optimal Transport (OT) framework for distinguishing noises through global user-item matching. Furthermore, we elaborate on our proposed ProRec framework, which can be applied to achieve denoised recommendation.</p>
<p>To alleviate the above issues, many researchers have considered incorporating auxiliary feedback into the identification of noises in implicit feedback, such as dwell time [1] and scroll intervals [13]. These works usually design features manually and label items with external help (e.g., from domain experts) [13, 2], which requires extensive effort and cost and is not always available across RSs. To this end, sheer observations on training losses have recently been explored to denoise implicit feedback during training in an unsupervised fashion [5]; however, that method is purely heuristic and can be integrated with deep learning methods only. Based on the principle of least modeling effort in different fields [4, 15], our proposed framework aims to denoise the user-item interactions without supervision, which is supported by Optimal Transport (OT) theory with guaranteed optimality in the global matching and can be flexibly integrated with both deep and non-deep RS methods.</p>
      <p>3.1. Problem Statement</p>
      <p>Following the setting of recommendation, we denote the user-item interaction matrix as X ∈ {0, 1}^{M×N}, where M denotes the number of users and N denotes the number of items. We construct user-item relationships across a user embedding space U and an item embedding space V. The problem is that the user-item pairs contain some noisy interactions which need to be relabeled.</p>
      <p>In this work, we approach denoised recommendation through two steps. Firstly, we distinguish intrinsic and noisy interactions across U and V by a global matching matrix T*. Then, to stress users' intrinsic preferences and reduce noises, we rank the learned T* and keep only the top k interactions of each user via a personalized thresholding mechanism. In this way, we learn a relabeled matrix X̃ ∈ R^{M×N} for denoised recommendation. (Figure 1 illustrates the interaction matrix X, users' intrinsic preferences, and the relabeled data for example users.)</p>
<sec id="sec-3-2">
        <title>3.2. Global Matching for Distinguishing Noises</title>
<p>Here 1_M ∈ R^{1×M} and 1_N ∈ R^{1×N} are vectors with all elements equal to 1, M is the number of users, N is the number of items, and T* denotes the optimal transport plan of Eq. 1.</p>
<p>Specifically, the matching cost matrix ℳ directly reflects the modeling effort: a low user-item matching cost indicates users' intrinsic preferences, while high values indicate noises. Based on the definition of ℳ, we can integrate the denoising process with both non-deep and deep RS models, as in Eqs. 2 and 3.</p>
        <sec id="sec-3-2-1">
<p>As motivated in Section 1, a framework that can denoise user-item interactions for more accurate recommendation is in great need. However, without supervision, it is hard to define noisy interactions and then stress intrinsic preferences while reducing noisy ones.</p>
          <p>To distinguish noises in an unsupervised fashion, several works [4, 15, 21, 5] have pointed out a principled way to locate noises: finding the data that needs the most modeling effort or cost. Inspired by this principle of least modeling effort, in this work we refer to the noises as minor abnormal data. Compared with the interactions from intrinsic preferences, noises take much more effort or cost to model.</p>
          <p>ℒ(U, V) = ||X − UV^⊤||² + λ(||U||₂² + ||V||₂²), (2)</p>
          <p>ℒ(U, V) = BCE(X, f(U, V)) + λ(||U||₂² + ||V||₂²), (3)</p>
          <p>where U ∈ R^{M×D} and V ∈ R^{N×D} represent the user and item embeddings, f denotes a classification network, and the hyperparameter λ balances the loss and the regularization term. In non-deep models, Eq. 2 can be solved effectively through many methods (e.g., Alternating Least Squares). Since the prediction scores represent the user-item similarities, we calculate the cost matrix as ℳ = −UV^⊤ using the inner product, so that ℳ has negative values. In deep models, we can optimize Eq. 3 by adopting the widely used binary cross-entropy loss (i.e., BCE(·, ·)), and the analogous calculation gives ℳ = −f(U, V).</p>
<p>Based on the above least-effort rationale, we propose a novel denoising framework to distinguish noises by ranking the global matching between the user and item embedding spaces. The framework serves as a plug-in, so as to be flexibly integrated into both non-deep and deep RS methods. In this scenario, the key to distinguishing noises boils down to finding a global user-item matching across the user space U and the item space V with minimal matching cost.</p>
<p>To address the unsupervised matching problem between two spaces, we advocate a principled denoised RS framework grounded on Optimal Transport (OT), which has been successfully applied in many fields such as CV [6] and NLP [7]. As studied theoretically in [22], the OT formulation derived from the Earth Mover's Distance (EMD) can be expressed as</p>
          <p>min_{T ∈ Π(a, b)} Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij} T_{ij},  s.t.  T 1_N^⊤ = a^⊤,  1_M T = b,  (1)</p>
          <p>where Π(a, b) denotes the joint probability distribution of the user embedding space U and the item embedding space V, and the cost ℳ_{ij} = −f(u_i, v_j) measures the mismatch between user i and item j.</p>
          <p>However, there are three technical challenges when
applying OT to denoised recommendation:
Challenge 1. In RSs, each user can interact with many
items and each item can interact with many users, which
corresponds to a many-to-many interaction matrix. Unlike
previous applications of OT in CV or NLP, directly
applying the one-hot constraint specified in Eq. 1 is
counterintuitive for recommendation.</p>
          <p>Challenge 2. In RSs, the number of users and items in
the real world tends to be large, making the matching
process time-consuming.</p>
<p>Challenge 3. Besides distinguishing noises, it is necessary to design a mechanism to eliminate the effect of noisy interactions. A naïve way is to set one global thresholding hyperparameter k to keep, for each user, the top k items with the least matching cost. However, a global thresholding hyperparameter cannot accommodate individual user behaviors.</p>
          <p>To meet the many-to-many nature of RSs, we soften the discrete formulation in Eq. 1 with an entropy term and obtain Eq. 4:</p>
          <p>min_T Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij} T_{ij} + εH(T),  s.t.  T 1_N^⊤ = a^⊤,  1_M T = b,  (4)</p>
          <p>where H(T) = Σ_{i=1}^{M} Σ_{j=1}^{N} T_{ij}(log(T_{ij}) − 1), 1_M ∈ R^{1×M} and 1_N ∈ R^{1×N} are vectors with all elements equal to 1, and ε is a hyperparameter to balance the entropy regularization. Using the Lagrange multiplier method, the stationarity condition is ∂ℒ/∂T_{ij} = ℳ_{ij} + ε log T_{ij} − f_i − g_j = 0, where f and g are the Lagrangian multipliers. In this case, we can obtain the solution of T in closed form.</p>
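<p>The stationarity condition can be sanity-checked numerically: solving ℳ_ij + ε log T_ij − f_i − g_j = 0 for T gives T_ij = exp((f_i + g_j − ℳ_ij)/ε). A toy check with arbitrary multipliers (all values below are assumptions, not learned quantities):</p>

```python
import numpy as np

rng = np.random.default_rng(2)
eps = 0.1
M_cost = rng.random((3, 4))              # toy cost matrix
f, g = rng.random(3), rng.random(4)      # toy Lagrangian multipliers

# Solve the first-order condition for T in closed form
T = np.exp((f[:, None] + g[None, :] - M_cost) / eps)

# The gradient of the Lagrangian w.r.t. each T_ij should vanish
grad = M_cost + eps * np.log(T) - f[:, None] - g[None, :]
```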
        </sec>
        <sec id="sec-3-2-2">
<p>Then, in order to find the optimal solution, we let the gradient ∂ℒ/∂T = 0. Based on Eq. 6 and using the Lagrange multiplier method, the Lagrangian ℒ augments the objective Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij}T_{ij} + εH(T) with the marginal-constraint terms, which leads to the closed-form solution in Eq. 7, where ⊘ corresponds to element-wise division.</p>
        </sec>
        <sec id="sec-3-2-3">
<title>3.3. The ProRec Framework</title>
          <p>In this section, we propose the Partial relaxed optimal Transport for denoised Recommendation (ProRec) framework to address the above three challenges, building on the closed-form solution for the global optimal matching matrix. Given each user's personalized threshold, we relabel the interactions through a monotonically increasing function of the gap between an item's matching score and the user's threshold value, controlled by a hyperparameter.</p>
        </sec>
        <sec id="sec-3-2-4">
<p>Finally, to make the relabeling process more flexible, we propose to use a hyperparameter α to control the degree of relabeling and obtain a new interaction matrix.</p>
          <p>Following the same derivation, μ₁ represents the mean value of the top s items, while μ₂ represents the mean of the rest n − s items. After obtaining the index of the splitting point for each user, we have the corresponding relabeled interactions in Eq. 20.</p>
          <p>For real data experiments, we use Amazon music (AMusic)1, Amazon toys (AToy)1, and Yahoo2. These datasets have been widely adopted in previous literature [27, 5, 28], and their statistics are summarized in Table 1. We do a 5:2:3 splitting for them following [28]. We evaluate the ability to distinguish noises by Hit Ratio, and the recommendation performance by normalized discounted cumulative gain (NDCG), mean average precision (MAP), and Recall.</p>
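<p>For reference, the ranking metrics used here can be sketched as follows (a minimal per-user version; tie-breaking and truncation conventions vary across papers):</p>

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k ranking."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant))))
    return dcg / ideal

ranked = ["i1", "i2", "i3"]        # toy ranking for one user
relevant = {"i1", "i3"}            # toy ground-truth items
r2 = recall_at_k(ranked, relevant, 2)
n2 = ndcg_at_k(ranked, relevant, 2)
```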
        </sec>
<sec id="sec-3-2-5">
          <p>1: https://github.com/familyld/DeepCF/tree/master/Data. 2: https://webscope.sandbox.yahoo.com/catalog.php?datatype=r</p>
        </sec>
        <sec id="sec-3-2-6">
<p>Note that the closed-form solution yields a time complexity of O(max(M, N)²). This shows that, by relaxing the OT formulation to keep the auxiliary limitation from only one side, our model achieves a lower time complexity than the O(max(M, N)³) reported in [6], which uses OT for one-to-one matching.</p>
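<p>A sketch of the one-sided relaxation follows. The kernel transform K = exp(−ℳ/ε) and the exact max-combination are assumptions mirroring the closed form in Algorithm 1: each argument of the max enforces one marginal exactly, and only a few matrix products are needed, hence the O(max(M, N)²) cost:</p>

```python
import numpy as np

def relaxed_ot(M_cost, a, b, eps=0.5):
    """Closed-form relaxed plan (no Sinkhorn iterations):
    T* = max(K diag(b / colsum(K)), diag(a / rowsum(K)) K)."""
    K = np.exp(-M_cost / eps)                 # assumed kernel transform
    col = K * (b / K.sum(axis=0))[None, :]    # column sums equal b exactly
    row = K * (a / K.sum(axis=1))[:, None]    # row sums equal a exactly
    return np.maximum(col, row)

rng = np.random.default_rng(4)
M_cost = rng.random((5, 7))
a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)
T_star = relaxed_ot(M_cost, a, b)
```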
<p>Personalized thresholding mechanism for denoised recommendation. To flexibly discriminate the intrinsic and noisy user-item interactions for individual users, we propose a non-parametric personalized thresholding mechanism. We first normalize the matching cost for each user u as T*′_{u,i} = T*_{u,i} / Σ_j T*_{u,j}, and rank the scores as T̂_u = sort(T*′_u). According to the definition of the matching matrix, each row represents a user's preferences towards the items. We define k = {k_1, . . . , k_M} as the indices of the thresholds which can filter out users' noisy preferences, where k_u denotes user u's threshold. As shown in Figure 1, some users with consistent (narrow) interests would benefit from a reasonably low threshold, since a few items can already represent their preferences and more items can introduce more noises.</p>
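<p>The per-user normalization and ranking step can be sketched as follows (a toy plan is assumed in place of a learned T*):</p>

```python
import numpy as np

rng = np.random.default_rng(5)
T_star = rng.random((4, 6))                          # toy matching matrix
T_norm = T_star / T_star.sum(axis=1, keepdims=True)  # normalize per user
ranked = np.sort(T_norm, axis=1)[:, ::-1]            # scores sorted descending
```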
        </sec>
        <sec id="sec-3-2-7">
<p>Meanwhile, some users with diverse (wide) interests would require the threshold to be relatively high, to model a broader range of preferences and satisfy the need for exploration. To find a personalized splitting point k_u for each user according to the learned matrix, the model requires an automatic splitting mechanism.</p>
        </sec>
        <sec id="sec-3-2-8">
<p>Inspired by the threshold selection mechanism used in CART [25], we efficiently learn the personalized thresholds k_u for different users by minimizing the following sum of square errors:</p>
          <p>k_u = arg min_s Σ_{i=1}^{s} (T̂_{u,i} − μ₁)² + Σ_{i=s+1}^{n} (T̂_{u,i} − μ₂)², (20)</p>
          <p>where T̂_{u,i} is the i-th value of the sorted T̂_u and the index s ∈ {1, 2, · · · , n − 1}. μ₁ = (1/s) Σ_{i=1}^{s} T̂_{u,i} represents the mean value of the top s items, while μ₂ represents the mean of the rest n − s items. After obtaining the indices of the splitting points k, we relabel the user-item interactions and update the final matrix as X̃ = αX̂ + (1 − α) ⊙ X (Eq. 22), where ⊙ corresponds to the element-wise product. Considering that the noises in sparse positive interactions would induce worse effects in recommendation [5], in this work we only target the part of the original dataset where X = 1. In this way, the newly generated X̃ can keep users' intrinsic preferences and reduce the noisy ones. We summarize the learning procedure of our proposed ProRec in Algorithm 1, which is proved to guarantee local convergence for denoised recommendation.</p>
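<p>The SSE-based splitting point of Eq. 20 can be sketched with a brute-force search over candidate split indices (a per-user helper; the function name is an assumption):</p>

```python
import numpy as np

def split_point(sorted_scores):
    """Return the index minimizing the two-segment sum of squared errors,
    in the spirit of CART's threshold search (Eq. 20)."""
    t = np.asarray(sorted_scores, dtype=float)
    best_k, best_sse = 1, np.inf
    for k in range(1, len(t)):
        left, right = t[:k], t[k:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_k, best_sse = k, sse
    return best_k

# Two clearly separated score groups: the split lands between them
k_u = split_point([0.9, 0.85, 0.8, 0.1, 0.05])
```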
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
<p>In this section, we first conduct controlled experiments with synthetic noises on one dataset, so as to investigate ProRec's performance under different levels of noisy interactions. Then, we evaluate ProRec's performance on three original real-world datasets of recommendation with implicit feedback.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Settings</title>
        <p>4.1.1. Dataset and evaluation protocols.</p>
<p>We use ML-100k for the noise-controlled experiments and do a 4:1 random splitting for train/test data following [26].</p>
        <sec id="sec-4-1-1">
          <p>To simulate the noise, we randomly select 5%/10%/15%/20% of ground-truth records and flip their labels in the train set. The selected records can be regarded as noises during training.</p>
        </sec>
        <sec id="sec-4-1-2">
<title>Algorithm 1: Partial Relaxed Optimal Transport for Denoised Recommendation (ProRec)</title>
          <p>Data: the noisy user-item interactions X, user popularity a and item popularity b. Result: denoised user embeddings U and item embeddings V, and a relabeled matrix X̃.</p>
          <p>1 while not converged do: 2 Step 1, update U and V: 3 if learning U and V via non-deep methods, update U and V via ||X − UV^⊤||² + λ(||U||₂² + ||V||₂²) in Eq. 2 and calculate ℳ = −UV^⊤; 4 else update U and V via BCE(X, f(U, V)) + λ(||U||₂² + ||V||₂²) in Eq. 3, where BCE(·, ·) is the binary cross-entropy loss and f is a classification network, then calculate ℳ = −f(U, V); 5 Step 2, update X̃: model the denoised recommendation problem by minimizing Σ_{i=1}^{M} Σ_{j=1}^{N} ℳ_{ij}T_{ij} + εH(T), where T 1_N^⊤ = a^⊤ and 1_M T = b, in Eq. 4; compute the global optimal matching matrix T* by the closed-form solution T* = max(ℳ̃ diag(b ⊘ 1_M ℳ̃), diag(a ⊘ ℳ̃1_N^⊤) ℳ̃) in Eq. 7; compute personalized thresholds k by the automatic splitting mechanism in Eq. 20; relabel the user-item interactions by sorting T* in Eq. 21; update X̃ by X̃ = αX̂ + (1 − α) ⊙ X in Eq. 22.</p>
          <p>4.1.2. Methods for comparison</p>
          <p>Further deep learning baselines include: VAECF [31] – Variational Autoencoder for Collaborative Filtering, one of the state-of-the-art deep-learning-based recommendation algorithms [28]; LightGCN [32] – devises a light graph convolution for training efficiency and generation ability in recommendation scenarios; EGLN [33] – enhances the graph learning and node embedding modules iteratively based on mutual information maximization.</p>
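<p>One sweep of Algorithm 1 can be sketched end-to-end as follows. This is a toy stand-in, not the paper's exact procedure: the sizes, the smoothed popularity marginals, the Sinkhorn loop in place of the closed form, and the naive per-user threshold rule are all assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
X = (rng.random((6, 8)) < 0.4).astype(float)          # noisy interactions
U = 0.1 * rng.standard_normal((6, 3))
V = 0.1 * rng.standard_normal((8, 3))
lam, lr, eps, alpha = 0.01, 0.01, 0.5, 0.5

for _ in range(200):                                   # Step 1: update U, V (Eq. 2)
    E = U @ V.T - X
    gU = 2 * E @ V + 2 * lam * U
    gV = 2 * E.T @ U + 2 * lam * V
    U -= lr * gU
    V -= lr * gV

M_cost = -U @ V.T                                      # cost from embeddings
K = np.exp(-M_cost / eps)
a = (X.sum(axis=1) + 1) / (X.sum() + 6)                # smoothed user popularity
b = (X.sum(axis=0) + 1) / (X.sum() + 8)                # smoothed item popularity
u = np.ones(6)
for _ in range(300):                                   # Step 2: matching plan
    v = b / (K.T @ u)
    u = a / (K @ v)
T = u[:, None] * K * v[None, :]

X_hat = np.zeros_like(X)                               # keep top plan mass per user
for i in range(6):
    k = max(1, int(X[i].sum()))                        # naive per-user threshold
    X_hat[i, np.argsort(-T[i])[:k]] = 1.0

X_new = np.where(X == 1, alpha * X_hat + (1 - alpha) * X, X)   # Eq. 22-style relabel
```

<p>Only observed positives (X = 1) are touched, matching the paper's choice to relabel within the positive part of the data.</p>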
<p>Besides the above methods, T-CE [5] and SGDL [34] are existing frameworks for denoised recommendation. The first discards the large-loss samples with a dynamic threshold in each iteration, and the second collects memorized interactions at the early stage of training and leverages those data as denoising signals to guide the following training. Both are designed as plug-ins for deep learning methods. We add CDAE + T-CE and LightGCN + SGDL as baselines to compare the denoising ability with the proposed ProRec, as these combinations were shown to achieve the best performance in their original works [5, 34].</p>
<p>For all the algorithms (except for CDAE + T-CE and LightGCN + SGDL), we add the proposed ProRec on top of them, allowing both non-deep and deep RS models to discriminate intrinsic and noisy interactions in an unsupervised fashion. Specifically, we have SVD + ProRec, NCE + ProRec, CDAE + ProRec, WRMF + ProRec, VAECF + ProRec, LightGCN + ProRec, and EGLN + ProRec, respectively.</p>
          <p>4.1.3. Implementation details</p>
        </sec>
        <sec id="sec-4-1-3">
<p>Implementations of the compared baselines are from open-source projects (SVD3, NCE3, CDAE3, WRMF3, VAE-CF3, LightGCN4, EGLN5, T-CE6 and SGDL7). We optimize ProRec with standard Adam and tune all hyperparameters through grid search: the learning rate in {0.0005, 0.001, 0.005, 0.01}, α in {0.25, 0.5, 0.75, 1.0}, ε in {0.05, 0.075, 0.1, 0.125, 0.15, 0.175}, k in {1, 5, 10, 20, 50}, λ in {0.0005, 0.001, 0.005, 0.01}, and the batch size in {100, 250, 500, 1000} for different datasets. We also carefully tuned the hyperparameters of all baselines through cross-validation to achieve their best performance.</p>
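<p>For scale, the full grid over the stated ranges (the symbols α, ε, k, λ are this section's labels for the four tuned hyperparameters) has a fixed size:</p>

```python
from itertools import product

learning_rates = [0.0005, 0.001, 0.005, 0.01]
alphas = [0.25, 0.5, 0.75, 1.0]
eps_values = [0.05, 0.075, 0.1, 0.125, 0.15, 0.175]
ks = [1, 5, 10, 20, 50]
lambdas = [0.0005, 0.001, 0.005, 0.01]
batch_sizes = [100, 250, 500, 1000]

# Exhaustive grid over all stated ranges
grid = list(product(learning_rates, alphas, eps_values, ks, lambdas, batch_sizes))
```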
        </sec>
        <sec id="sec-4-1-4">
<p>We compare two types of algorithms for recommendation. The first type is non-deep algorithms, which include: SVD [29] – a similarity-based method that constructs a similarity matrix through SVD decomposition of the implicit matrix X; NCE [28] – noise-contrastive-estimation item embeddings combined with a personalized user model based on PLRec [30]. The second type is deep learning methods, including: CDAE [10] – Collaborative Denoising Autoencoder, a deep learning method specifically optimized for implicit feedback recommendation tasks; WRMF [10] – Weighted Regularized Matrix Factorization, which is a deep MF framework.</p>
          <p>3: https://github.com/wuga214/NCE_Projected_LRec. 4: https://github.com/gusye1234/LightGCN-PyTorch. 5: https://github.com/yimutianyang/SIGIR2021-EGLN. 6: https://github.com/WenjieWWJ/DenoisingRec. 7: https://github.com/zealscott/SGDL</p>
        </sec>
      </sec>
<sec id="sec-4-3">
        <title>4.2. Controlled Experiments (Synthetic Noises)</title>
<sec id="sec-4-3-1">
          <p>Before looking at the performance of recommendation, we first inspect ProRec's ability to distinguish noises.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>We compare the recommendation results of the proposed</title>
          <p>ProRec to those of the baseline models. Table 2 shows
the NDCG, MAP, and Recall scores on three datasets. We
Since we know the index of the randomly added noises in have the following observations.
the dataset, we evaluate the level of distinguished noises In three diferent domains, by integrating ProRec with
by Hit Ratio (HR) of the real added noises in Figure 2(a). both non-deep and deep methods, we can consistently
imSpecifically, we observe a consistent high HR of NCE + prove strong baseline models and achieve start-of-the-art
ProRec around 90% when the level of noises increases. performance. Since the graph-based model can leverage
Furthermore, we test Recall of NCE and NCE + ProRec high-order relations among users and items, EGLN
outunder diferent levels of noises. It can be clearly observed performs all baselines on three datasets. Furthermore, our
that NCE + ProRec always performs better than NCE, as proposed EGLN + ProRec outperforms all deep-learning
the former can accurately relabel some noises to elimi- baselines on three datasets, ranging from 3.34% (achieved
nate the efect of noises. For example, when the noise on AToy on Recall@5) to 6.38% (achieved on Yahoo on
level is at 20%, NCE + ProRec significantly improves NCE Recall@5), while NCE + ProRec outperforms all non
deepfrom 0.1579 to 0.1596. learning baselines on three datasets, ranging from 1.91%
(achieved on AToy on MAP@5) to 7.97% (achieved on
4.3. Real Data Experiments AMusic on Recall@5). All of these show that ProRec is
capable of efective recommendation on diferent datasets.
4.3.1. Overall Performance Comparison One step further, we observe that the methods based
on the proposed ProRec framework all outperform the
original models, which indicates the advantage of the
denoising process. Compared with existing denoising</p>
<p>frameworks (i.e., T-CE and SGDL), ProRec can not only be flexibly integrated with both deep and non-deep RS methods but also gain more improvements.</p>
          <p>Moreover, on top of existing RS methods, our proposed ProRec in Eq. 4 does not introduce significant computations. In the third and sixth columns of Table 3, we compare the models' training time under the same experimental setting (e.g., embedding dimension). As can be observed, the training times of the proposed NCE + ProRec and LightGCN + ProRec are within the same scale as those of NCE and LightGCN on the AMusic dataset.</p>
          <p>Note that the improvements brought by the denoising process on the denser datasets (e.g., the performance gains between the 0% and 5% noise levels on ML-100k shown in Figure 2(a)) are smaller than those on the sparse datasets (e.g., the performance gains between EGLN + ProRec and EGLN on Yahoo in Table 2). Although we can explicitly eliminate some effects of noisy interactions (as shown in Figure 2(a)), they have only a minor impact on the learned embeddings when there are enough interactions in the observed data, which is consistent with the results in recent studies [35].</p>
          <p>Table 3: Impact of different model components of ProRec on AMusic.
Method     N@5     R@5     Time  |  Method     N@5     R@5     Time
NCE        0.0375  0.0276   64s  |  EGLN       0.0422  0.0386   82s
+EMD       0.0360  0.0259   98s  |  +EMD       0.0406  0.0371  124s
+Sinkhorn  0.0384  0.0281  132s  |  +Sinkhorn  0.0431  0.0398  174s
+ROT       0.0383  0.0281   76s  |  +ROT       0.0430  0.0392  109s
+ProRec    0.0396  0.0298   82s  |  +ProRec    0.0445  0.0406  118s</p>
          <p>4.3.2. Model Ablation and Hyperparameter Study</p>
          <p>In this section, three groups of ablation studies are conducted to evaluate the impacts of our proposed many-to-many matching, relaxed regularization, and personalized thresholding mechanisms, as well as the effects of the hyperparameters.</p>
          <p>The effect of continuous many-to-many matching. Compared with the many-to-many matching of NCE + Sinkhorn, it is ineffective to directly adopt one-to-one matching (i.e., NCE + EMD) due to the removal of many matched interactions. For example, NCE + Sinkhorn significantly improves NCE from 0.0276 to 0.0281 (on Recall@5 on AMusic), while both metrics of NCE + EMD are lower than those of the original NCE.</p>
          <p>The efficiency of relaxed regularization. By balancing performance and runtime via relaxing the optimized constraints, NCE + ROT reduces the runtime by around half while keeping almost the same NDCG and Recall on AMusic and Yahoo.</p>
          <p>The effect of the personalized thresholding mechanism. We first use one global threshold for ROT to study the effectiveness of the personalized thresholding mechanism. In the experiment, we search the global threshold over {5, 10, 15, 20, 25} and find that NCE + ROT achieves the best performance when the threshold is 10. Compared with NCE + ROT, NCE + ProRec can individually discriminate intrinsic preferences and noise without tuning the threshold hyperparameter, and thus further improves the performance of the partial OT framework. For example, NCE + ProRec achieves 0.0298 on Recall@5 on the AMusic dataset, which outperforms the 0.0281 of NCE + ROT.</p>
          <p>The effect of hyperparameters. Our proposed ProRec framework introduces four hyperparameters, two of which we empirically set to 20 and 0.001; of the remaining two, one controls the degree of the relabeling and the other controls the sparsity of the matching matrix. Taking NCE + ProRec as an example (shown in Figures 2(b)-2(c)), their optimal values on AMusic are found to be 0.5 and 0.1, respectively. In general, we can set the relabeling parameter to 0.5 for denser datasets and 0.25 for sparser datasets, and always set the sparsity parameter to 0.1.</p>
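The continuous many-to-many matching compared above builds on entropic optimal transport [8]. As a generic sketch only (not the paper's exact formulation; the cost matrix, marginals, and regularization strength here are toy assumptions), Sinkhorn scaling produces a dense transport plan in which every user spreads mass over several items, in contrast to the one-to-one assignments of EMD:

```python
import numpy as np

def sinkhorn(C, a, b, eps=0.5, n_iters=500):
    """Entropic-regularized OT (Cuturi, 2013): returns a dense
    many-to-many plan P with row marginals a and column marginals b."""
    K = np.exp(-C / eps)              # Gibbs kernel of the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # rescale columns toward b
        u = a / (K @ v)               # rescale rows toward a
    return u[:, None] * K * v[None, :]

# Toy example: 3 "users" matched to 4 "items" with random costs.
rng = np.random.default_rng(0)
C = rng.random((3, 4))
a = np.full(3, 1 / 3)                 # uniform user marginal
b = np.full(4, 1 / 4)                 # uniform item marginal
P = sinkhorn(C, a, b)

# Every entry of P is strictly positive: each user is softly matched
# to all items (many-to-many), unlike a one-to-one EMD assignment.
print(np.allclose(P.sum(axis=1), a), np.allclose(P.sum(axis=0), b))
```

Lowering `eps` sharpens the plan toward one-to-one matchings, which is the trade-off the NCE + EMD vs. NCE + Sinkhorn comparison above probes.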
        </sec>
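The contrast between a global threshold and the personalized thresholding mechanism can be made concrete with a toy sketch (the matrix values and the mean-based per-user cutoff are illustrative assumptions, not the paper's mechanism):

```python
import numpy as np

# Toy matching scores for 2 users over 4 items.
P = np.array([
    [0.50, 0.30, 0.15, 0.05],   # user 0: two strong matches, one clear outlier
    [0.28, 0.25, 0.24, 0.23],   # user 1: nearly uniform preferences
])

tau = 2                                          # global threshold: top-2 per user
global_keep = np.argsort(-P, axis=1)[:, :tau]    # same count for every user

# "Personalized" cutoff (illustrative): keep items scoring above
# that user's own mean score, so the kept count adapts per user.
personal_keep = [np.where(row > row.mean())[0] for row in P]

print(global_keep.tolist())                  # [[0, 1], [0, 1]]
print([k.tolist() for k in personal_keep])   # [[0, 1], [0]]
```

The global rule retains exactly `tau` items for everyone, while the per-user rule keeps two items for user 0 but only one for user 1, mirroring the observation above that different numbers of items are replaced for different users.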
<sec id="sec-4-3-3">
          <title>4.3.3. Denoising Case Study</title>
          <p>To demonstrate the advantages of the denoising process, we visualize the interaction matrix given by ML-100k on the left and the learned matrix of the proposed NCE + ProRec on the right (as shown in Figure 3). The numbers in the boxes are actual movie IDs. The colors in the original interaction matrix (left) are all the same, indicating uniform weights, whereas those in the relabeled matrix (right) are different. The different depths of blue represent the matching cost (the deeper, the lower) and white boxes denote the intrinsically preferred items. As we can see from the different colors, our many-to-many matching matrix effectively discriminates intrinsic and noisy interactions. Moreover, we can also observe the personalized thresholding mechanism at work, as different numbers of items are replaced for different users.</p>
          <p>Furthermore, we show several movies relevant to user 0. Since most of user 0's interacted movies are from the romance category, such as Titanic and Shakespeare in Love, the horror movie is abnormal and is unlikely to reflect her/his intrinsic preference. NCE + ProRec automatically learns to highlight the two romantic movies. After down-weighing Friday the 13th Part 3: 3D, the scoring of the user's preference changes, and thus the ranking of another romantic movie (Runaway Bride) rises to show up in the top-6 candidate list for user 0.</p>
        </sec>
        <sec id="sec-5">
          <title>5. Conclusion</title>
          <p>In this paper, we propose a novel ProRec framework for recommendation with implicit feedback. ProRec can effectively denoise the user-item interactions by flexibly serving as a plug-in, so as to further improve the recommendation accuracy for both non-deep and deep learning RS methods. We demonstrate the superior performance of ProRec in recommendation through extensive experiments and showcase its effectiveness in denoising user-item interactions through insightful case studies. Since we achieve denoising in a fully unsupervised fashion without accessing any additional user or item data, we expect ProRec to incur no more negative societal impact than any basic recommender system.</p>
        </sec>
        <sec id="sec-references">
          <title>References</title>
          <p>[1] Y. Kim, A. Hassan, R. W. White, I. Zitouni, Modeling dwell time to predict click-level satisfaction, in: WSDM, 2014.
[2] H. Lu, M. Zhang, W. Ma, C. Wang, F. Xia, Y. Liu, L. Lin, S. Ma, Effects of user negative experience in mobile news streaming, in: SIGIR, 2019.
[3] B. Yang, S. Lee, S. Park, S.-g. Lee, Exploiting various implicit feedback for collaborative filtering, in: WWW, 2012.
[4] N. Courty, R. Flamary, D. Tuia, A. Rakotomamonjy, Optimal transport for domain adaptation, TPAMI 39 (2017) 1853–1865.
[5] W. Wang, F. Feng, X. He, L. Nie, T.-S. Chua, Denoising implicit feedback for recommendation, WSDM (2021).
[6] B. Bhushan Damodaran, B. Kellenberger, R. Flamary, D. Tuia, N. Courty, DeepJDOT: Deep joint distribution optimal transport for unsupervised domain adaptation, in: ECCV, 2018.
[7] V. Huynh, H. Zhao, D. Phung, OTLDA: A geometry-aware optimal transport approach for topic modeling, NIPS (2020).
[8] M. Cuturi, Sinkhorn distances: Lightspeed computation of optimal transport (2013).
[9] R. Xu, P. Liu, Y. Zhang, F. Cai, J. Wang, S. Liang, H. Ying, J. Yin, Joint partial optimal transport for open set domain adaptation, in: IJCAI, 2020.
[10] Y. Hu, Y. Koren, C. Volinsky, Collaborative filtering for implicit feedback datasets, in: ICDM, 2008.
[11] Y. Liu, Q. Liu, Y. Tian, C. Wang, Y. Niu, Y. Song, C. Li, Concept-aware denoising graph neural network for micro-video recommendation, in: Proceedings of the 30th ACM International Conference on Information &amp; Knowledge Management, 2021, pp. 1099–1108.
[12] R. Jagerman, H. Oosterhuis, M. de Rijke, To model or to intervene: A comparison of counterfactual and online learning to rank from user interactions, in: SIGIR, 2019.
[13] H. Lu, M. Zhang, S. Ma, Between clicks and satisfaction: Study on multi-phase user preferences and satisfaction for online news reading, in: SIGIR, 2018, pp. 435–444.
[14] T. Schnabel, A. Swaminathan, A. Singh, N. Chandak, T. Joachims, Recommendations as treatments: Debiasing learning and evaluation, JMLR (2016).
[15] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, L. Fei-Fei, MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels, in: ICML, 2018.
[16] N. Kolkin, J. Salavon, G. Shakhnarovich, Style transfer by relaxed optimal transport and self-similarity, in: CVPR, 2019.
[17] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: ICML, 2015, pp. 957–966.
[18] L. Chen, S. Dai, C. Tao, H. Zhang, Z. Gan, D. Shen, Y. Zhang, G. Wang, R. Zhang, L. Carin, Adversarial text generation via feature-mover's distance, in: NIPS, 2018.
[19] L. Chen, Y. Zhang, R. Zhang, C. Tao, Z. Gan, H. Zhang, B. Li, D. Shen, C. Chen, L. Carin, Improving sequence-to-sequence learning via optimal transport, in: ICLR, 2019.
[20] Y. Meng, X. Yan, W. Liu, H. Wu, J. Cheng, Wasserstein collaborative filtering for item cold-start recommendation, in: Proceedings of the 28th ACM Conference on User Modeling, Adaptation and Personalization, 2020, pp. 318–322.
[21] Q. Xu, J. Xiong, X. Cao, Q. Huang, Y. Yao, From social to individuals: A parsimonious path of multi-level models for crowdsourced preference aggregation, TPAMI (2018).
[22] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, B. K. Sriperumbudur, Optimal kernel choice for large-scale two-sample tests, in: NIPS, 2012.
[23] S. Boyd, N. Parikh, E. Chu, Distributed optimization and statistical learning via the alternating direction method of multipliers, 2011.
[24] N. Parikh, S. Boyd, Proximal algorithms, Foundations and Trends in Optimization (2014).
[25] N. Angelopoulos, J. Cussens, Exploiting informative priors for Bayesian classification and regression trees, in: IJCAI, 2005.
[26] J. Ding, Y. Quan, Q. Yao, Y. Li, D. Jin, Simplify and robustify negative sampling for implicit collaborative filtering (2020).
[27] Z.-H. Deng, L. Huang, C.-D. Wang, J.-H. Lai, S. Y. Philip, DeepCF: A unified framework of representation learning and matching function learning in recommender system, in: AAAI, volume 33, 2019, pp. 61–68.
[28] G. Wu, M. Volkovs, C. L. Soon, S. Sanner, H. Rai, Noise contrastive estimation for one-class collaborative filtering, in: SIGIR, 2019.
[29] P. Cremonesi, Y. Koren, R. Turrin, Performance of recommender algorithms on top-n recommendation tasks, in: RecSys, 2010.
[30] S. Sedhain, H. Bui, J. Kawale, N. Vlassis, B. Kveton, A. K. Menon, T. Bui, S. Sanner, Practical linear models for large-scale one-class collaborative filtering, in: IJCAI, 2016.
[31] D. Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: WWW, 2018.
[32] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, M. Wang, LightGCN: Simplifying and powering graph convolution network for recommendation, in: SIGIR, 2020, pp. 639–648.
[33] Y. Yang, L. Wu, R. Hong, K. Zhang, M. Wang, Enhanced graph learning for collaborative filtering via mutual information maximization, in: SIGIR, 2021, pp. 71–80.
[34] Y. Gao, Y. Du, Y. Hu, L. Chen, X. Zhu, Z. Fang, B. Zheng, Self-guided learning to denoise for robust recommendation, in: SIGIR, 2022.
[35] S. Pal, S. Malekmohammadi, F. Regol, Y. Zhang, Y. Xu, M. Coates, Non-parametric graph learning for Bayesian graph neural networks, PMLR (2020).</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>