
Triplet losses-based matrix factorization for robust recommendations

Flavio Giobergia

flavio.giobergia@polito.it
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy

Much like other learning-based models, recommender systems can be affected by biases in the training data. While typical evaluation metrics (e.g. hit rate) are not concerned with them, some categories of final users are heavily affected by these biases. In this work, we propose using multiple triplet loss terms to extract meaningful and robust representations of users and items. We empirically evaluate the soundness of such representations through several “bias-aware” evaluation metrics, as well as in terms of stability to changes in the training set and agreement between the variance of the predictions and that of each user.

Keywords: recommender systems, matrix factorization, contrastive learning

1. Introduction

Recommender systems are a fundamental part of almost any experience of online users. The possibility of recommending options tailored to each individual user is one of the key contributors to the success of many companies and services. The metrics that are commonly used in the literature to evaluate these models (e.g. hit rate) are typically only concerned with the overall quality of the model, regardless of its behavior on particular partitions of the data. As a result, recommender systems typically learn the preferences of the “majority”, which in turn implies a poorer quality of recommendations for users/items that belong to the long tail of the distribution. In an effort to steer the research focus toward addressing this problem, the EvalRS challenge [1] has been proposed. This challenge, based on the RecList framework [2], proposes a recommendation problem with a multi-faceted evaluation, where the quality of any solution is not only evaluated in terms of overall performance, but also based on the results obtained on various partitions of users and items.

In this paper, we present a possible recommender system that addresses the problem proposed by EvalRS. The solution is based on matrix factorization, with an objective function that aligns users and items in the same embedding space. The matrices are learned by minimizing a loss function that includes multiple triplet loss terms. Differently from what is typically done (i.e. aligning an anchor user to a positive and a negative item), in this work we propose additionally using triplet terms for users and items separately.

The full extent of the challenge is described in detail in [1]. In short, the goal of the challenge is to recommend [...]

EvalRS at CIKM 2022

2. Methodology

In this section we present the proposed methodology, highlighting the main aspects of interest. No data preprocessing has been applied to the original data, although some approaches have been attempted (see Section 4). The proposed methodology, as explained below, allows ranking all items based on estimated compatibility with any given user. We produce the final list of recommendations by stochastically selecting items from the ordered list of songs, weighting each song with the inverse of its position in the list.
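As an illustration of the final selection step, the following minimal sketch draws recommendations from a ranked list with probability proportional to the inverse of each song's position. The function name `sample_recommendations`, the use of NumPy, and the choice of sampling without replacement are assumptions made for this example; the text does not specify them.

```python
import numpy as np


def sample_recommendations(ranked_songs, k, rng=None):
    """Draw k distinct songs from a ranked list, weighting each by 1 / position."""
    rng = np.random.default_rng() if rng is None else rng
    ranked_songs = np.asarray(ranked_songs)
    # 1-based positions: the best-ranked song has position 1.
    positions = np.arange(1, len(ranked_songs) + 1)
    weights = 1.0 / positions
    probs = weights / weights.sum()
    # Assumption: sampling without replacement, so no song is recommended twice.
    return rng.choice(ranked_songs, size=k, replace=False, p=probs)
```

With this weighting, songs at the top of the ranking are still the most likely to be selected, while lower-ranked songs retain a non-zero chance of appearing in the final list.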

2.1. Loss definition

Matrix factorization techniques have long been known to achieve high performance in various recommendation challenges [ 4 ]. This approach consists in aligning vector representations for two separate entities, users and items (songs, in this case). This alignment task is a recurring one: a commonly adopted approach to solving this problem is through the optimization of a triplet loss [ 5 ].

A triplet loss is a loss that requires identifying an anchor point, as well as a positive and a negative point, i.e. points that should either lie close to (positive) or far from (negative) the anchor point.

Users and songs can thus be projected to a common embedding space in a way that users are placed close to songs they like and away from songs they do not like.

1 https://github.com/fgiobergia/CIKM-evalRS-2022

This can be done by choosing a user as the anchor, and two songs as the positive and negative points. A reasonable choice for the positive song is one that has been listened to by the user. The choice for the negative song is trickier. Random songs, or songs not listened to by the user, are possible choices. However, more sophisticated strategies can be adopted to choose negative points that are difficult for the model to separate from the anchor. These are called hard negatives and have been shown in the literature to be beneficial to the training of models [6].

We decided to use a simple policy for the selection of a negative song: a negative song for user $u$ is extracted from the pool of songs that have been listened to by one of the nearest neighbors of $u$ and have not been listened to by $u$. By doing so, we aim to reduce the extent to which the model relies on other users’ preferences to make a recommendation. Neighboring users are identified by comparing the embedding representations of all users. Due to the computational cost of this operation, it is only performed at the beginning of each training epoch.
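The negative selection policy described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `nearest_neighbors` and `sample_negative`, the use of cosine similarity over a dense user-user matrix, and the `None` fallback for an empty pool are all hypothetical choices, not details taken from the original implementation.

```python
import numpy as np


def nearest_neighbors(user_emb, n_neighbors):
    """Indices of the n_neighbors most cosine-similar users, for every user."""
    normed = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a user is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :n_neighbors]


def sample_negative(user, neighbors, listened, rng):
    """Pick a song listened to by a neighbor of `user`, but not by `user`."""
    pool = set()
    for v in neighbors[user]:
        pool |= listened[int(v)]
    pool -= listened[user]
    if not pool:
        return None  # assumption: the caller decides how to handle an empty pool
    return int(rng.choice(sorted(pool)))
```

Here `listened` is assumed to map each user index to the set of song ids they listened to. Since the full similarity matrix is quadratic in the number of users, recomputing the neighbors once per epoch, as the text describes, keeps the cost manageable.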

We can thus define the triplets $(u_a, s_p, s_n)$ to be used for the definition of a triplet loss. Here, $u_a$ represents the vector for the anchor user, whereas $s_p$ and $s_n$ represent the vectors for the positive and negative songs respectively.

Similar approaches where users are aligned to songs they did or did not like are Bayesian Personalized Ranking (BPR) [7], where negative songs are sampled randomly, and WARP [8], where negative items are sampled so as to be “hard” (based on their proximity to the anchor w.r.t. the positive item). To improve the robustness of the representations built, we are additionally interested in aligning similar songs and similar users. To this end, we introduce two additional triplet terms to the loss function, one that is based on $(s_a, s_p, s_n)$ and one on $(u_a, u_p, u_n)$. Based on the previously defined concepts, we choose $s_a$ as a song listened to by $u_a$, and $u_p$ and $u_n$ as users who respectively listened to $s_p$ and $s_n$. Other alternatives have been considered, but were ultimately not selected due to a higher computational cost.

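We define the final loss as:

$$
\begin{aligned}
\mathcal{L} = \sum w \big( &\max\{d(u_a, s_p) - d(u_a, s_n) + m_0, 0\} \\
{}+{} &\lambda_1 \max\{d(s_a, s_p) - d(s_a, s_n) + m_1, 0\} \\
{}+{} &\lambda_2 \max\{d(u_a, u_p) - d(u_a, u_n) + m_2, 0\} \big)
\end{aligned}
\tag{1}
$$

where $d(\cdot)$ is a distance function between any pair of vectors. In this work, the cosine distance is used. $m_i$ is a margin enforced between positive and negative pairs. In this work, since all elements are projected to a common embedding space, we used $m_0 = m_1 = m_2$. $\lambda_1$ and $\lambda_2$ weigh the song-song and user-user terms relative to the user-song term. Finally, $w$ is a weight that is assigned to each entry, which is discussed in Subsection 2.2.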
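To make the structure of the loss concrete, the sketch below evaluates a weighted sum of three hinge-style triplet terms (user-song, song-song, user-user) with the cosine distance, on batches of row vectors. The function names, the argument names (`lam1`, `lam2` for the weights of the two additional terms, `margin` for the shared margin) and the NumPy formulation are assumptions for illustration; an actual training loop would use an automatic differentiation framework.

```python
import numpy as np


def cosine_distance(a, b):
    """Row-wise cosine distance, 1 - cos(a, b)."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return 1.0 - num / den


def hinge(anchor, pos, neg, margin):
    """Row-wise max{d(anchor, pos) - d(anchor, neg) + margin, 0}."""
    gap = cosine_distance(anchor, pos) - cosine_distance(anchor, neg)
    return np.maximum(gap + margin, 0.0)


def triplet_mf_loss(u_a, u_p, u_n, s_a, s_p, s_n, w, margin, lam1, lam2):
    """Weighted sum of user-song, song-song and user-user triplet terms."""
    user_song = hinge(u_a, s_p, s_n, margin)
    song_song = hinge(s_a, s_p, s_n, margin)
    user_user = hinge(u_a, u_p, u_n, margin)
    return float(np.sum(w * (user_song + lam1 * song_song + lam2 * user_user)))
```

Each row of the inputs corresponds to one training entry, and `w` holds the per-entry weight. A term contributes zero whenever the positive element is already closer to its anchor than the negative one by at least the margin.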

[Figure: effect of the user-song, song-song and user-user loss terms on the embedding vectors learned. Each panel shows anchor, positive and negative users and songs. Arrow directions represent whether elements are pulled towards or pushed away from the anchors.]

2.2. Popularity weight

To make the minority entities more relevant, we adopted a weighting scheme that modulates the previously described loss so as to weigh rows more if they belong to “rarer” entities and less if they belong to common ones. In accordance with [1], we identified five factors to be taken into account.

Based on these, a coefficient has been defined for each entry in the training set. The final weight is given by a weighted average of these coefficients. The following is a list of factors, along with the way the respective coefficients have been computed (logarithms are used for factors that follow a power law distribution). All coefficients are normalized to sum to 1 across the respective population.

• Gender: in accordance with the original dataset, a relevance coefficient is provided for the categories male, female, and undisclosed (footnote 2). The coefficient is proportional to the inverse of the number of occurrences of each gender in the list of known users.
• Country: the coefficient related to the country is calculated as the inverse of the logarithm of the number of occurrences of the specific country of the users in the training set.
• Artist popularity: a proxy for the popularity of an artist is obtained by the number of times songs by that artist have been played in the training set. The inverse logarithm of this quantity is used as the coefficient.
• Song popularity: a proxy for the popularity of a song is provided by the number of times that song has been played in the training set. The inverse logarithm of this quantity is used as the coefficient.
• User activity: the overall activity of a user can be quantified in terms of the number of songs that they have listened to across the training set. The inverse logarithm of this quantity is used as the coefficient.

2 This simplified perspective on gender does not reflect that of the [...]

The weighted sum of the above-mentioned coefficients constitutes the weight $w$ in Equation 1. The weights used for each coefficient have been searched as a part of the tuning of the model and are presented in Section 3.

2.3. Model initialization

The initial values assigned to the users’ and items’ vectors greatly affect the entire learning process. A good initialization can make the convergence process faster and/or allow reaching a better minimum. We used initial vectors for users and items based on an adaptation of the word2vec algorithm [9]. We built a corpus of sentences, one for each known song, composed of the users who listened to that song, artists and albums, all in the form token-type=token-value (e.g. song=1234). We then trained word2vec to learn representations for all of the tokens involved. We used as initial vectors the vectors obtained for the user and song tokens.

Word2vec places tokens close in the embedding space based on their adoption in similar contexts. For this reason, based on the definition of sentences, this approach already brings close users with similar tastes (in terms of songs, artists, albums), as well as similar songs (in terms of users that listened to them, artists that produced them, albums they are found in).

2.4. Model consistency

As we will discuss in Section 3, we empirically observed that the proposed solution presents high variance in the performance obtained across the various folds. While this is not directly measured as a part of the core metrics of EvalRS, we still believe it is important to account for this aspect in a well-rounded evaluation.

We therefore introduce the consistency metric, which quantifies the variance of the model when tested across multiple folds, or datasets. A higher variance in performance would be associated with a lower consistency (or higher inconsistency). For a single metric, the consistency could be defined as the variance of the metric across the folds. However, when multiple metrics are involved (as is the case with this competition), a normalization step should be introduced. We thus instead use the coefficient of variation, defined as the standard deviation divided by the mean value, to quantify the inconsistency of a model with respect to a metric $k$. We compute the consistency for a metric as 1 - inconsistency. The overall consistency is therefore computed as the mean consistency across all metrics:

$$
C = \frac{1}{|\mathcal{M}|} \sum_{k \in \mathcal{M}} \left(1 - \frac{\sigma_k}{|\mu_k|}\right) \tag{2}
$$

where $\mathcal{M}$ represents the set of all metrics used, while $\mu_k$ and $\sigma_k$ are the arithmetic mean and standard deviation computed over all the folds, for a metric $k$. We use the absolute value of the mean to make the results comparable regardless of sign. Alternatively, the ratio $\sigma_k^2 / \mu_k^2$ could be used to assign a lower penalty in case of small deviations. The maximum possible consistency, 1, would be assigned to a model that presents the same exact performance across all folds for all metrics. Section 3 reports the consistency, measured in these terms, for the proposed solution.

2.5. Variance agreement

Different users may have different interests in terms of variety. In the “music” context a user may, for example, listen to songs from very few authors, whereas others may be more interested in a wider variety of artists. A similar concept may be applied to other contexts (e.g. in terms of brand loyalty for products). It is therefore desirable that a recommender system should provide a wider variety of recommendations for users that are inclined to them, and vice versa. We introduce the concept of variance agreement w.r.t. a variable, which quantifies how the variance in recommendations correlates to each user’s interest in variance, as dictated by their previous interactions, in terms of the variable of interest. In this context, we use the artists that produced songs as the variable of interest.

We quantify the variance of a set of songs as the Gini impurity over that set, where each song is mapped to the respective artist. We can thus assign an impurity to any given user, $g_u$, as the impurity within the set of songs they listened to in the training set. For that same user, we can define the model’s impurity, $\hat{g}_u$, as the impurity of the set of songs recommended by the model for that user.
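The Gini impurity used in Section 2.5 to quantify the variance of a set of songs can be sketched as below. The helper names `gini_impurity` and `user_impurity` and the `song_to_artist` mapping are hypothetical; the sketch also assumes that each song in the set counts once, which the text does not state explicitly.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity, 1 - sum_i p_i^2, of a collection of labels."""
    counts = Counter(labels)
    n = sum(counts.values())
    if n == 0:
        raise ValueError("cannot compute the impurity of an empty collection")
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def user_impurity(songs, song_to_artist):
    """Impurity of a set of songs, each mapped to its artist."""
    return gini_impurity(song_to_artist[s] for s in songs)
```

The same function can be applied both to the songs a user listened to in the training set and to the songs recommended by the model, yielding the two quantities that are then compared.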

If $g_u$ is low, the user listens to a limited set of artists (if $g_u = 0$, the user has only listened to one artist in all of their interactions). Similarly, if $\hat{g}_u$ is low, the model is recommending songs from a limited set of artists.

To measure the agreement between users’ and model’s variance, we compute the Pearson correlation on the paired data $[(g_u, \hat{g}_u) \mid u \in U]$, with $U$ being the set of all users. ([...] the artists the user listens to are the same ones being recommended): that information may be quantified by other metrics concerned with the accuracy of models, rather than their suitability over a heterogeneous set of [...]

3. Experimental results

In this section we present the results obtained in terms of the main metrics identified by [1], as well as some additional considerations on the proposed solution.

The model has been trained and fine-tuned to identify well-performing values for the main hyperparameters. The best configuration of parameters found is reported in Table 1.

[Table 1: best configuration of hyperparameters found. The table layout is not recoverable. Surviving parameter labels: $m_0 = m_1 = m_2$, $\lambda_1$, $\lambda_2$. Surviving values: 128, 2.5, 2.5, 0.25, 5, 100, 104, 105, 104.]

[...] that outperforms all others across all metrics. This is due to the multi-faceted nature of the overall score function.

Despite the efforts made toward reducing the effect of the dataset imbalances on the final model, we still observed that the performance of the model is not always consistent. In other words, there is a relatively high variance in the performance across the various folds. [...] provides a very interesting perspective on the strengths and weaknesses of the proposed solution. In particular, the model is highly inconsistent for some of the fairness-oriented metrics, as highlighted by the low consistency obtained for track popularity and gender. While this does not necessarily imply poor performance, it is a symptom that the model may be susceptible to fluctuations in performance as the dataset used for training is changed. Other metrics, such as the behavioral and the “standard” ones, show instead a more consistent behavior.
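For reference, the consistency metric of Section 2.4 (the mean over metrics of one minus the coefficient of variation across folds) can be computed along the following lines. The function name `consistency`, the dictionary input format and the use of the population standard deviation are assumptions made for this sketch.

```python
import numpy as np


def consistency(scores_by_metric):
    """Mean over metrics of 1 - std / |mean|, each computed across folds."""
    per_metric = []
    for name, scores in scores_by_metric.items():
        scores = np.asarray(scores, dtype=float)
        mu = scores.mean()
        sigma = scores.std()  # assumption: population standard deviation (ddof=0)
        if mu == 0:
            raise ValueError(f"metric {name!r} has zero mean across folds")
        per_metric.append(1.0 - sigma / abs(mu))
    return float(np.mean(per_metric))
```

A model with identical per-fold values for every metric obtains a consistency of 1, while metrics whose standard deviation exceeds the magnitude of their mean contribute negative values.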

We additionally evaluated the proposed methodology in terms of variance agreement for the “artist” variable. We achieved an agreement of 0.2479, whereas a random model would achieve ≈ 0. This indicates that the model does take into account, to some extent, the individual user’s variance preference. However, there is room for improvement in these terms.
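Given the per-user impurities of the listening history and of the recommendations, the variance agreement reduces to a Pearson correlation over the paired values. The sketch below assumes the two impurity arrays have already been computed and are aligned by user; the function name `variance_agreement` is hypothetical.

```python
import numpy as np


def variance_agreement(user_impurity, model_impurity):
    """Pearson correlation between users' and model's impurities."""
    x = np.asarray(user_impurity, dtype=float)
    y = np.asarray(model_impurity, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two aligned arrays with at least two users")
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the Pearson coefficient between x and y.
    return float(np.corrcoef(x, y)[0, 1])
```

Note that the correlation is undefined (NaN) if either array is constant, for example if the model recommends equally diverse lists to every user.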

3.1. Ablation study

To understand the effect of the various choices made, we introduce an ablation study where we remove some portions of the proposed methodology. In particular, we study the situations where (1) no user-user loss is considered, (2) no item-item loss is considered, (3) a random initialization is used instead of word2vec and (4) all training records are weighted the same, regardless of their rarity. Table 4 reports the overall score for all situations. From this we can observe that all proposed approaches bring a benefit to the overall result, with the removal of the additional loss terms being the most important. It should be noted, however, that this ablation study has been carried out using the hyperparameters that produced the best performance for the “full” approach. As such, the ablated performance may be affected by a lack of hyperparameter fine-tuning, thus possibly resulting in a lower score.

[Table 4: Ablation study of the proposed solution. Performance is reported in terms of the overall score adopted for the competition. The table values are not recoverable.]

4. Failed approaches

“I have not failed. I’ve just found 10,000 ways that won’t work.”
Thomas A. Edison

In this section we describe some attempts that have been made, but that have not brought any improvement in terms of performance.

Entity resolution: the list of known songs contains some duplicates. We tried using a naive entity resolution approach (songs with matching artists and matching titles are considered to be the same song). Since this problem affected only a small fraction (a few percent) of songs, the ER step did not produce any significant improvement and has thus been discarded.

Dataset resampling: we attempted to resample the training set with a weighting scheme similar to the one already used to weigh each training sample based on its uniqueness. Worse performance has been observed as a result of this approach: it can be argued that the reason for this is that the resampling outright removes some samples from the training set (points that are never sampled), whereas using a weight for each row makes sure that all rows are seen during training.

Data augmentation: to increase the breadth of the data available, we tried to synthesize new user-song interactions, to be then used for training. In particular, we first quantified the proclivity of users to listen to a limited number of artists, by means of the Gini impurity (the more homogeneous the choice of artists, the lower the Gini index). We can then sample users based on this factor, and add user-song relationships, where songs are chosen to belong to the most “likely artists” (i.e. the artists that are more commonly listened to by each sampled user).

5. Conclusions

In this paper we presented a possible solution to the EvalRS challenge. The solution uses matrix factorization based on multiple triplet losses combined together to align users and songs in the same space. A weighting scheme has been introduced to assign more importance to uncommon users/items, thus improving the quality of the model in terms of fairness. By introducing the consistency metric, we show some of the main weaknesses of the proposed approach: namely, the fact that it is not consistent w.r.t. some metrics. We consider this to be one of the main problems to be addressed. We additionally covered some of the failed attempts made, in the hope that others will either not make the same mistakes, or figure out how to improve upon them.

Acknowledgments

This work has been supported by the DataBase and Data Mining Group and the SmartData center at Politecnico di Torino.

3 A score of -100 has been assigned to solutions that did not reach a hit rate of 0.015.

References

[1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.

[2] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99-104.

[3] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, 2016, pp. 103-110.

[4] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30-37.

[5] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.

[6] H. Xuan, A. Stylianou, X. Liu, R. Pless, Hard negative examples are hard, but useful, in: European Conference on Computer Vision, Springer, 2020, pp. 126-142.

[7] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).

[8] J. Weston, S. Bengio, N. Usunier, Large scale image annotation: learning to rank with joint word-image embeddings, Machine Learning 81 (2010) 21-35.

[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).