Triplet losses-based matrix factorization for robust recommendations

Flavio Giobergia
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
flavio.giobergia@polito.it
EvalRS at CIKM 2022

Abstract
Much like other learning-based models, recommender systems can be affected by biases in the training data. While typical evaluation metrics (e.g. hit rate) are not concerned with them, some categories of final users are heavily affected by these biases. In this work, we propose using multiple triplet loss terms to extract meaningful and robust representations of users and items. We empirically evaluate the soundness of such representations through several "bias-aware" evaluation metrics, as well as in terms of stability to changes in the training set and agreement of the prediction variance with that of each user.

Keywords: recommender systems, matrix factorization, contrastive learning

1. Introduction

Recommender systems are a fundamental part of almost any experience of online users. The possibility of recommending options tailored to each individual user is one of the key contributors to the success of many companies and services. The metrics that are commonly used in the literature to evaluate these models (e.g. hit rate) are typically only concerned with the overall quality of the model, regardless of its behavior on particular partitions of the data. This results in recommender systems typically learning the preferences of the "majority", which in turn implies a poorer quality of recommendations for users/items that belong to the long tail of the distribution. The EvalRS challenge [1] was introduced in an effort to steer the research focus toward addressing this problem. The challenge, based on the RecList framework [2], proposes a recommendation problem with a multi-faceted evaluation, where the quality of any solution is evaluated not only in terms of overall performance, but also based on the results obtained on various partitions of users and items.

In this paper, we present a possible recommender system that addresses the problem proposed by EvalRS. The solution is based on matrix factorization, framing an objective function that aligns users and items in the same embedding space. The matrices are learned by minimizing a loss function that includes multiple triplet loss terms. Differently from what is typically done (i.e. aligning an anchor user to a positive and a negative item), in this work we propose additionally using triplet terms for users and items separately.

The full extent of the challenge is described in detail in [1]. In short, the goal of the challenge is to recommend songs to a set of users, given their previous interactions with other songs. The provided dataset is based on a subset of the openly available LFM-1b dataset [3]. The source code for the proposed solution has been made available on GitHub (https://github.com/fgiobergia/CIKM-evalRS-2022).

2. Methodology

In this section we present the proposed methodology, highlighting the main aspects of interest. No data preprocessing has been applied to the original data, although some approaches have been attempted (see Section 4). The proposed methodology, as explained below, allows ranking all items based on their estimated compatibility with any given user. We produce the final list of $k$ recommendations by stochastically selecting items from the ordered list of songs, weighting each song with the inverse of its position in the list, as sketched below.
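As an illustration, the following is a minimal sketch of this sampling step, assuming the candidate songs have already been sorted by decreasing estimated compatibility; the function name and the use of NumPy are ours, not taken from the paper's repository.

```python
import numpy as np

def sample_recommendations(ranked_song_ids, k, rng=None):
    """Stochastically draw k songs from a list sorted by decreasing
    estimated compatibility, weighting each song by the inverse of
    its (1-based) position in the list."""
    rng = rng or np.random.default_rng()
    weights = 1.0 / np.arange(1, len(ranked_song_ids) + 1)
    return rng.choice(ranked_song_ids, size=k, replace=False,
                      p=weights / weights.sum())
```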
2.1. Loss definition

Matrix factorization techniques have long been known to achieve high performance in various recommendation challenges [4]. This approach consists in aligning vector representations for two separate entities, users and items (songs, in this case). This alignment task is a recurring one: a commonly adopted approach to solving it is the optimization of a triplet loss [5]. A triplet loss requires identifying an anchor point, as well as a positive and a negative point, i.e. points that should respectively lie close to (positive) or far from (negative) the anchor point.

Users and songs can thus be projected to a common embedding space in such a way that users are placed close to songs they like and away from songs they do not like. This can be done by choosing a user as the anchor and two songs as the positive and negative points. A reasonable choice for the positive song is one that has been listened to by the user. The choice of the negative song is trickier. Random songs, or songs not listened to by the user, are possible choices. However, more sophisticated strategies can be adopted to choose negative points that are difficult for the model to separate from the anchor. These are called hard negatives and have been shown in the literature to be beneficial to the training of models [6].

We decided to use a simple policy for the selection of a negative song: a negative song for user $u$ is extracted from the pool of songs that have been listened to by one of the nearest neighbors of $u$ and have not been listened to by $u$. By doing so, we aim to reduce the extent to which the model relies on other users' preferences to make a recommendation. The concept of neighboring users is obtained by comparing the similarity between the embedding representations of all users. Due to the computational cost of this operation, it is only performed at the beginning of each training epoch. A possible implementation of this policy is sketched below.
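The following is a minimal sketch of this negative-sampling policy, assuming the user embeddings are stored row-wise in a NumPy matrix and the listening histories in a mapping of sets; all names, and the choice of 10 neighbors, are illustrative rather than taken from the actual implementation.

```python
import numpy as np

def nearest_neighbors(user_emb, n_neighbors=10):
    """Recompute every user's nearest neighbors by cosine similarity.
    Run once per epoch, since it compares all pairs of users."""
    normed = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a user is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :n_neighbors]

def sample_negative(u, neighbors, listened, rng):
    """Draw a song listened to by a neighbor of user u but not by u."""
    for v in rng.permutation(neighbors[u]):
        candidates = list(listened[v] - listened[u])
        if candidates:
            return rng.choice(candidates)
    return None  # no candidate found: fall back to e.g. a random unseen song
```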
We can thus define the triplets $(u_i^a, s_i^p, s_i^n)$ to be used for the definition of a triplet loss. Here, $u_i^a$ represents the vector for the anchor user, whereas $s_i^p$ and $s_i^n$ represent the vectors for the positive and negative songs, respectively. Similar approaches where users are aligned to songs they did or did not like are Bayesian Personalized Ranking (BPR) [7], where negative songs are sampled randomly, and WARP [8], where negative items are sampled so as to be "hard" (based on their proximity to the anchor w.r.t. the positive item).

To improve the robustness of the representations built, we are additionally interested in aligning similar songs and similar users. To this end, we introduce two additional triplet terms in the loss function, one based on $(s_i^a, s_i^p, s_i^n)$ and one on $(u_i^a, u_i^p, u_i^n)$. Based on the previously defined concepts, we choose $s_i^a$ as a song listened to by $u_i^a$, and $u_i^p$ and $u_i^n$ as users who respectively listened to $s_i^p$ and $s_i^n$. Other alternatives have been considered, but were ultimately not selected due to their higher computational cost.

We define the final loss as:

$$
\mathcal{L} = \sum_i w_i \Big( \max\{d(u_i^a, s_i^p) - d(u_i^a, s_i^n) + m_0,\, 0\}
+ \lambda_1 \max\{d(u_i^a, u_i^p) - d(u_i^a, u_i^n) + m_1,\, 0\}
+ \lambda_2 \max\{d(s_i^a, s_i^p) - d(s_i^a, s_i^n) + m_2,\, 0\} \Big) \tag{1}
$$

where $d(\cdot)$ is a distance function between any pair of vectors; in this work, the cosine distance is used. $m_j$ is a margin enforced between positive and negative pairs. Since all elements are projected to a common embedding space, we used $m_0 = m_1 = m_2$. Finally, $w_i$ is a weight assigned to each entry, which is discussed in Subsection 2.2.

Figure 1 summarizes the effect of the various terms of the loss on the embedding vectors learned.

[Figure 1: Representation of the action of each part of the loss (the user-song, user-user, and song-song terms acting on the anchor, positive, and negative users and songs) on the vectors. Arrow directions represent whether elements are pulled towards or pushed away from the anchors.]
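For concreteness, a minimal PyTorch sketch of Equation (1) follows; batched embedding tensors are assumed, the function names are ours, and the default values of the margin and of $\lambda_1$, $\lambda_2$ are those later reported in Table 1.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    # d(x, y) = 1 - cos(x, y), computed row-wise over a batch
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def full_loss(u_a, s_p, s_n, u_p, u_n, s_a, w,
              m=0.25, lambda1=2.5, lambda2=2.5):
    """Equation (1): weighted sum of the user-song, user-user and
    song-song triplet terms, with a shared margin m = m0 = m1 = m2."""
    l_us = torch.clamp(cosine_distance(u_a, s_p) - cosine_distance(u_a, s_n) + m, min=0)
    l_uu = torch.clamp(cosine_distance(u_a, u_p) - cosine_distance(u_a, u_n) + m, min=0)
    l_ss = torch.clamp(cosine_distance(s_a, s_p) - cosine_distance(s_a, s_n) + m, min=0)
    return (w * (l_us + lambda1 * l_uu + lambda2 * l_ss)).sum()
```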
2.2. Popularity weight

To make minority entities more relevant, we adopted a weighting scheme that modulates the previously described loss so as to weigh rows more if they belong to "rarer" entities and less if they belong to common ones. In accordance with [1], we identified five factors to be taken into account. Based on these, a coefficient has been defined for each entry in the training set. The final weight is given by a weighted average of these coefficients. The following is a list of the factors, along with the way the respective coefficients have been computed (logarithms are used for factors that follow a power-law distribution). All coefficients are normalized to sum to 1 across the respective population.

• Gender ($\theta_{gender}$): in accordance with the original dataset, a relevance coefficient is provided for the categories male, female, and undisclosed². The coefficient is proportional to the inverse of the occurrences of each gender in the list of known users.
• Country ($\theta_{country}$): the coefficient related to the country is calculated as the inverse of the logarithm of the number of occurrences of the specific country of the users in the training set.
• Artist popularity ($\theta_{artist}$): a proxy for the popularity of an artist is given by the number of times songs by that artist have been played in the training set. The inverse logarithm of this quantity is used as the coefficient.
• Song popularity ($\theta_{song}$): a proxy for the popularity of a song is given by the number of times that song has been played in the training set. The inverse logarithm of this quantity is used as the coefficient.
• User activity ($\theta_{activity}$): the overall activity of a user can be quantified in terms of the number of songs that they have listened to across the training set. The inverse logarithm of this quantity is used as the coefficient.

² This simplified perspective on gender does not reflect that of the author.

The weighted sum of the above-mentioned coefficients constitutes the weight $w_i$ in Equation 1. The weights used for each coefficient have been searched as a part of the tuning of the model and are presented in Section 3.

2.3. Model initialization

The initial values assigned to the users' and items' vectors greatly affect the entire learning process. A good initialization can make the convergence process faster and/or allow reaching a better minimum. We used initial vectors for users and items based on an adaptation of the word2vec algorithm [9]. We built a corpus of sentences, one for each known song, composed of the users who listened to that song, along with its artist and album, all in the form token-type=token-value (e.g. song=1234). We then trained word2vec to learn representations for all of the tokens involved, and used the vectors obtained for the user and song tokens as initial vectors.

Word2vec places tokens close in the embedding space based on their adoption in similar contexts. For this reason, based on the definition of the sentences, this approach already brings close together users with similar tastes (in terms of songs, artists, and albums), as well as similar songs (in terms of the users that listened to them, the artists that produced them, and the albums they are found in).
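A minimal sketch of this initialization follows, assuming the gensim implementation of word2vec and a toy catalog; the data layout and variable names are ours, and the word2vec hyperparameters other than the vector size are illustrative defaults.

```python
from gensim.models import Word2Vec

# Toy catalog: song id -> (artist id, album id, ids of listeners).
catalog = {
    1234: ("a1", "al1", [10, 11]),
    5678: ("a1", "al2", [11, 12]),
}

# One "sentence" per known song: the users who listened to it, plus its
# artist and album, all encoded as token-type=token-value strings.
sentences = [
    [f"song={sid}", f"artist={artist}", f"album={album}"]
    + [f"user={u}" for u in listeners]
    for sid, (artist, album, listeners) in catalog.items()
]

# vector_size matches the embedding dimension d = 128 of Table 1.
w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=1)

# The learned token vectors serve as initial user/song embeddings.
user_init = {u: w2v.wv[f"user={u}"] for u in (10, 11, 12)}
song_init = {s: w2v.wv[f"song={s}"] for s in catalog}
```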
2.4. Model consistency

As we will discuss in Section 3, we empirically observed that the proposed solution presents high variance in the performance obtained across the various folds. While this is not directly measured as a part of the core metrics of EvalRS, we still believe it is important to account for this aspect in a well-rounded evaluation.

We therefore introduce the consistency metric, which quantifies the variance of the model when tested across multiple folds, or datasets. A higher variance in performance would be associated with a lower consistency (or a higher inconsistency). For a single metric, the inconsistency could be defined as the variance of the metric across the folds. However, when multiple metrics are involved (as is the case with this competition), a normalization step should be introduced. We thus instead use the coefficient of variation, defined as the standard deviation divided by the mean value, to quantify the inconsistency of a model with respect to a metric $m$. We compute the consistency for a metric as 1 - inconsistency. The overall consistency is therefore computed as the mean consistency across all metrics:

$$
c = \frac{1}{|M|} \sum_{m \in M} \left( 1 - \frac{\sigma_m}{|\mu_m|} \right) \tag{2}
$$

where $M$ represents the set of all metrics used, while $\mu_m$ and $\sigma_m$ are the arithmetic mean and standard deviation computed over all the folds for a metric $m$. We use the absolute value of the mean to make the results comparable regardless of sign. Alternatively, the ratio $\sigma_m^2/\mu_m^2$ could be used to assign a lower penalty to small deviations. The maximum possible consistency, 1, would be assigned to a model that presents the exact same performance across all folds for all metrics. Section 3 reports the consistency, measured in these terms, for the proposed solution.

2.5. Variance agreement

Different users may have different interests in terms of variety. In the music context, a user may, for example, listen to songs from very few artists, whereas others may be interested in a wider variety of artists. A similar concept may apply to other contexts (e.g. in terms of brand loyalty for products). It is therefore desirable that a recommender system provide a wider variety of recommendations for users that are inclined toward it, and vice versa. We introduce the concept of variance agreement w.r.t. a variable, which quantifies how the variance in the recommendations correlates with each user's interest in variance, as dictated by their previous interactions, in terms of the variable of interest. In this context, we use the artist that produced each song as the variable of interest.

We quantify the variance of a set of songs as the Gini impurity over that set, where each song is mapped to the respective artist. We can thus assign an impurity to any given user, $g_u$, as the impurity of the set of songs they listened to in the training set. For that same user, we can define the model's impurity, $\hat{g}_u$, as the impurity of the set of $k$ songs recommended by the model for that user. If $g_u$ is low, the user listens to a limited set of artists (if 0, the user has only listened to one artist in all of their interactions). Similarly, if $\hat{g}_u$ is low, the model is recommending songs from a limited set of artists.

To measure the agreement between the users' and the model's variance, we compute the Pearson correlation on the paired data $[(g_u, \hat{g}_u) \mid u \in \mathcal{U}]$, with $\mathcal{U}$ being the set of all users. Note that the variance agreement is not concerned with the correctness of the predictions (e.g. whether the artists the user listens to are the same ones being recommended) – that information may be quantified by other metrics concerned with the accuracy of models, rather than their suitability over a heterogeneous set of users.
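The following is a sketch of how the variance agreement can be computed, assuming the per-user artist histories and the artists of the $k$ recommended songs are available as dictionaries keyed by user; names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def gini_impurity(artists):
    """Gini impurity of a collection of songs mapped to their artists."""
    _, counts = np.unique(list(artists), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def variance_agreement(history_artists, recommended_artists):
    """Pearson correlation between each user's impurity g_u (training
    history) and the model's impurity (the k recommended songs)."""
    users = list(history_artists)
    g_user = [gini_impurity(history_artists[u]) for u in users]
    g_model = [gini_impurity(recommended_artists[u]) for u in users]
    r, _ = pearsonr(g_user, g_model)
    return r
```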
3. Experimental results

In this section we present the results obtained in terms of the main metrics identified by [1], as well as some additional considerations on the proposed solution.

The model has been trained and fine-tuned to identify well-performing values for the main hyperparameters. The best configuration found is reported in Table 1.

Table 1: List of the main hyperparameters configured. Hyperparameters are grouped into categories that have been tuned together.

    Parameter       Value
    d               128

    λ1              2.5
    λ2              2.5
    m0 = m1 = m2    0.25

    θ_gender        5
    θ_country       100
    θ_artist        10^4
    θ_song          10^5
    θ_activity      10^4

Table 2 presents the results obtained on the challenge leaderboard for the top 10 entries. The table clearly shows that the proposed solution does not outperform the top achievers in most aspects. Unfortunately, it is currently impossible to compare the techniques adopted by the other participants, which prevents drawing more meaningful conclusions. However, from the results it is clear that the proposed solution fares reasonably well in the partition-based metrics, while it struggles to achieve comparable performance in terms of other metrics (most notably, the "being less wrong" one). It is also interesting to note how the other solutions still have performance trade-offs, since no solution clearly outperforms all others across all metrics. This is due to the multi-faceted nature of the overall score function.

Table 2: Performance in terms of the various metrics adopted in [1] for the top 10 participants (MRR = Mean Reciprocal Rank; the Country, User activity, Track popularity, Artist popularity, and Gender columns report MRED = Miss Rate Equality Difference). Underlined are the solutions that outperform the proposed one in terms of the specific metric. In bold are the best performers for each metric.

    Solution           Score      Hit rate  MRR       Country    User act.  Track pop.  Artist pop.  Gender     Being less wrong  Latent diversity
    team#1             1.702570   0.015484  0.005859  -0.004070  -0.006932  -0.002044   -0.001688    -0.000956  0.424817          -0.121655
    team#2             1.552977   0.016065  0.001727  -0.003727  -0.002913  -0.002307   -0.001047    -0.000692  0.363927          -0.296403
    Proposed solution  1.330388   0.015565  0.001677  -0.004036  -0.003504  -0.004444   -0.000867    -0.000797  0.281863          -0.272944
    team#4             1.184669   0.021619  0.002044  -0.005366  -0.004417  -0.003191   -0.001542    -0.001299  0.320594          -0.317348
    team#5             1.138580   0.018819  0.001071  -0.005213  -0.005174  -0.005043   -0.001234    -0.002774  0.280727          -0.244437
    team#6             1.028222   0.015006  0.001127  -0.005448  -0.007534  -0.005261   -0.001202    -0.003369  0.316226          -0.309870
    team#7             0.752576   0.017874  0.001655  -0.006193  -0.010749  -0.004483   -0.002132    -0.004541  0.322594          -0.324841
    team#8             0.429596   0.018173  0.001632  -0.005482  -0.011190  -0.007588   -0.002513    -0.004329  0.322767          -0.324794
    team#9             -1.154413  0.032359  0.002054  -0.010624  -0.013020  -0.021992   -0.003122    -0.008458  0.295921          -0.330054
    baseline           -1.212213  0.036343  0.003694  -0.009016  -0.022362  -0.011082   -0.007150    -0.006084  0.375774          -0.307977

Despite the efforts made toward reducing the effect of the dataset imbalances on the final model, we still observed that the performance of the model is not always consistent. In other words, there is a relatively high variance in the performance across the various folds. Table 3 shows the overall consistency of the model, as well as the consistency computed for each metric. This provides a very interesting perspective on the strengths and weaknesses of the proposed solution. In particular, the model is highly inconsistent for some of the fairness-oriented metrics, as highlighted by the low consistency obtained for track popularity and gender. While this does not necessarily imply poor performance, it is a symptom that the model may be susceptible to fluctuations in performance as the dataset used for training changes. Other metrics, such as the behavioral and the "standard" ones, show instead a more consistent behavior.

Table 3: Consistency obtained for each metric, as measured across k = 4 folds.

    Metric                    Consistency
    Hit rate                  0.9579
    MRR                       0.8885
    Country (MRED)            0.8057
    User activity (MRED)      0.9312
    Track popularity (MRED)   0.6938
    Artist popularity (MRED)  0.7828
    Gender (MRED)             0.4573
    Being less wrong          0.9949
    Latent diversity          0.9976
    Overall                   0.8344

We additionally evaluated the proposed methodology in terms of variance agreement for the "artist" variable. We achieved an agreement of 0.2479, whereas a random model would achieve ≈ 0. This indicates that the model does take into account, to some extent, each individual user's variance preference. However, there is room for improvement in these terms.

3.1. Ablation study

To understand the effect of the various choices made, we introduce an ablation study where we remove some portions of the proposed methodology. In particular, we study the situations where (1) no user-user loss is considered, (2) no item-item loss is considered, (3) a random initialization is used instead of word2vec, and (4) all training records are weighted the same, regardless of their rarity. Table 4 reports the results obtained, in terms of the final score, for all situations. From this we can observe that all the proposed approaches bring a benefit to the overall result, with the removal of the additional loss terms being the most impactful. It should be noted, however, that this ablation study has been carried out using the hyperparameters that produced the best performance for the "full" approach. As such, the ablated performance may be affected by the lack of hyperparameter fine-tuning, thus possibly resulting in a lower score.

Table 4: Ablation study of the proposed solution. Performance is reported in terms of the overall score adopted for the competition.

    Ablated term                Score
    No user-user loss           -100³
    No item-item loss           0.5292
    No word2vec initialization  0.8554
    No weighting scheme         0.8427
    Full approach               1.3304

³ A score of -100 has been assigned to solutions that did not reach a hit rate of 0.015.
4. Failed approaches

    "I have not failed. I've just found 10,000 ways that won't work."
    Thomas A. Edison

In this section we describe some attempts that have been made, but that have not brought any improvement in terms of performance.

Entity resolution: the list of known songs contains some duplicates. We tried using a naive entity resolution approach (songs with matching artists and matching titles are considered to be the same song). Since this problem affected only a small fraction (a few percent) of the songs, the ER step did not produce any significant improvement and has thus been discarded.

Dataset resampling: we attempted to resample the training set with a weighting scheme similar to the one already used to weigh each training sample based on its uniqueness. Worse performance has been observed as a result of this approach: it can be argued that the reason for this is that the resampling outright removes some samples from the training set (points that are never sampled), whereas using a weight for each row makes sure that all rows are seen during training.

Data augmentation: to increase the breadth of the available data, we tried to synthesize new user-song interactions to be used for training. In particular, we first quantified the proclivity of users to listen to a limited number of artists by means of the Gini impurity (the more homogeneous the choice of artists, the lower the Gini index). We can then sample users based on this factor, and add user-song relationships where the songs are chosen to belong to the most "likely" artists (i.e. the k artists that are most commonly listened to by each sampled user), as sketched below.
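As a sketch of what this augmentation attempt may look like (the paper does not specify the exact sampling scheme; here we assume users with a more homogeneous artist history, i.e. a lower Gini impurity, are sampled more often, and all names are ours):

```python
import numpy as np
from collections import Counter

def augment(listened, song_artist, songs_by_artist, n_new, k=4, rng=None):
    """Synthesize n_new user-song interactions. Users are sampled with a
    probability that grows as their artist Gini impurity shrinks; each
    sampled user receives a song from one of their k most listened artists."""
    rng = rng or np.random.default_rng()
    users = list(listened)
    purity = []
    for u in users:
        counts = np.array(list(Counter(song_artist[s] for s in listened[u]).values()))
        p = counts / counts.sum()
        purity.append(np.sum(p ** 2))  # = 1 - Gini impurity
    probs = np.array(purity) / np.sum(purity)
    new_pairs = []
    for u in rng.choice(users, size=n_new, p=probs):
        top = [a for a, _ in Counter(song_artist[s] for s in listened[u]).most_common(k)]
        candidates = list(set(songs_by_artist[rng.choice(top)]) - listened[u])
        if candidates:
            new_pairs.append((u, rng.choice(candidates)))
    return new_pairs
```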
5. Conclusions

In this paper we presented a possible solution to the EvalRS challenge. The solution uses matrix factorization based on multiple triplet losses combined together to align users and songs in the same space. A weighting scheme has been introduced to assign more importance to uncommon users/items, thus improving the quality of the model in terms of fairness. By introducing the consistency metric, we show some of the main weaknesses of the proposed approach: namely, the fact that it is not consistent w.r.t. some metrics. We consider this to be one of the main problems to be addressed. We additionally covered some of the failed attempts made, in the hope that others will either not make the same mistakes, or figure out how to improve upon them.

Acknowledgments

This work has been supported by the DataBase and Data Mining Group and the SmartData center at Politecnico di Torino.

References

[1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.
[2] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99-104.
[3] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016, pp. 103-110.
[4] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30-37.
[5] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[6] H. Xuan, A. Stylianou, X. Liu, R. Pless, Hard negative examples are hard, but useful, in: European Conference on Computer Vision, Springer, 2020, pp. 126-142.
[7] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[8] J. Weston, S. Bengio, N. Usunier, Large scale image annotation: learning to rank with joint word-image embeddings, Machine Learning 81 (2010) 21-35.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).