Triplet losses-based matrix factorization for robust recommendations

Flavio Giobergia
Department of Control and Computer Engineering, Politecnico di Torino, Turin, Italy
flavio.giobergia@polito.it
EvalRS at CIKM 2022

Abstract
Much like other learning-based models, recommender systems can be affected by biases in the training data. While typical evaluation metrics (e.g. hit rate) are not concerned with them, some categories of final users are heavily affected by these biases. In this work, we propose using multiple triplet loss terms to extract meaningful and robust representations of users and items. We empirically evaluate the soundness of such representations through several "bias-aware" evaluation metrics, as well as in terms of stability to changes in the training set and agreement of the prediction variance with that of each user.

Keywords: recommender systems, matrix factorization, contrastive learning

1. Introduction

Recommender systems are a fundamental part of almost any experience of online users. The possibility of recommending options tailored to each individual user is one of the key contributors to the success of many companies and services. The metrics that are commonly used in the literature to evaluate these models (e.g. hit rate) are typically only concerned with the overall quality of the model, regardless of its behavior on particular partitions of the data. This results in recommender systems typically learning the preferences of the "majority", which in turn implies a poorer quality of recommendations for users/items that belong to the long tail of the distribution. The EvalRS challenge [1] was introduced in an effort to steer the research focus toward addressing this problem. The challenge, based on the RecList framework [2], proposes a recommendation problem with a multi-faceted evaluation, where the quality of any solution is evaluated not only in terms of overall performance, but also based on the results obtained on various partitions of users and items.

In this paper, we present a possible recommender system that addresses the problem proposed by EvalRS. The solution is based on matrix factorization, framing an objective function that aligns users and items in the same embedding space. The matrices are learned by minimizing a loss function that includes multiple triplet loss terms. Differently from what is typically done (i.e. aligning an anchor user to a positive and a negative item), in this work we propose additionally using triplet terms for users and items separately.

The full extent of the challenge is described in detail in [1]. In short, the goal of the challenge is to recommend songs to a set of users, given their previous interactions with other songs. The provided dataset is based on a subset of the openly available LFM-1b dataset [3]. The source code for the proposed solution has been made available on GitHub (https://github.com/fgiobergia/CIKM-evalRS-2022).

2. Methodology

In this section we present the proposed methodology, highlighting the main aspects of interest. No data preprocessing has been applied to the original data, although some approaches have been attempted (see Section 4). The proposed methodology, as explained below, allows ranking all items based on their estimated compatibility with any given user. We produce the final list of $k$ recommendations by stochastically selecting items from the ordered list of songs, weighting each song with the inverse of its position in the list, as sketched below.
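As an illustration, the following is a minimal sketch of this sampling step, assuming the candidate songs have already been sorted by decreasing estimated compatibility; the function name and the use of NumPy are ours, not taken from the paper's repository.

```python
import numpy as np

def sample_recommendations(ranked_song_ids, k, rng=None):
    """Stochastically draw k songs from a list sorted by decreasing
    estimated compatibility, weighting each song by the inverse of
    its (1-based) position in the list."""
    rng = rng or np.random.default_rng()
    weights = 1.0 / np.arange(1, len(ranked_song_ids) + 1)
    return rng.choice(ranked_song_ids, size=k, replace=False,
                      p=weights / weights.sum())
```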
2.1. Loss definition

Matrix factorization techniques have long been known to achieve high performance in various recommendation challenges [4]. This approach consists in aligning vector representations for two separate entities, users and items (songs, in this case). This alignment task is a recurring one: a commonly adopted approach to solving it is the optimization of a triplet loss [5]. A triplet loss requires identifying an anchor point, as well as a positive and a negative point, i.e. points that should respectively lie close to (positive) or far from (negative) the anchor point.

Users and songs can thus be projected to a common embedding space in such a way that users are placed close to songs they like and away from songs they do not like. This can be done by choosing a user as the anchor and two songs as the positive and negative points. A reasonable choice for the positive song is one that has been listened to by the user. The choice of the negative song is trickier. Random songs, or songs not listened to by the user, are possible choices. However, more sophisticated strategies can be adopted to choose negative points that are difficult for the model to separate from the anchor. These are called hard negatives and have been shown in the literature to be beneficial to the training of models [6].

We decided to use a simple policy for the selection of a negative song: a negative song for user $u$ is extracted from the pool of songs that have been listened to by one of the nearest neighbors of $u$ and have not been listened to by $u$. By doing so, we aim to reduce the extent to which the model relies on other users' preferences to make a recommendation. The concept of neighboring users is obtained by comparing the similarity between the embedding representations of all users. Due to the computational cost of this operation, it is only performed at the beginning of each training epoch. A possible implementation of this policy is sketched below.
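The following is a minimal sketch of this negative-sampling policy, assuming the user embeddings are stored row-wise in a NumPy matrix and the listening histories in a mapping of sets; all names, and the choice of 10 neighbors, are illustrative rather than taken from the actual implementation.

```python
import numpy as np

def nearest_neighbors(user_emb, n_neighbors=10):
    """Recompute every user's nearest neighbors by cosine similarity.
    Run once per epoch, since it compares all pairs of users."""
    normed = user_emb / np.linalg.norm(user_emb, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)  # a user is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :n_neighbors]

def sample_negative(u, neighbors, listened, rng):
    """Draw a song listened to by a neighbor of user u but not by u."""
    for v in rng.permutation(neighbors[u]):
        candidates = list(listened[v] - listened[u])
        if candidates:
            return rng.choice(candidates)
    return None  # no candidate found: fall back to e.g. a random unseen song
```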
We can thus define the triplets $(u_i^a, s_i^p, s_i^n)$ to be used for the definition of a triplet loss. Here, $u_i^a$ represents the vector for the anchor user, whereas $s_i^p$ and $s_i^n$ represent the vectors for the positive and negative songs, respectively. Similar approaches where users are aligned to songs they did or did not like are Bayesian Personalized Ranking (BPR) [7], where negative songs are sampled randomly, and WARP [8], where negative items are sampled so as to be "hard" (based on their proximity to the anchor w.r.t. the positive item).

To improve the robustness of the representations built, we are additionally interested in aligning similar songs and similar users. To this end, we introduce two additional triplet terms in the loss function, one based on $(s_i^a, s_i^p, s_i^n)$ and one on $(u_i^a, u_i^p, u_i^n)$. Based on the previously defined concepts, we choose $s_i^a$ as a song listened to by $u_i^a$, and $u_i^p$ and $u_i^n$ as users who respectively listened to $s_i^p$ and $s_i^n$. Other alternatives have been considered, but were ultimately not selected due to their higher computational cost.

We define the final loss as:

$$
\mathcal{L} = \sum_i w_i \Big( \max\{d(u_i^a, s_i^p) - d(u_i^a, s_i^n) + m_0,\, 0\}
+ \lambda_1 \max\{d(u_i^a, u_i^p) - d(u_i^a, u_i^n) + m_1,\, 0\}
+ \lambda_2 \max\{d(s_i^a, s_i^p) - d(s_i^a, s_i^n) + m_2,\, 0\} \Big) \tag{1}
$$

where $d(\cdot)$ is a distance function between any pair of vectors; in this work, the cosine distance is used. $m_j$ is a margin enforced between positive and negative pairs. Since all elements are projected to a common embedding space, we used $m_0 = m_1 = m_2$. Finally, $w_i$ is a weight assigned to each entry, which is discussed in Subsection 2.2.

Figure 1 summarizes the effect of the various terms of the loss on the embedding vectors learned.

[Figure 1: Representation of the action of each part of the loss (the user-song, user-user, and song-song terms acting on the anchor, positive, and negative users and songs) on the vectors. Arrow directions represent whether elements are pulled towards or pushed away from the anchors.]
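For concreteness, a minimal PyTorch sketch of Equation (1) follows; batched embedding tensors are assumed, the function names are ours, and the default values of the margin and of $\lambda_1$, $\lambda_2$ are those later reported in Table 1.

```python
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    # d(x, y) = 1 - cos(x, y), computed row-wise over a batch
    return 1.0 - F.cosine_similarity(a, b, dim=-1)

def full_loss(u_a, s_p, s_n, u_p, u_n, s_a, w,
              m=0.25, lambda1=2.5, lambda2=2.5):
    """Equation (1): weighted sum of the user-song, user-user and
    song-song triplet terms, with a shared margin m = m0 = m1 = m2."""
    l_us = torch.clamp(cosine_distance(u_a, s_p) - cosine_distance(u_a, s_n) + m, min=0)
    l_uu = torch.clamp(cosine_distance(u_a, u_p) - cosine_distance(u_a, u_n) + m, min=0)
    l_ss = torch.clamp(cosine_distance(s_a, s_p) - cosine_distance(s_a, s_n) + m, min=0)
    return (w * (l_us + lambda1 * l_uu + lambda2 * l_ss)).sum()
```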
2.2. Popularity weight

To make minority entities more relevant, we adopted a weighting scheme that modulates the previously described loss so as to weigh rows more if they belong to "rarer" entities and less if they belong to common ones. In accordance with [1], we identified five factors to be taken into account. Based on these, a coefficient has been defined for each entry in the training set. The final weight is given by a weighted average of these coefficients. The following is a list of the factors, along with the way the respective coefficients have been computed (logarithms are used for factors that follow a power-law distribution). All coefficients are normalized to sum to 1 across the respective population.

• Gender ($\theta_{gender}$): in accordance with the original dataset, a relevance coefficient is provided for the categories male, female, and undisclosed². The coefficient is proportional to the inverse of the occurrences of each gender in the list of known users.
• Country ($\theta_{country}$): the coefficient related to the country is calculated as the inverse of the logarithm of the number of occurrences of the specific country of the users in the training set.
• Artist popularity ($\theta_{artist}$): a proxy for the popularity of an artist is given by the number of times songs by that artist have been played in the training set. The inverse logarithm of this quantity is used as the coefficient.
• Song popularity ($\theta_{song}$): a proxy for the popularity of a song is given by the number of times that song has been played in the training set. The inverse logarithm of this quantity is used as the coefficient.
• User activity ($\theta_{activity}$): the overall activity of a user can be quantified in terms of the number of songs that they have listened to across the training set. The inverse logarithm of this quantity is used as the coefficient.

² This simplified perspective on gender does not reflect that of the author.

The weighted sum of the above-mentioned coefficients constitutes the weight $w_i$ in Equation 1. The weights used for each coefficient have been searched as a part of the tuning of the model and are presented in Section 3.

2.3. Model initialization

The initial values assigned to the users' and items' vectors greatly affect the entire learning process. A good initialization can make the convergence process faster and/or allow reaching a better minimum. We used initial vectors for users and items based on an adaptation of the word2vec algorithm [9]. We built a corpus of sentences, one for each known song, composed of the users who listened to that song, along with its artist and album, all in the form token-type=token-value (e.g. song=1234). We then trained word2vec to learn representations for all of the tokens involved, and used the vectors obtained for the user and song tokens as initial vectors.

Word2vec places tokens close in the embedding space based on their adoption in similar contexts. For this reason, based on the definition of the sentences, this approach already brings close together users with similar tastes (in terms of songs, artists, and albums), as well as similar songs (in terms of the users that listened to them, the artists that produced them, and the albums they are found in).
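A minimal sketch of this initialization follows, assuming the gensim implementation of word2vec and a toy catalog; the data layout and variable names are ours, and the word2vec hyperparameters other than the vector size are illustrative defaults.

```python
from gensim.models import Word2Vec

# Toy catalog: song id -> (artist id, album id, ids of listeners).
catalog = {
    1234: ("a1", "al1", [10, 11]),
    5678: ("a1", "al2", [11, 12]),
}

# One "sentence" per known song: the users who listened to it, plus its
# artist and album, all encoded as token-type=token-value strings.
sentences = [
    [f"song={sid}", f"artist={artist}", f"album={album}"]
    + [f"user={u}" for u in listeners]
    for sid, (artist, album, listeners) in catalog.items()
]

# vector_size matches the embedding dimension d = 128 of Table 1.
w2v = Word2Vec(sentences, vector_size=128, window=5, min_count=1)

# The learned token vectors serve as initial user/song embeddings.
user_init = {u: w2v.wv[f"user={u}"] for u in (10, 11, 12)}
song_init = {s: w2v.wv[f"song={s}"] for s in catalog}
```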
2.4. Model consistency

As we will discuss in Section 3, we empirically observed that the proposed solution presents high variance in the performance obtained across the various folds. While this is not directly measured as a part of the core metrics of EvalRS, we still believe it is important to account for this aspect in a well-rounded evaluation.

We therefore introduce the consistency metric, which quantifies the variance of the model when tested across multiple folds, or datasets. A higher variance in performance would be associated with a lower consistency (or a higher inconsistency). For a single metric, the inconsistency could be defined as the variance of the metric across the folds. However, when multiple metrics are involved (as is the case with this competition), a normalization step should be introduced. We thus instead use the coefficient of variation, defined as the standard deviation divided by the mean value, to quantify the inconsistency of a model with respect to a metric $m$. We compute the consistency for a metric as 1 - inconsistency. The overall consistency is therefore computed as the mean consistency across all metrics:

$$
c = \frac{1}{|M|} \sum_{m \in M} \left( 1 - \frac{\sigma_m}{|\mu_m|} \right) \tag{2}
$$

where $M$ represents the set of all metrics used, while $\mu_m$ and $\sigma_m$ are the arithmetic mean and standard deviation computed over all the folds for a metric $m$. We use the absolute value of the mean to make the results comparable regardless of sign. Alternatively, the ratio $\sigma_m^2/\mu_m^2$ could be used to assign a lower penalty to small deviations. The maximum possible consistency, 1, would be assigned to a model that presents the exact same performance across all folds for all metrics. Section 3 reports the consistency, measured in these terms, for the proposed solution.

2.5. Variance agreement

Different users may have different interests in terms of variety. In the music context, a user may, for example, listen to songs from very few artists, whereas others may be interested in a wider variety of artists. A similar concept may apply to other contexts (e.g. in terms of brand loyalty for products). It is therefore desirable that a recommender system provide a wider variety of recommendations for users that are inclined toward it, and vice versa. We introduce the concept of variance agreement w.r.t. a variable, which quantifies how the variance in the recommendations correlates with each user's interest in variance, as dictated by their previous interactions, in terms of the variable of interest. In this context, we use the artist that produced each song as the variable of interest.

We quantify the variance of a set of songs as the Gini impurity over that set, where each song is mapped to the respective artist. We can thus assign an impurity to any given user, $g_u$, as the impurity of the set of songs they listened to in the training set. For that same user, we can define the model's impurity, $\hat{g}_u$, as the impurity of the set of $k$ songs recommended by the model for that user. If $g_u$ is low, the user listens to a limited set of artists (if 0, the user has only listened to one artist in all of their interactions). Similarly, if $\hat{g}_u$ is low, the model is recommending songs from a limited set of artists.

To measure the agreement between the users' and the model's variance, we compute the Pearson correlation on the paired data $[(g_u, \hat{g}_u) \mid u \in \mathcal{U}]$, with $\mathcal{U}$ being the set of all users. Note that the variance agreement is not concerned with the correctness of the predictions (e.g. whether the artists the user listens to are the same ones being recommended) – that information may be quantified by other metrics concerned with the accuracy of models, rather than their suitability over a heterogeneous set of users.
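The following is a sketch of how the variance agreement can be computed, assuming the per-user artist histories and the artists of the $k$ recommended songs are available as dictionaries keyed by user; names are illustrative.

```python
import numpy as np
from scipy.stats import pearsonr

def gini_impurity(artists):
    """Gini impurity of a collection of songs mapped to their artists."""
    _, counts = np.unique(list(artists), return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def variance_agreement(history_artists, recommended_artists):
    """Pearson correlation between each user's impurity g_u (training
    history) and the model's impurity (the k recommended songs)."""
    users = list(history_artists)
    g_user = [gini_impurity(history_artists[u]) for u in users]
    g_model = [gini_impurity(recommended_artists[u]) for u in users]
    r, _ = pearsonr(g_user, g_model)
    return r
```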
3. Experimental results

In this section we present the results obtained in terms of the main metrics identified by [1], as well as some additional considerations on the proposed solution.

The model has been trained and fine-tuned to identify well-performing values for the main hyperparameters. The best configuration found is reported in Table 1.

Table 1: List of the main hyperparameters configured. Hyperparameters are grouped into categories that have been tuned together.

    Parameter       Value
    d               128

    λ1              2.5
    λ2              2.5
    m0 = m1 = m2    0.25

    θ_gender        5
    θ_country       100
    θ_artist        10^4
    θ_song          10^5
    θ_activity      10^4

Table 2 presents the results obtained on the challenge leaderboard for the top 10 entries. The table clearly shows that the proposed solution does not outperform the top achievers in most aspects. Unfortunately, it is currently impossible to compare the techniques adopted by the other participants, which prevents drawing more meaningful conclusions. However, from the results it is clear that the proposed solution fares reasonably well in the partition-based metrics, while it struggles to achieve comparable performance in terms of other metrics (most notably, the "being less wrong" one). It is also interesting to note how the other solutions still have performance trade-offs, since no solution clearly outperforms all others across all metrics. This is due to the multi-faceted nature of the overall score function.

Table 2: Performance in terms of the various metrics adopted in [1] for the top 10 participants (MRR = Mean Reciprocal Rank; the Country, User activity, Track popularity, Artist popularity, and Gender columns report MRED = Miss Rate Equality Difference). Underlined are the solutions that outperform the proposed one in terms of the specific metric. In bold are the best performers for each metric.

    Solution           Score      Hit rate  MRR       Country    User act.  Track pop.  Artist pop.  Gender     Being less wrong  Latent diversity
    team#1             1.702570   0.015484  0.005859  -0.004070  -0.006932  -0.002044   -0.001688    -0.000956  0.424817          -0.121655
    team#2             1.552977   0.016065  0.001727  -0.003727  -0.002913  -0.002307   -0.001047    -0.000692  0.363927          -0.296403
    Proposed solution  1.330388   0.015565  0.001677  -0.004036  -0.003504  -0.004444   -0.000867    -0.000797  0.281863          -0.272944
    team#4             1.184669   0.021619  0.002044  -0.005366  -0.004417  -0.003191   -0.001542    -0.001299  0.320594          -0.317348
    team#5             1.138580   0.018819  0.001071  -0.005213  -0.005174  -0.005043   -0.001234    -0.002774  0.280727          -0.244437
    team#6             1.028222   0.015006  0.001127  -0.005448  -0.007534  -0.005261   -0.001202    -0.003369  0.316226          -0.309870
    team#7             0.752576   0.017874  0.001655  -0.006193  -0.010749  -0.004483   -0.002132    -0.004541  0.322594          -0.324841
    team#8             0.429596   0.018173  0.001632  -0.005482  -0.011190  -0.007588   -0.002513    -0.004329  0.322767          -0.324794
    team#9             -1.154413  0.032359  0.002054  -0.010624  -0.013020  -0.021992   -0.003122    -0.008458  0.295921          -0.330054
    baseline           -1.212213  0.036343  0.003694  -0.009016  -0.022362  -0.011082   -0.007150    -0.006084  0.375774          -0.307977

Despite the efforts made toward reducing the effect of the dataset imbalances on the final model, we still observed that the performance of the model is not always consistent. In other words, there is a relatively high variance in the performance across the various folds. Table 3 shows the overall consistency of the model, as well as the consistency computed for each metric. This provides a very interesting perspective on the strengths and weaknesses of the proposed solution. In particular, the model is highly inconsistent for some of the fairness-oriented metrics, as highlighted by the low consistency obtained for track popularity and gender. While this does not necessarily imply poor performance, it is a symptom that the model may be susceptible to fluctuations in performance as the dataset used for training changes. Other metrics, such as the behavioral and the "standard" ones, show instead a more consistent behavior.

Table 3: Consistency obtained for each metric, as measured across k = 4 folds.

    Metric                    Consistency
    Hit rate                  0.9579
    MRR                       0.8885
    Country (MRED)            0.8057
    User activity (MRED)      0.9312
    Track popularity (MRED)   0.6938
    Artist popularity (MRED)  0.7828
    Gender (MRED)             0.4573
    Being less wrong          0.9949
    Latent diversity          0.9976
    Overall                   0.8344

We additionally evaluated the proposed methodology in terms of variance agreement for the "artist" variable. We achieved an agreement of 0.2479, whereas a random model would achieve ≈ 0. This indicates that the model does take into account, to some extent, each individual user's variance preference. However, there is room for improvement in these terms.

3.1. Ablation study

To understand the effect of the various choices made, we introduce an ablation study where we remove some portions of the proposed methodology. In particular, we study the situations where (1) no user-user loss is considered, (2) no item-item loss is considered, (3) a random initialization is used instead of word2vec, and (4) all training records are weighted the same, regardless of their rarity. Table 4 reports the results obtained, in terms of the final score, for all situations. From this we can observe that all the proposed approaches bring a benefit to the overall result, with the removal of the additional loss terms being the most impactful. It should be noted, however, that this ablation study has been carried out using the hyperparameters that produced the best performance for the "full" approach. As such, the ablated performance may be affected by the lack of hyperparameter fine-tuning, thus possibly resulting in a lower score.

Table 4: Ablation study of the proposed solution. Performance is reported in terms of the overall score adopted for the competition.

    Ablated term                Score
    No user-user loss           -100³
    No item-item loss           0.5292
    No word2vec initialization  0.8554
    No weighting scheme         0.8427
    Full approach               1.3304

³ A score of -100 has been assigned to solutions that did not reach a hit rate of 0.015.
4. Failed approaches

    "I have not failed. I've just found 10,000 ways that won't work."
    Thomas A. Edison

In this section we describe some attempts that have been made, but that have not brought any improvement in terms of performance.

Entity resolution: the list of known songs contains some duplicates. We tried using a naive entity resolution approach (songs with matching artists and matching titles are considered to be the same song). Since this problem affected only a small fraction (a few percent) of the songs, the ER step did not produce any significant improvement and has thus been discarded.

Dataset resampling: we attempted to resample the training set with a weighting scheme similar to the one already used to weigh each training sample based on its uniqueness. Worse performance has been observed as a result of this approach: it can be argued that the reason for this is that the resampling outright removes some samples from the training set (points that are never sampled), whereas using a weight for each row makes sure that all rows are seen during training.

Data augmentation: to increase the breadth of the available data, we tried to synthesize new user-song interactions to be used for training. In particular, we first quantified the proclivity of users to listen to a limited number of artists by means of the Gini impurity (the more homogeneous the choice of artists, the lower the Gini index). We can then sample users based on this factor, and add user-song relationships where the songs are chosen to belong to the most "likely" artists (i.e. the k artists that are most commonly listened to by each sampled user), as sketched below.
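As a sketch of what this augmentation attempt may look like (the paper does not specify the exact sampling scheme; here we assume users with a more homogeneous artist history, i.e. a lower Gini impurity, are sampled more often, and all names are ours):

```python
import numpy as np
from collections import Counter

def augment(listened, song_artist, songs_by_artist, n_new, k=4, rng=None):
    """Synthesize n_new user-song interactions. Users are sampled with a
    probability that grows as their artist Gini impurity shrinks; each
    sampled user receives a song from one of their k most listened artists."""
    rng = rng or np.random.default_rng()
    users = list(listened)
    purity = []
    for u in users:
        counts = np.array(list(Counter(song_artist[s] for s in listened[u]).values()))
        p = counts / counts.sum()
        purity.append(np.sum(p ** 2))  # = 1 - Gini impurity
    probs = np.array(purity) / np.sum(purity)
    new_pairs = []
    for u in rng.choice(users, size=n_new, p=probs):
        top = [a for a, _ in Counter(song_artist[s] for s in listened[u]).most_common(k)]
        candidates = list(set(songs_by_artist[rng.choice(top)]) - listened[u])
        if candidates:
            new_pairs.append((u, rng.choice(candidates)))
    return new_pairs
```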
5. Conclusions

In this paper we presented a possible solution to the EvalRS challenge. The solution uses matrix factorization based on multiple triplet losses combined together to align users and songs in the same space. A weighting scheme has been introduced to assign more importance to uncommon users/items, thus improving the quality of the model in terms of fairness. By introducing the consistency metric, we show some of the main weaknesses of the proposed approach: namely, the fact that it is not consistent w.r.t. some metrics. We consider this to be one of the main problems to be addressed. We additionally covered some of the failed attempts made, in the hope that others will either not make the same mistakes, or figure out how to improve upon them.

Acknowledgments

This work has been supported by the DataBase and Data Mining Group and the SmartData center at Politecnico di Torino.

References

[1] J. Tagliabue, F. Bianchi, T. Schnabel, G. Attanasio, C. Greco, G. d. S. P. Moreira, P. J. Chia, EvalRS: a rounded evaluation of recommender systems, 2022. URL: https://arxiv.org/abs/2207.05772. doi:10.48550/ARXIV.2207.05772.
[2] P. J. Chia, J. Tagliabue, F. Bianchi, C. He, B. Ko, Beyond NDCG: behavioral testing of recommender systems with RecList, in: Companion Proceedings of the Web Conference 2022, 2022, pp. 99-104.
[3] M. Schedl, The LFM-1b dataset for music retrieval and recommendation, in: Proceedings of the 2016 ACM International Conference on Multimedia Retrieval, 2016, pp. 103-110.
[4] Y. Koren, R. Bell, C. Volinsky, Matrix factorization techniques for recommender systems, Computer 42 (2009) 30-37.
[5] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815-823.
[6] H. Xuan, A. Stylianou, X. Liu, R. Pless, Hard negative examples are hard, but useful, in: European Conference on Computer Vision, Springer, 2020, pp. 126-142.
[7] S. Rendle, C. Freudenthaler, Z. Gantner, L. Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, arXiv preprint arXiv:1205.2618 (2012).
[8] J. Weston, S. Bengio, N. Usunier, Large scale image annotation: learning to rank with joint word-image embeddings, Machine Learning 81 (2010) 21-35.
[9] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).