
Logic Tensor Networks for Top-N Recommendation

Tommaso Carraro

tcarraro@fbk.eu 0 1

Alessandro Daniele

daniele@fbk.eu 0

Fabio Aiolli

aiolli@math.unipd.it 1

Luciano Serafini

serafini@fbk.eu 0

0 Data and Knowledge Management, Fondazione Bruno Kessler, Via Sommarive, 18, 38123 Povo, Italy

1 Department of Mathematics, University of Padova, Via Trieste, 63, 35131 Padova, Italy

2022

Despite being studied for more than twenty years, state-of-the-art recommendation systems still suffer from important drawbacks which limit their usage in real-world scenarios. Among the well-known issues of recommender systems are data sparsity and the cold-start problem. These limitations can be addressed by providing some background knowledge to the model to compensate for the scarcity of data. Following this intuition, we propose to use Logic Tensor Networks (LTNs) to tackle the top-n item recommendation problem. In particular, we show how LTNs can be used to easily and effectively inject commonsense recommendation knowledge inside a recommender system. We evaluate our method on MindReader, a knowledge graph-based movie recommendation dataset containing plentiful side information. In particular, we perform an experiment to show how the benefits of the knowledge increase with the sparsity of the dataset. Finally, a comparison with a standard Matrix Factorization approach reveals that our model is able to reach and, in many cases, outperform state-of-the-art performance.

recommender systems, top-n recommendation, logic tensor networks, neural-symbolic integration

1. Introduction

Factorization [14] and Factorization Machines [15, 16] have been proposed recently. These models make it possible to effectively extend the user-item matrix by adding new dimensions containing content (e.g., movie genres, demographic information) and/or contextual side information (e.g., location, time). Though these techniques have been shown to improve recommendation performance, they are usually specifically designed for one type of side information (e.g., the user or item content) and lack explainability [17, 18]. Novel recommendation datasets (e.g., [19]) provide manifold side information (e.g., ratings on movie genres, actors, directors), and hence models which can exploit all the available information are required.

Neural-Symbolic Integration (NeSy) [20] and Statistical Relational Learning (SRL) [21] represent good candidates to incorporate knowledge with learning. These two branches of Artificial Intelligence study approaches for the integration of some form of prior knowledge, usually expressed through First-Order Logic (FOL), with statistical models. The integration has been shown to be beneficial in addressing data scarcity [22].

In this paper, we propose to use a Logic Tensor Network (LTN) [23] to inject commonsense knowledge into a standard Matrix Factorization model for the top-n item recommendation task. LTN is a NeSy framework that allows using logical formulas to instruct the learning of a neural model. We propose to use the MindReader dataset [19] to test our model. This dataset includes a variety of information, such as users' tastes across movie genres, actors, and directors. In this work, we show how LTN can naturally and effectively exploit all this information to improve the generalization capabilities of the MF model. In addition, an experiment that drastically reduces the density of the training ratings reveals that our model can effectively mitigate the sparsity of data, outperforming the standard MF model, especially in the most challenging scenarios.

2. Related works

The integration of logical reasoning and learning in RSs is still in its early stages. Among the NeSy approaches for RSs, the most prominent is NCR [24]. In this work, the recommendation problem is formalized into a logical reasoning problem. In particular, the user's ratings are represented using logical variables; logical operators are then used to construct formulas that express facts about them. Afterward, NCR maps the variables to logical embeddings and the operators to neural networks which act on those embeddings. By doing so, each logical expression can be equivalently organized as a neural network, so that logical reasoning and prediction can be conducted in a continuous space. In [25], the idea of NCR is applied to knowledge graphs for RSs, while [26] uses a NeSy approach to tackle the explainability of RSs.

The seminal approach that successfully applied SRL to RSs is HyPER [27], which is based on Probabilistic Soft Logic (PSL) [28]. In particular, HyPER exploits the expressiveness of FOL to encode knowledge from a wide range of information sources, such as multiple user and item similarity measures, content, and social information. Then, Hinge-Loss Markov Random Fields are used to learn how to balance the different information types. HyPER is highly related to our work since the logical formulas that we use resemble the ones used in HyPER. After HyPER, other SRL approaches have been proposed for RSs [29, 30].

3. Background

3.1. Notation

This section provides useful notation and terminology used in the remainder of the paper. Bold notation is used to differentiate between vectors, e.g., $\mathbf{x} = [3.2, 2.1]$, and scalars, e.g., $x = 5$. Matrices and tensors are denoted with upper case bold notation, e.g., $\mathbf{X}$. Then, $\mathbf{X}_i$ is used to denote the $i$-th row of $\mathbf{X}$, while $\mathbf{X}_{i,j}$ denotes the entry at row $i$ and column $j$. We refer to the set of users of a RS with $\mathcal{U}$, where $|\mathcal{U}| = n$. Similarly, the set of items is referred to as $\mathcal{I}$ such that $|\mathcal{I}| = m$. We use $\mathcal{D}$ to denote a dataset. $\mathcal{D}$ is defined as a set of $N$ triples $\mathcal{D} = \{(u, i, r)^{(j)}\}_{j=1}^{N}$, where $u \in \mathcal{U}$, $i \in \mathcal{I}$, and $r \in \mathbb{N}$ is a rating. We assume that a user $u$ cannot give more than one rating to an item $i$, namely $\nexists\, r_1, r_2 \in \mathbb{N}, r_1 \neq r_2 : \{(u, i, r_1)\} \cup \{(u, i, r_2)\} \subseteq \mathcal{D}$. $\mathcal{D}$ can be reorganized in the so-called user-item matrix $\mathbf{R} \in \mathbb{N}^{n \times m}$, where users are on the rows and items on the columns, such that $\mathbf{R}_{u,i} = r$ if $(u, i, r) \in \mathcal{D}$, 0 otherwise.

3.2. Matrix Factorization

Matrix Factorization (MF) is a Latent Factor Model that aims at factorizing the user-item matrix $\mathbf{R}$ into the product of two lower-dimensional rectangular matrices, denoted as $\mathbf{U}$ and $\mathbf{I}$. $\mathbf{U} \in \mathbb{R}^{n \times k}$ and $\mathbf{I} \in \mathbb{R}^{m \times k}$ are matrices containing the users' and items' latent factors, respectively, where $k$ is the number of latent factors. The objective of MF is to find $\mathbf{U}$ and $\mathbf{I}$ such that $\mathbf{R} \approx \mathbf{U} \cdot \mathbf{I}^\top$. An effective way to learn the latent factors is by using gradient-descent optimization. Given the dataset $\mathcal{D}$, a MF model seeks to minimize the following loss function:

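$$L(\theta) = \frac{1}{N} \sum_{(u,i,r) \in \mathcal{D}} \| r - \tilde{r} \|^2 + \lambda \|\theta\|^2 \qquad (1)$$

where $\tilde{r} = \mathbf{U}_u \cdot \mathbf{I}_i^\top$ and $\theta = \{\mathbf{U}, \mathbf{I}\}$. The first term of Equation (1) is the Mean Squared Error (MSE) between the predicted and target ratings, computed over the $N = |\mathcal{D}|$ training triples, while the second one is an $L_2$ regularization term. $\lambda$ is a hyper-parameter that sets the strength of the regularization.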
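As a rough illustration of Equation (1), the following sketch evaluates the MF loss on a tiny hand-made example in plain Python. The function names (`dot`, `mf_loss`) and the toy factors are assumptions made for illustration only and are not taken from the authors' code.

```python
# Minimal sketch of the MF objective: MSE over observed triples plus an L2 penalty.
# All names and values are illustrative assumptions.

def dot(a, b):
    """Dot product between two latent factor vectors."""
    return sum(x * y for x, y in zip(a, b))

def mf_loss(U, I, triples, lam):
    """Mean squared error on (u, i, r) triples plus lam * ||theta||^2."""
    mse = sum((r - dot(U[u], I[i])) ** 2 for u, i, r in triples) / len(triples)
    l2 = sum(x * x for row in U + I for x in row)  # squared norm of all factors
    return mse + lam * l2

U = [[1.0, 0.0], [0.0, 1.0]]              # 2 users, k = 2
I = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 items, k = 2
D = [(0, 0, 1), (1, 1, 1), (0, 2, 0)]     # (user, item, rating) triples

loss = mf_loss(U, I, D, lam=0.001)
print(loss)
```

Only the third triple is mispredicted here (0.5 instead of 0), so the MSE term is 0.25 / 3, and the penalty term is 0.001 times the squared norm 4.5.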

3.3. Logic Tensor Networks

Logic Tensor Networks [23] (LTNs) are a Neural-Symbolic framework that enables effective integration of deep learning and logical reasoning. It allows defining a knowledge base composed of a set of logical axioms and using them as the objective of a neural model. To define the knowledge base, LTN uses a specific first-order language, called Real Logic, which forms the basis of the framework. It is fully differentiable and has a concrete semantics that allows mapping every symbolic expression into the domain of real numbers. Thanks to Real Logic, LTN can convert logical formulas into computational graphs that enable gradient-based optimization based on fuzzy logic semantics.

Real Logic is defined on a first-order language $\mathcal{L}$ with a signature that contains a set $\mathcal{C}$ of constant symbols, a set $\mathcal{X}$ of variable symbols, a set $\mathcal{F}$ of functional symbols, and a set $\mathcal{P}$ of predicate symbols. A term is constructed recursively from constants, variables, and functional symbols. An expression formed by applying a predicate symbol to some term(s) is called an atomic formula. Complex formulas are constructed recursively using connectives (i.e., $\neg, \wedge, \vee, \implies, \leftrightarrow$) and quantifiers (i.e., $\forall, \exists$).

To emphasize the fact that symbols are grounded onto real-valued features, we use the term grounding¹, denoted by $\mathcal{G}$. In particular, each individual is grounded as a tensor of real features, functions as real functions, and predicates as real functions that specifically project onto a value in the interval $[0, 1]$. A variable $x$ is grounded to a sequence of $n_x$ individuals from a domain, with $n_x \in \mathbb{N}^+$, $n_x > 0$. As a consequence, a term $t(x)$ or a formula $\mathrm{P}(x)$, constructed recursively with a free variable $x$, will be grounded to a sequence of $n_x$ values too. Afterward, connectives are grounded using fuzzy semantics, while quantifiers are grounded using special aggregation functions. In this paper, we use the product configuration, which is better suited for gradient-based optimization [31]. Specifically, conjunctions are grounded using the product t-norm $T_{prod}$, negations using the standard fuzzy negation $N_S$, implications using the Reichenbach implication $I_R$, and the universal quantifier using the generalized mean w.r.t. the error values $ME_p$. The other connectives and quantifiers are not used in this paper, hence not reported.

Connective operators are applied element-wise to the tensors in input, while aggregators aggregate the dimension of the tensor in input that corresponds to the quantified variable. Real Logic also provides a special type of quantification, called diagonal quantification, denoted as $\mathrm{Diag}(x_1, \dots, x_n)$. It applies only to variables that have the same number of individuals (i.e., $n_{x_1} = n_{x_2} = \dots = n_{x_n}$) and allows quantifying over specific tuples of individuals, such that the $i$-th tuple contains the $i$-th individual of each of the variables in the argument of Diag. An intuition about how these operations work in practice is given in Appendix D.

Given a Real Logic knowledge base $\mathcal{K} = \{\phi_1, \dots, \phi_n\}$, where $\phi_1, \dots, \phi_n$ are closed formulas, LTN allows learning the grounding of constants, functions, and predicates appearing in them. In particular, if constants are grounded as embeddings, and functions/predicates onto neural networks, their grounding $\mathcal{G}$ depends on some learnable parameters $\theta$. We denote a parametric grounding as $\mathcal{G}(\cdot \mid \theta)$. In LTN, the learning of parametric groundings is obtained by finding parameters $\theta^*$ that maximize the satisfaction of $\mathcal{K}$:

$$\theta^* = \operatorname{argmax}_{\theta} \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}(\phi \mid \theta) \qquad (2)$$

where $\operatorname{SatAgg} : [0, 1]^* \mapsto [0, 1]$ is a formula aggregating operator, often defined using $ME_p$.

Because Real Logic grounds expressions in real and continuous domains, LTN attaches gradients to every sub-expression and consequently learns through gradient-descent optimization.

¹ Notice that this is different from the common use of the term grounding in logic, which indicates the operation of replacing the variables of a term or formula with constants or terms containing no variables. To avoid confusion, we use the synonym instantiation for this purpose.
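To make the product configuration more concrete, the sketch below writes the four operators named above as plain Python functions on truth values in [0, 1]. The function names are hypothetical, and the formulas follow the standard definitions of these fuzzy operators rather than the numerically stable variants used by the LTNtorch library.

```python
# Sketch of the fuzzy operators of the product configuration (scalar version).
# Function names are illustrative assumptions.

def t_prod(a, b):
    """Product t-norm: truth value of a conjunction."""
    return a * b

def n_s(a):
    """Standard fuzzy negation."""
    return 1.0 - a

def i_r(a, b):
    """Reichenbach implication: truth value of a => b."""
    return 1.0 - a + a * b

def me_p(values, p=2):
    """Generalized mean w.r.t. the error, used to aggregate a universal quantifier."""
    mean_error = sum((1.0 - v) ** p for v in values) / len(values)
    return 1.0 - mean_error ** (1.0 / p)

print(t_prod(0.5, 0.4), n_s(0.25), i_r(1.0, 0.0), me_p([1.0, 0.0]))
```

All four functions are differentiable in their inputs, which is the property that lets a formula built from them act as a training objective.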

4. Method

Our approach uses a Logic Tensor Network to train a basic Matrix Factorization (MF) model for the top-n item recommendation task. The LTN is trained using a Real Logic knowledge base containing commonsense knowledge facts about the movie recommendation domain. This section formalizes the knowledge base used by our model, how the symbols appearing in it are grounded in the real field, and how the learning of the LTN takes place.

4.1. Knowledge base

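The Real Logic knowledge base that our model seeks to maximally satisfy is composed of the following axioms.

$$\phi_1 : \forall\, \mathrm{Diag}(\mathit{user}, \mathit{movie}, \mathit{rating})\, \big(\mathrm{Sim}(\mathrm{Likes}(\mathit{user}, \mathit{movie}), \mathit{rating})\big) \qquad (3)$$

$$\phi_2 : \forall\, (\mathit{user}, \mathit{movie}, \mathit{genre})\, \big(\neg \mathrm{LikesGenre}(\mathit{user}, \mathit{genre}) \wedge \mathrm{HasGenre}(\mathit{movie}, \mathit{genre}) \implies \mathrm{Sim}(\mathrm{Likes}(\mathit{user}, \mathit{movie}), \mathit{rating}_-)\big) \qquad (4)$$

where $\mathit{user}$, $\mathit{movie}$, $\mathit{rating}$, and $\mathit{genre}$ are variable symbols to denote the users of the system, the items of the system, the ratings given by the users to the items, and the genres of the movies, respectively. $\mathit{rating}_-$ is a constant symbol denoting the negative rating. $\mathrm{Likes}(u, i)$ is a functional symbol returning the prediction for the rating given by user $u$ to movie $i$. $\mathrm{Sim}(r_1, r_2)$ is a predicate symbol measuring the similarity between two ratings, $r_1$ and $r_2$. $\mathrm{LikesGenre}(u, g)$ is a predicate symbol denoting whether the user $u$ likes the genre $g$. $\mathrm{HasGenre}(i, g)$ is a predicate symbol denoting whether the movie $i$ belongs to the genre $g$.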

Notice the use of the diagonal quantification on Axiom (3). When $\mathit{user}$, $\mathit{movie}$, and $\mathit{rating}$ are grounded with three sequences of values, the $j$-th value of each variable matches with the $j$-th values of the other variables. This is useful in this case since the dataset $\mathcal{D}$ comes as a set of triples. Diagonal quantification allows forcing the satisfaction of Axiom (3) for these triples only, rather than any combination of users, items, and ratings in $\mathcal{D}$.

4.2. Grounding of the knowledge base

The grounding defines how the symbols of the language are mapped onto the real field, and hence how they can be used to construct the architecture of the LTN. In particular, given $\mathcal{D} = \{(u, i, r)^{(j)}\}_{j=1}^{N}$:

• $\mathcal{G}(\mathit{user}) = \langle u^{(j)} \rangle_{j=1}^{N}$, namely $\mathit{user}$ is grounded as a sequence of the user indexes in $\mathcal{D}$;
• $\mathcal{G}(\mathit{movie}) = \langle i^{(j)} \rangle_{j=1}^{N}$, namely $\mathit{movie}$ is grounded as a sequence of the movie indexes in $\mathcal{D}$;
• $\mathcal{G}(\mathit{rating}) = \langle r^{(j)} \rangle_{j=1}^{N}$ with $r^{(j)} \in \{0, 1\}\ \forall j$, namely $\mathit{rating}$ is grounded as a sequence of the ratings in $\mathcal{D}$, where 0 denotes a negative rating and 1 a positive one;
• $\mathcal{G}(\mathit{rating}_-) = 0$, namely $\mathit{rating}_-$ is grounded as the negative rating;
• $\mathcal{G}(\mathit{genre}) = \langle 1, \dots, n_g \rangle$, namely $\mathit{genre}$ is grounded as a sequence of $n_g$ genre indexes, where $n_g$ is the number of genres appearing in the movies of $\mathcal{D}$;
• $\mathcal{G}(\mathrm{Likes} \mid \mathbf{U}, \mathbf{I}) : u, i \mapsto \mathbf{U}_u \cdot \mathbf{I}_i^\top$, namely Likes is grounded onto a function that takes as input a user index $u$ and a movie index $i$ and returns the prediction of the MF model for the user at index $u$ and the movie at index $i$, where $\mathbf{U} \in \mathbb{R}^{n \times k}$ and $\mathbf{I} \in \mathbb{R}^{m \times k}$ are the matrices of the users' and items' latent factors, respectively;
• $\mathcal{G}(\mathrm{LikesGenre}) : u, g \mapsto \{0, 1\}$, namely LikesGenre is grounded onto a function that takes as input a user index $u$ and a genre index $g$ and returns 1 if the user $u$ likes the genre $g$ in the dataset, 0 otherwise;
• $\mathcal{G}(\mathrm{HasGenre}) : i, g \mapsto \{0, 1\}$, namely HasGenre is grounded onto a function that takes as input a movie index $i$ and a genre index $g$ and returns 1 if the movie $i$ belongs to genre $g$ in the dataset, 0 otherwise;
• $\mathcal{G}(\mathrm{Sim}) : \tilde{r}, r \mapsto \exp(-\alpha \| \tilde{r} - r \|^2)$, namely Sim is grounded onto a function that computes the similarity between a predicted rating $\tilde{r}$ and a target rating $r$.

The use of the exponential allows treating Sim as a predicate since the output is restricted to the interval $[0, 1]$. The square is used to give more penalty to larger errors in the optimization. $\alpha$ is a hyper-parameter that changes the smoothness of the function.

Intuitively, Axiom (3) states that for each user-movie-rating triple in the dataset $\mathcal{D} = \{(u, i, r)^{(j)}\}_{j=1}^{N}$, the prediction computed by the MF model for the user $u$ and movie $i$ should be similar to the target rating $r$ provided by the user $u$ for the movie $i$. Instead, Axiom (4) states that for each possible combination of users, movies, and genres, taken from the dataset, if the user $u$ does not like a genre of the movie $i$, then the prediction computed by the MF model for the user $u$ and movie $i$ should be similar to the negative rating $\mathit{rating}_-$, namely the user $u$ should not like the movie $i$. By forcing the satisfaction of Axiom (3), the model learns to factorize the user-item matrix using the ground truth, while Axiom (4) acts as a kind of regularization for the latent factors of the MF model.
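The two axioms can be sketched in plain Python to show how their truth values are computed from the groundings. This is only an illustrative sketch: the helper names, the toy factors, and the choice to quantify Axiom (4) over every user, movie, and genre of the toy data (instead of a sampled batch) are assumptions, and the product configuration is used without the stability corrections of LTNtorch.

```python
import math

def me_p(values, p=2):
    """Generalized mean w.r.t. the error, used here for the universal quantifier."""
    return 1.0 - (sum((1.0 - v) ** p for v in values) / len(values)) ** (1.0 / p)

def likes(U, I, u, i):
    """MF prediction: dot product of user and movie latent factors."""
    return sum(a * b for a, b in zip(U[u], I[i]))

def sim(pred, target, alpha):
    """Similarity predicate: exp(-alpha * (pred - target)^2)."""
    return math.exp(-alpha * (pred - target) ** 2)

def axiom_3(U, I, triples, alpha, p=2):
    """Diagonal quantification: one truth value per observed (u, i, r) triple."""
    return me_p([sim(likes(U, I, u, i), r, alpha) for u, i, r in triples], p)

def axiom_4(U, I, likes_genre, has_genre, alpha, p=2):
    """Full quantification over every (user, movie, genre) combination."""
    truths = []
    for u in range(len(U)):
        for i in range(len(I)):
            for g in range(len(has_genre[i])):
                # not LikesGenre(u, g) AND HasGenre(i, g), product t-norm
                antecedent = (1.0 - likes_genre[u][g]) * has_genre[i][g]
                # Sim(Likes(u, i), rating_-), with rating_- grounded as 0
                consequent = sim(likes(U, I, u, i), 0.0, alpha)
                # Reichenbach implication
                truths.append(1.0 - antecedent + antecedent * consequent)
    return me_p(truths, p)

# Toy data (hypothetical): 2 users, 2 movies, 1 genre, 1 latent factor.
U = [[1.0], [0.0]]
I = [[1.0], [0.0]]
D = [(0, 0, 1), (1, 1, 0)]
likes_genre = [[1], [0]]  # user 1 does not like genre 0
has_genre = [[1], [1]]    # both movies belong to genre 0

sat_3 = axiom_3(U, I, D, alpha=0.1)
sat_4 = axiom_4(U, I, likes_genre, has_genre, alpha=0.1)
# If user 1 scores movie 0 highly, Axiom (4) is partially violated.
sat_4_violated = axiom_4([[1.0], [1.0]], I, likes_genre, has_genre, alpha=0.1)
print(sat_3, sat_4, sat_4_violated)
```

In the first configuration both axioms are fully satisfied; in the second, the prediction for a movie whose genre the user dislikes is high, so the truth value of Axiom (4) drops below 1.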

4.3. Learning of the LTN

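The objective of our LTN is to learn the latent factors in $\mathbf{U}$ and $\mathbf{I}$ such that the axioms in the knowledge base $\mathcal{K} = \{\phi_1, \phi_2\}$ are maximally satisfied, namely $\operatorname{argmax}_{\theta} \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_{(\mathit{user}, \mathit{movie}, \mathit{rating}) \leftarrow \mathcal{D}}(\phi \mid \theta)$², where $\theta = \{\mathbf{U}, \mathbf{I}\}$. In practice, this objective corresponds to the following loss function:

$$L(\theta) = \Big(1 - \operatorname{SatAgg}_{\phi \in \mathcal{K}} \mathcal{G}_{(\mathit{user}, \mathit{movie}, \mathit{rating}) \leftarrow \mathcal{B}}(\phi \mid \theta)\Big) + \lambda \|\theta\|^2 \qquad (5)$$

where $\mathcal{B}$ denotes a batch of training triples randomly sampled from $\mathcal{D}$. An $L_2$ regularization term has been added to the loss to prevent overfitting. Hyper-parameter $\lambda$ is used to define the strength of the regularization. Notice that the loss does not specify how the variable $\mathit{genre}$ is grounded. Its grounding depends on the sampled batch $\mathcal{B}$. In our experiments, we grounded it with the sequence of genres of the movies in the batch.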

It is worth highlighting that the loss function depends on the semantics used to approximate the logical connectives, quantifiers, and formula aggregating operator. In our experiments, we used the stable product configuration, a stable version of the product configuration introduced in [23]. Then, we selected $ME_p$ as formula aggregating operator, with $p = 2$.

² In the notation, $(\mathit{user}, \mathit{movie}, \mathit{rating}) \leftarrow \mathcal{D}$ means that variables $\mathit{user}$, $\mathit{movie}$, and $\mathit{rating}$ are grounded with the triples taken from the dataset $\mathcal{D}$, namely $\mathit{user}$ takes the sequence of user indexes, $\mathit{movie}$ the sequence of movie indexes, and $\mathit{rating}$ the sequence of ratings.
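The loss of Equation (5) can be sketched as follows, assuming the truth value of each axiom on the current batch has already been computed. The function `ltn_loss` and its arguments are hypothetical names; the formula aggregator is taken to be the generalized mean w.r.t. the error with p = 2, as stated above, without the stability corrections of the stable product configuration.

```python
def me_p(values, p=2):
    """Generalized mean w.r.t. the error, used here as the SatAgg operator."""
    return 1.0 - (sum((1.0 - v) ** p for v in values) / len(values)) ** (1.0 / p)

def ltn_loss(axiom_truths, params, lam, p=2):
    """(1 - SatAgg over the axioms) plus lam times the squared norm of the parameters."""
    sat_agg = me_p(axiom_truths, p)
    l2 = sum(w * w for w in params)
    return (1.0 - sat_agg) + lam * l2

# Both axioms fully satisfied and zero parameters: the loss vanishes.
perfect = ltn_loss([1.0, 1.0], params=[0.0, 0.0], lam=0.001)
# One axiom only half satisfied: the loss grows accordingly.
partial = ltn_loss([0.5, 1.0], params=[1.0, 2.0], lam=0.01)
print(perfect, partial)
```

Minimizing this quantity is equivalent to maximizing the aggregated satisfaction of the knowledge base, up to the regularization term.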

5. Experiments

This section presents the experiments we have performed with our method. They have been executed on an Apple MacBook Pro (2019) with a 2.6 GHz 6-Core Intel Core i7. The model has been implemented in Python using PyTorch. In particular, we used the LTNtorch³ library. Our source code is publicly available⁴.

5.1. Dataset

In our experiments, we used the MindReader [19] dataset. It contains 102,160 explicit ratings collected from 1,174 real users on 10,030 entities (e.g., movies, actors, movie genres) taken from a knowledge graph in the movie domain. The explicit ratings in the dataset can be of three types: like (1), dislike (-1), or unknown (0). The dataset is subdivided into 10 splits. In our experiments, we used split 0. Each split has a training set, a validation set, and a test set. The training set contains both ratings given on movies and on the other entities, while validation and test sets contain only ratings given on movies. The validation and test sets are built in such a way as to perform a leave-one-out evaluation. In particular, for each user of the training set, one random positive movie rating is held out for the validation set, and one for the test set. The validation/test example of the user is completed by adding 100 randomly sampled negative movie ratings from the dataset. To improve the quality of the dataset, we removed the unknown ratings. Moreover, we removed the top 2% of popular movies from the test set to see how the model performs on non-trivial recommendations, as suggested in [19]. Afterward, we considered only the training ratings given on movies and movie genres since our model uses only this information. After these steps, we converted the negative ratings from -1 to 0. Our final dataset contains 962 users, 3,034 movies, 164 genres, 16,351 ratings on movies, and 10,889 ratings on movie genres. The density of the user-movie ratings is 0.37%.
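Two of the preprocessing steps described above (dropping unknown ratings and mapping dislikes from -1 to 0) can be sketched as a single pass over the rating triples. The triple layout `(user, entity, rating)` and the function name `preprocess` are assumptions for illustration; the actual MindReader files may be organized differently.

```python
def preprocess(ratings):
    """Drop unknown ratings (0), keep likes as 1, and map dislikes (-1) to 0."""
    return [(user, entity, 1 if rating == 1 else 0)
            for user, entity, rating in ratings
            if rating != 0]

raw = [(0, 10, 1), (0, 11, 0), (1, 10, -1)]
clean = preprocess(raw)
print(clean)
```

The filter must run before the remapping, otherwise unknown ratings and converted dislikes would both be encoded as 0 and become indistinguishable.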

5.2. Experimental setting

In our experiments, we compared the performance of three models: (1) a standard MF model trained on the movie ratings of MindReader using Equation (1), denoted as MF; (2) an LTN model trained on the movie ratings of MindReader using Equation (5) with $\mathcal{K} = \{\phi_1\}$, denoted as LTN; and (3) an LTN model trained on the movie and genre ratings of MindReader using Equation (5) with $\mathcal{K} = \{\phi_1, \phi_2\}$, denoted as LTNgenres. To compare the performance of the models, we used two widely used ranking-based metrics, namely hit@k and ndcg@k, explained in Appendix A.

In our experiments, we used the following procedure: (1) we generated additional training sets by randomly sampling 80%, 60%, 40%, and 20% of the movie ratings of each user from the entire training set, referred to as 100%. Then, (2) for each training set $Tr \in \{100\%, 80\%, 60\%, 40\%, 20\%\}$ and for each model $M \in \{\text{MF}, \text{LTN}, \text{LTNgenres}\}$: (2a) we performed a grid search of model $M$ on training set $Tr$ to find the best hyper-parameters on the validation set using hit@10 as validation metric; then, (2b) we tested the performance of the best model on the test set in terms of hit@10 and ndcg@10. We repeated this procedure 30 times using seeds from 0 to 29. The test metrics have been averaged across these runs and reported in Table 1. Due to computational time, the grid search has been computed only for the first run. Starting from the second run, step (2a) is replaced with the training of model $M$ on the training set $Tr$ with the best hyper-parameters found during the first run. A description of the hyper-parameters tested in the grid searches as well as the training details of the models is given in Appendix B.

³ https://github.com/logictensornetworks/LTNtorch

⁴ https://github.com/tommasocarraro/LTNrec
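Step (1) of the procedure, the per-user subsampling of the training ratings, might look like the sketch below. The function name, the rounding rule, and the choice to keep at least one rating per user are assumptions that the text does not specify.

```python
import random
from collections import defaultdict

def subsample_per_user(triples, fraction, seed):
    """Keep a random `fraction` of each user's (user, movie, rating) triples."""
    rng = random.Random(seed)  # seeded generator, so that each run is reproducible
    by_user = defaultdict(list)
    for triple in triples:
        by_user[triple[0]].append(triple)
    sampled = []
    for user in sorted(by_user):
        ratings = by_user[user]
        # Assumption: round to the nearest integer and never drop a user entirely.
        keep = max(1, round(fraction * len(ratings)))
        sampled.extend(rng.sample(ratings, keep))
    return sampled

# Toy training set: user 0 has 10 ratings, user 1 has 5.
toy = [(0, i, 1) for i in range(10)] + [(1, i, 0) for i in range(5)]
sample_20 = subsample_per_user(toy, 0.2, seed=0)
print(len(sample_20))
```

Sampling within each user, rather than over the whole set of triples, keeps every user represented in the sparser training sets.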

6. Results

A comparison between MF, LTN, and LTNgenres is reported in Table 1. The table reports the performance of the three models on a variety of tasks with different sparsity. By looking at the table, it is possible to observe that LTN outperforms MF in all five tasks. In particular, for the dataset with 20% of training ratings, the improvement is drastic (27.33% on hit@10). We want to emphasize that the two models only differ in the loss function. This demonstrates that the loss based on fuzzy logic semantics of LTN is beneficial to deal with the sparsity of data. Then, with the addition of knowledge regarding the users' tastes across the movie genres, it is possible to further improve the results, as shown in the last column of the table. LTNgenres outperforms the other models on almost all the tasks. For the dataset with 20% of the ratings, the hit@10 of LTNgenres is slightly worse compared to LTN. This could be related to the quality of the training ratings sampled from the original dataset. This is also suggested by the higher standard deviation associated with the datasets with higher sparsity. For considerations about the training times of the models, refer to Appendix C.

7. Conclusions

In this paper, we proposed to use Logic Tensor Networks to tackle the top-n recommendation task. We showed how, by design, LTN makes it easy to integrate side information inside a recommendation model. We compared our LTN models with a standard MF model, in a variety of tasks with different sparsity, showing the benefits provided by the background knowledge, especially when the task is challenging due to data scarcity.

References

[13] M. Polato, F. Aiolli, Exploiting sparsity to build efficient kernel based collaborative filtering for top-n item recommendation, Neurocomputing 268 (2017) 17–26. Advances in artificial neural networks, machine learning and computational intelligence. URL: https://www.sciencedirect.com/science/article/pii/S0925231217307592. doi:10.1016/j.neucom.2016.12.090.

[14] P. Bhargava, T. Phan, J. Zhou, J. Lee, Who, what, when, and where: Multi-dimensional collaborative recommendations using tensor factorization on sparse user-generated data, in: Proceedings of the 24th International Conference on World Wide Web, WWW '15, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2015, pp. 130–140. URL: https://doi.org/10.1145/2736277.2741077. doi:10.1145/2736277.2741077.

[15] S. Rendle, Factorization machines, in: 2010 IEEE International Conference on Data Mining, 2010, pp. 995–1000. doi:10.1109/ICDM.2010.127.

[16] X. Xin, B. Chen, X. He, D. Wang, Y. Ding, J. Jose, CFM: Convolutional factorization machines for context-aware recommendation, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 3926–3932. URL: https://doi.org/10.24963/ijcai.2019/545. doi:10.24963/ijcai.2019/545.

[17] Y. Zhang, X. Chen, Explainable recommendation: A survey and new perspectives, Foundations and Trends® in Information Retrieval 14 (2020) 1–101. URL: https://doi.org/10.1561/1500000066. doi:10.1561/1500000066.

[18] T. Carraro, M. Polato, F. Aiolli, A look inside the black-box: Towards the interpretability of conditioned variational autoencoder for collaborative filtering, in: Adjunct Publication of the 28th ACM Conference on User Modeling, Adaptation and Personalization, UMAP '20 Adjunct, Association for Computing Machinery, New York, NY, USA, 2020, pp. 233–236. URL: https://doi.org/10.1145/3386392.3399305. doi:10.1145/3386392.3399305.

[19] A. H. Brams, A. L. Jakobsen, T. E. Jendal, M. Lissandrini, P. Dolog, K. Hose, MindReader: Recommendation over knowledge graph entities with explicit user ratings, CIKM '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2975–2982. URL: https://doi.org/10.1145/3340531.3412759. doi:10.1145/3340531.3412759.

[20] T. R. Besold, A. d. Garcez, S. Bader, H. Bowman, P. Domingos, P. Hitzler, K.-U. Kuehnberger, L. C. Lamb, D. Lowd, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon, G. Zaverucha, Neural-symbolic learning and reasoning: A survey and interpretation, 2017. URL: https://arxiv.org/abs/1711.03902. doi:10.48550/ARXIV.1711.03902.

[21] L. D. Raedt, K. Kersting, Statistical Relational Learning, Springer US, Boston, MA, 2010, pp. 916–924. URL: https://doi.org/10.1007/978-0-387-30164-8_786. doi:10.1007/978-0-387-30164-8_786.

[22] A. Daniele, L. Serafini, Neural networks enhancement with logical knowledge, 2020. URL: https://arxiv.org/abs/2009.06087. doi:10.48550/ARXIV.2009.06087.

[23] S. Badreddine, A. d'Avila Garcez, L. Serafini, M. Spranger, Logic tensor networks, Artificial Intelligence 303 (2022) 103649. URL: https://doi.org/10.1016/j.artint.2021.103649. doi:10.1016/j.artint.2021.103649.

[24] H. Chen, S. Shi, Y. Li, Y. Zhang, Neural collaborative reasoning, in: Proceedings of the Web Conference 2021, ACM, 2021. URL: https://doi.org/10.1145/3442381.3449973. doi:10.1145/3442381.3449973.

[25] H. Chen, Y. Li, S. Shi, S. Liu, H. Zhu, Y. Zhang, Graph collaborative reasoning, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 75–84. URL: https://doi.org/10.1145/3488560.3498410. doi:10.1145/3488560.3498410.

[26] Y. Xian, Z. Fu, H. Zhao, Y. Ge, X. Chen, Q. Huang, S. Geng, Z. Qin, G. de Melo, S. Muthukrishnan, Y. Zhang, CAFE: Coarse-to-fine neural symbolic reasoning for explainable recommendation, in: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, CIKM '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1645–1654. URL: https://doi.org/10.1145/3340531.3412038. doi:10.1145/3340531.3412038.

[27] P. Kouki, S. Fakhraei, J. Foulds, M. Eirinaki, L. Getoor, HyPER: A flexible and extensible probabilistic framework for hybrid recommender systems, RecSys '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 99–106. URL: https://doi.org/10.1145/2792838.2800175. doi:10.1145/2792838.2800175.

[28] A. Kimmig, S. Bach, M. Broecheler, B. Huang, L. Getoor, A short introduction to probabilistic soft logic, Mansinghka, Vikash, 2012, pp. 1–4. URL: https://lirias.kuleuven.be/retrieve/204697.

[29] R. Catherine, W. Cohen, Personalized recommendations using knowledge graphs: A probabilistic logic programming approach, in: Proceedings of the 10th ACM Conference on Recommender Systems, RecSys '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 325–332. URL: https://doi.org/10.1145/2959100.2959131. doi:10.1145/2959100.2959131.

[30] M. Gridach, Hybrid deep neural networks for recommender systems, Neurocomputing 413 (2020) 23–30. URL: https://www.sciencedirect.com/science/article/pii/S0925231220309966. doi:10.1016/j.neucom.2020.06.025.

[31] E. van Krieken, E. Acar, F. van Harmelen, Analyzing differentiable fuzzy logic operators, Artificial Intelligence 302 (2022) 103602. URL: https://doi.org/10.1016/j.artint.2021.103602. doi:10.1016/j.artint.2021.103602.

A. Metrics

To validate and test our models, we selected two widely used ranking-based metrics, namely

• hit@k: Hit Ratio measures whether a testing item is placed in the top-k positions of the ranking, considering the presence of an item as a hit;
• ndcg@k: Normalized Discounted Cumulative Gain measures the quality of the recommendation based on the position of the target item in the ranking. In particular, it uses a monotonically increasing discount to emphasize the importance of higher ranks versus lower ones.

Formally, let us define $\omega(r)$ as the item at rank $r$, $\mathbb{I}[\cdot]$ as the indicator function, and $I_u$ as the set of held-out items for user $u$. hit@k for user $u$ is defined as

$$\text{hit@}k(u, \omega) = \mathbb{I}\left[\left(\sum_{r=1}^{k} \mathbb{I}[\omega(r) \in I_u]\right) \geq 1\right].$$

Truncated discounted cumulative gain (dcg@k) for user $u$ is defined as

$$\text{dcg@}k(u, \omega) = \sum_{r=1}^{k} \frac{2^{\mathbb{I}[\omega(r) \in I_u]} - 1}{\log(r + 1)}.$$

ndcg@k is the dcg@k normalized by the best possible dcg@k, obtained when all the held-out items are ranked at the top. Notice that in this paper $|I_u| = 1$.
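The two metrics can be written directly from the definitions above. The sketch below is a plain-Python illustration in which `ranking` is a list of item identifiers sorted by decreasing predicted score and `held_out` is the (non-empty) set of held-out items; both names are assumptions made for this example.

```python
import math

def hit_at_k(ranking, held_out, k):
    """1.0 if at least one held-out item appears in the top-k positions, else 0.0."""
    return 1.0 if any(item in held_out for item in ranking[:k]) else 0.0

def dcg_at_k(ranking, held_out, k):
    """Truncated discounted cumulative gain with binary relevance."""
    return sum((2 ** (1 if item in held_out else 0) - 1) / math.log(rank + 1)
               for rank, item in enumerate(ranking[:k], start=1))

def ndcg_at_k(ranking, held_out, k):
    """dcg@k divided by the dcg@k of an ideal ranking (held-out items on top)."""
    ideal = sum(1.0 / math.log(rank + 1)
                for rank in range(1, min(k, len(held_out)) + 1))
    return dcg_at_k(ranking, held_out, k) / ideal

ranking = ["a", "b", "c", "d"]  # items sorted by decreasing predicted score
held_out = {"b"}                # a single held-out item, as in this paper

print(hit_at_k(ranking, held_out, 1), hit_at_k(ranking, held_out, 2))
print(ndcg_at_k(ranking, held_out, 2))
```

With a single held-out item, ndcg@k reduces to log(2) / log(rank + 1) when the item is within the top k and to 0 otherwise, so the base of the logarithm does not affect the result.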

B. Training details

The hyper-parameters tested during the grid searches explained in Section 5.2 vary depending on the model. For all the models, we tried a number of latent factors $k \in \{1, 5, 10, 25\}$, regularization coefficient $\lambda \in \{0.001, 0.0001\}$, batch size in $\{32, 64\}$, and whether it was better to add users' and items' biases to the model. For LTN and LTNgenres, we tried $\alpha \in \{0.05, 0.1, 0.2\}$ for the predicate Sim and used $p = 2$ for the aggregator $ME_p$ of Axiom (3). For LTNgenres, we tried $p \in \{2, 5\}$ for the aggregator $ME_p$ of Axiom (4). Notice that $\lim_{p \to \infty} ME_p(x_1, \dots, x_n) = \min\{x_1, \dots, x_n\}$.

Intuitively, $p$ offers flexibility to account for outliers in the data. The higher the $p$, the more focus the model will have on the outliers.
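The effect of the aggregator's exponent can be checked numerically. The sketch below evaluates the generalized mean w.r.t. the error on four hypothetical truth values, one of which is a clear outlier; the helper name `me_p` and the chosen values are assumptions for illustration.

```python
def me_p(values, p):
    """Generalized mean w.r.t. the error: 1 - (mean((1 - v)^p))^(1/p)."""
    return 1.0 - (sum((1.0 - v) ** p for v in values) / len(values)) ** (1.0 / p)

truths = [0.9, 0.9, 0.9, 0.1]  # three well-satisfied instances and one outlier
results = {p: me_p(truths, p) for p in (1, 2, 5, 50)}
for p, value in results.items():
    print(p, round(value, 4))
```

For p = 1 the aggregator is the arithmetic mean of the truth values; as p grows, the result moves towards the minimum (0.1 here), so the least satisfied instance increasingly dominates the aggregated truth value.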

For all the models, the latent factors U and I, for users and items, respectively, have been randomly initialized using the Glorot initialization, while the biases with values sampled from a normal distribution with 0 mean and unitary variance. All the models have been trained for 200 epochs by using the Adam optimizer with a learning rate of 0.001. For each training, we used early stopping to stop the learning if after 20 epochs no improvements were found on the validation metric (i.e., hit@10).
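The search space described in this appendix can be enumerated with `itertools.product`. The dictionary keys and the function `grid` below are hypothetical names chosen for this sketch; only the candidate values come from the text.

```python
from itertools import product

def grid(model):
    """Enumerate the hyper-parameter configurations tried for a given model."""
    space = {
        "k": [1, 5, 10, 25],           # number of latent factors
        "lam": [0.001, 0.0001],        # regularization coefficient
        "batch_size": [32, 64],
        "use_biases": [True, False],   # whether to add user and item biases
    }
    if model in ("LTN", "LTNgenres"):
        space["alpha"] = [0.05, 0.1, 0.2]  # smoothness of the Sim predicate
    if model == "LTNgenres":
        space["p_axiom_4"] = [2, 5]        # exponent of the aggregator of Axiom (4)
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in product(*(space[key] for key in keys))]

for model in ("MF", "LTN", "LTNgenres"):
    print(model, len(grid(model)))
```

Under these assumptions the grids contain 32, 96, and 192 configurations for MF, LTN, and LTNgenres, respectively, which helps explain why the grid search was run only for the first seed.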

C. Training time

A comparison of the training times required by the models on the different datasets is presented in Table 2. The models have been trained for 200 epochs with a learning rate of 0.001, batch size of 64, one latent factor (i.e., $k = 1$), without bias terms, and without early stopping. The other hyper-parameters do not affect training time. In particular, LTNgenres increases the time complexity considerably. This is due to Axiom (4), which has to be evaluated for each possible combination of users, items, and genres. This drawback can limit the application of LTN to datasets with a higher number of users and items. However, it is possible to reduce training time using GPUs or by designing logical axioms which make use of diagonal quantification.

D. Intuition of Real Logic grounding

In Real Logic, differently from first-order logic, a variable $x$ is grounded as a sequence of $n_x$ individuals (i.e., tensors) from a domain, with $n_x \in \mathbb{N}^+$, $n_x > 0$. As a direct consequence, a term $t(x)$ or a formula $\mathrm{P}(x)$, with a free variable $x$, is grounded to a sequence of $n_x$ values too. For example, $\mathrm{P}(x)$ returns a vector in $[0, 1]^{n_x}$, namely $\langle \mathrm{P}(x_i) \rangle_{i=1}^{n_x}$, where $x_i$ is the $i$-th individual of $x$. Similarly, $t(x)$ returns a matrix in $\mathbb{R}^{n_x \times d}$, assuming that $t$ maps to individuals in $\mathbb{R}^{d}$. This formalization is intuitively extended to terms and formulas with arity greater than one. In such cases, Real Logic organizes the output tensor in such a way that it has a dimension for each free variable involved in the expression. For instance, $t_2(x, y)$ returns a tensor in $\mathbb{R}^{n_x \times n_y \times d}$, assuming that $t_2$ maps to individuals in $\mathbb{R}^{d}$. In particular, at position $(i, j)$ there is the evaluation of $t_2(x_i, y_j)$, where $x_i$ denotes the $i$-th individual of $x$ and $y_j$ the $j$-th individual of $y$. Similarly, $\mathrm{P}_2(x, y)$ returns a tensor in $[0, 1]^{n_x \times n_y}$, where at position $(i, j)$ there is the evaluation of $\mathrm{P}_2(x_i, y_j)$.

The connective operators are applied element-wise to the tensors in input. For instance, $\neg \mathrm{P}_2(x, y)$ returns a tensor in $[0, 1]^{n_x \times n_y}$, where at position $(i, j)$ there is the evaluation of $\neg \mathrm{P}_2(x_i, y_j)$, namely $N_S$ (i.e., $\neg$) is applied to each truth value in the tensor $\mathrm{P}_2(x, y) \in [0, 1]^{n_x \times n_y}$. For binary connectives, the behavior is similar. For instance, let $\mathrm{Q}$ be a predicate symbol and $z$ a variable. Then, $\mathrm{P}_2(x, y) \wedge \mathrm{Q}(x, z)$ returns a tensor in $[0, 1]^{n_x \times n_y \times n_z}$, where at position $(i, j, k)$ there is the evaluation of the formula on the $i$-th individual of $x$, $j$-th individual of $y$, and $k$-th individual of $z$.

The quantifiers aggregate the dimension that corresponds to the quantified variable. For instance, $\forall x\, \mathrm{P}_2(x, y)$ returns a tensor in $[0, 1]^{n_y}$, namely the aggregation is performed across the dimension of $x$. Since $y$ is the only free variable remaining in the expression, the output has one single dimension, corresponding to the dimension of $y$. Specifically, the framework computes $\mathrm{P}_2(x, y) \in [0, 1]^{n_x \times n_y}$ first, then it aggregates the dimension corresponding to $x$. Similarly, $\forall (x, y)\, \mathrm{P}_2(x, y)$ returns a scalar in $[0, 1]$, namely the aggregation is performed across the dimensions of both variables $x$ and $y$. In the case of diagonal quantification, the framework behaves differently. For instance, $\forall\, \mathrm{Diag}(x, y)\, \mathrm{P}_2(x, y)$, where $x$ and $y$ are two variables with the same number of individuals $n_x = n_y$, returns a scalar in $[0, 1]$, which is the result of the aggregation of $n_x$ truth values, namely $\mathrm{P}_2(x_1, y_1), \mathrm{P}_2(x_2, y_2), \dots, \mathrm{P}_2(x_{n_x}, y_{n_x})$. Without diagonal quantification (i.e., $\forall (x, y)\, \mathrm{P}_2(x, y)$), the framework performs an aggregation across the dimensions of both variables, involving $n_x^2$ values, namely $\mathrm{P}_2(x_1, y_1), \mathrm{P}_2(x_1, y_2), \dots, \mathrm{P}_2(x_{n_x}, y_{n_x - 1}), \mathrm{P}_2(x_{n_x}, y_{n_x})$. Intuitively, $\forall (x, y)$ aggregates all the values in $[0, 1]^{n_x \times n_y}$, while $\forall\, \mathrm{Diag}(x, y)$ aggregates only the values in the diagonal.
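The difference between the three kinds of quantification can be illustrated on a small table of truth values. In the sketch below, `P2[i][j]` stands for the truth value of the binary predicate on the i-th individual of the first variable and the j-th individual of the second; the values, the helper `me_p`, and the use of nested lists instead of tensors are assumptions for illustration.

```python
def me_p(values, p=2):
    """Generalized mean w.r.t. the error, used to aggregate universal quantifiers."""
    return 1.0 - (sum((1.0 - v) ** p for v in values) / len(values)) ** (1.0 / p)

# Toy truth values: rows index the individuals of x, columns those of y.
P2 = [[1.0, 0.2],
      [0.4, 1.0]]
n_x, n_y = len(P2), len(P2[0])

# forall x: aggregate the x dimension, leaving one truth value per individual of y.
forall_x = [me_p([P2[i][j] for i in range(n_x)]) for j in range(n_y)]

# forall (x, y): aggregate all n_x * n_y truth values into a scalar.
forall_xy = me_p([P2[i][j] for i in range(n_x) for j in range(n_y)])

# forall Diag(x, y): aggregate only the n_x matched pairs on the diagonal.
forall_diag = me_p([P2[i][i] for i in range(n_x)])

print(forall_x, forall_xy, forall_diag)
```

Here the diagonal pairs are fully satisfied while the off-diagonal ones are not, so the diagonal quantification returns 1 whereas the full quantification returns a lower value. The diagonal version also aggregates n_x values instead of n_x * n_y, which is the source of the computational saving mentioned in Appendix C.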

[1] Ricci, Rokach, Shapira, Recommender Systems: Introduction and Challenges, Springer, Boston, MA, 2015, pp. 1-34. URL: https://doi.org/10.1007/978-1-4899-7637-6_1. doi: 10.1007/978-1-4899-7637-6_1.

[2] Su, T. M. Khoshgoftaar, A survey of collaborative filtering techniques, Adv. in Artif. Intell. 2009 (2009). URL: https://doi.org/10.1155/2009/421425. doi: 10.1155/2009/421425.

[3] Koren, Bell, Advances in Collaborative Filtering, Springer, Boston, MA, 2011, pp. 145-186. URL: https://doi.org/10.1007/978-0-387-85820-3_5. doi: 10.1007/978-0-387-85820-3_5.

[4] Aiolli, Efficient top-n recommendation for very large scale binary rated datasets, in: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, Association for Computing Machinery, New York, NY, USA, 2013, pp. 273-280. URL: https://doi.org/10.1145/2507157.2507189. doi: 10.1145/2507157.2507189.

[5] Hu, Koren, Volinsky, Collaborative filtering for implicit feedback datasets, in: 2008 Eighth IEEE International Conference on Data Mining, 2008, pp. 263-272. doi: 10.1109/ICDM.2008.22.

[6] Ning, G. Karypis, Slim: Sparse linear methods for top-n recommender systems, in: 2011 IEEE 11th International Conference on Data Mining, 2011, pp. 497-506. doi: 10.1109/ICDM.2011.134.

[7] Polato, Aiolli, Boolean kernels for collaborative filtering in top-n item recommendation, Neurocomput. 286 (2018) 214-225. URL: https://doi.org/10.1016/j.neucom.2018.01.057. doi: 10.1016/j.neucom.2018.01.057.

[8] Liang, R. G. Krishnan, M. D. Hoffman, T. Jebara, Variational autoencoders for collaborative filtering, in: Proceedings of the 2018 World Wide Web Conference, WWW '18, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2018, pp. 689-698. URL: https://doi.org/10.1145/3178876.3186150. doi: 10.1145/3178876.3186150.

[9] Shenbin, Alekseev, Tutubalina, Malykh, S. I. Nikolenko, RecVAE: A new variational autoencoder for top-n recommendations with implicit feedback, in: Proceedings of the 13th International Conference on Web Search and Data Mining, ACM, 2020. URL: https://doi.org/10.1145%2F3336191.3371831. doi: 10.1145/3336191.3371831.

[10] Steck, Embarrassingly shallow autoencoders for sparse data, in: The World Wide Web Conference, WWW '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 3251-3257. URL: https://doi.org/10.1145/3308558.3313710. doi: 10.1145/3308558.3313710.

[11] He, Liao, Zhang, Nie, Hu, T.-S. Chua, Neural collaborative filtering, in: Proceedings of the 26th International Conference on World Wide Web, WWW '17, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 2017, pp. 173-182. URL: https://doi.org/10.1145/3038912.3052569. doi: 10.1145/3038912.3052569.

[12] Carraro, Polato, Aiolli, Conditioned variational autoencoder for top-n item recommendation, 2020. URL: https://arxiv.org/abs/2004.11141. doi: 10.48550/ARXIV.2004.11141.