S. Krajči (ed.): ITAT 2018 Proceedings, pp. 146–151, CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Štěpán Balcar

Preference learning by matrix factorization on island models

Štěpán Balcar
Charles University, Faculty of Mathematics and Physics, Prague, Czech Republic
Stepan.Balcar@mff.cuni.cz

Abstract: This paper presents island models, methods, implementation and experiments connecting stochastic optimization methods and the recommendation task (the collaborative one). Our models and methods are based on matrix factorization. Parallel runs of methods optimize the RMSE metric from an island-model point of view. The paper also comments on the architecture and some implementation decisions. We dealt with two research hypotheses: first, whether island models always bring improvement (we will show that they almost always do); second, whether an evolutionary algorithm always finds the best solution (this will be confirmed only on the smaller data). Experiments were carried out on the MovieLens 100k and 1M data.

1 Introduction

This paper studies recommender systems, especially learning user/customer preferences. The main idea of this paper is to connect this study to evolutionary island models. Here, island models serve as a computational tool for matrix factorization.

Instances of one stochastic optimization method run in parallel on an island model, searching the same state space. Such a method traverses the state space of solutions. The actual states the instances visit are influenced by (mutually independent) stochasticity; hence, different instances can visit different parts of the state space. Island model parallelization is based on the cooperation of the methods during the computation.

This resembles ensemble and/or hybrid learning, where different optimization methods are:

• run in parallel,
• allowed to run in different state spaces.

The solution (usually one of them) is chosen at the end by an additional aggregation/decision method.

The use of evolutionary computing for recommendation is surveyed in the paper of Horváth [1]. This survey shows that the usage of evolutionary methods is quite frequent in recommender systems (see also [2]). One of the approaches is model based. Our approach uses matrix factorization, and hence the model is constituted by the factors. In [1] they mention only one paper, [3], in which the evolution individuals are matrix factors. The authors of [3] provide results on ML100k data on minimizing the squared error depending on the size of the population and the probability of mutation (other parameters being fixed).

In this paper, the parameters of our methods were chosen after experiments on sample data. We will try to improve RMSE using parallelization. We will provide results on both ML100k and ML1M data.

The main contributions of this paper are:

• A multiple island model brings a statistically significant improvement of recommendation in comparison to single instance optimization.
• Our implementation is able to handle individuals (matrix factors) whose size is several orders of magnitude bigger than is usual in evolutionary computation.

This paper is organized as follows. Section 2 "Related work" concerns recommender systems and matrix factorization, evolutionary algorithms and the island model. Section 3 "Methods, models and parameters" sketches our view of the stochastic optimization methods and the parameters and settings of the island model used in the tests; its subsection "Data" describes the realization of the experiments, the design of the tests and the computing resources. Finally, Section 4 "Results" gives both a numeric and a graphic presentation of our results and discusses the confirmation of our hypotheses. After the conclusions, the paper ends with future work.

2 Related work

This section gives basic notation for recommender systems and the evolutionary computation used, and overviews some relevant literature.

2.1 Recommender systems and matrix factorization

It was the Netflix Prize (announced in October 2006) which excited the public and changed the landscape of research in the recommender systems area. The BellKor squad included Chris Volinsky and his AT&T colleagues Robert Bell and Yehuda Koren, along with four other engineers from the United States, Austria, Canada and Israel (BellKor's Pragmatic Chaos); on June 26, 2009, they finally crossed the 10% finish line. The main ingredient of the winning solution was matrix factorization (MF), see e.g. [4].

We focus on latent model based recommendation, which is a part of the collaborative filtering technique (Co) asked for in the Netflix Prize competition. Co was introduced by [4].

Let us denote by U the set of user ids and by I the set of item ids. The input data can be viewed as a partial function r : U × I → P, where P is the preference scale (e.g. one to five stars, or real numbers). An alternative representation is in the form of an algebraic matrix R ∈ P^{|U|×|I|}, with the dimensions representing users and items respectively. The rating of user i ∈ U of an item j ∈ I is the content of the corresponding cell r_{ij} ∈ P. Usually this matrix is very sparse, so in practice the input data are represented in the form of a database table R with schema R(uid = user id, iid = item id, rating). Although the implementation uses a database table, the formal model of [4] is easier to describe in the language of matrices.
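To make the two representations concrete, here is a hypothetical toy example (our own illustration, not taken from the paper's implementation) that turns database-table triples R(uid, iid, rating) into the sparse matrix view:

```python
import numpy as np

# Hypothetical toy data: database-table rows R(uid, iid, rating)
# on a one-to-five star preference scale P.
table = [(0, 0, 5), (0, 2, 3), (1, 1, 4), (2, 0, 1), (2, 2, 2)]

n_users, n_items = 3, 3

# Matrix view: r is only a partial function U x I -> P, so cells
# without a rating stay NaN (in practice the matrix is very sparse).
R = np.full((n_users, n_items), np.nan)
for uid, iid, rating in table:
    R[uid, iid] = rating

print(R[0, 2])  # rating of user 0 for item 2 -> 3.0
```

The NaN cells are exactly the ratings that the factorized model described next is asked to approximate.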
The matrix R is factorized to a lower dimensional representation

$$R \approx U \times I^{T} = \hat{R} \qquad (1)$$

where U ∈ P^{|U|×d} and I ∈ P^{|I|×d} are the user and item latent factor matrices and × is the matrix product; d is much smaller than |U| and |I|. The approximation of R by R̂ is measured by

$$e_{ij}^2 = (r_{ij} - \hat{r}_{ij})^2 = \left( r_{ij} - \sum_{k=1}^{d} u_{ik}\, i_{jk} \right)^2 \qquad (2)$$

where r̂_{ij} serves as the approximation of the value r_{ij}.

Our main optimization method is SGD, the stochastic gradient descent method. For other approaches and dimensions of research in recommender systems we refer to [5].

2.2 Evolutionary algorithms

Evolutionary algorithms became popular after 1970, thanks to John Holland [6], as a means for solving optimization problems. The solution is represented as an individual; the set of individuals forms the population. An evolutionary algorithm is formed by the phases of initialization, selection (parent selection for the next generation offspring), crossover, mutation and evaluation of the whole population. The stochastic computation of individuals seeks convergence to the global optimum.

The contribution of evolutionary algorithms in the area of recommender systems was reviewed by Horváth and de Carvalho in [1]. This review analyzed more than 65 research papers using EC techniques for user recommendations. It analyzed the different approaches according to five aspects: (i) the recommendation technique used, (ii) the datasets employed in the experiments, (iii) the evaluation methods used in the experiments, (iv) the baselines used to compare with the proposed methods and (v) the reproducibility of the research.

Most of the nature-inspired algorithms reviewed in [1] find application in solving partial problems such as feature weighting extraction, model based approaches, clustering and function learning.

Computing latent factor models by means of evolutionary algorithms has emerged as a possible approach [3]. The available publications do not mention the parallelization of evolutionary algorithms in combination with the computation of latent factors.

Motivated by this, in our paper individuals are represented by the concatenation of U^T and I^T (like a worm d thick and |U| + |I| long):

$$\begin{pmatrix} u_{11} & \dots & u_{1|U|} & i_{11} & \dots & i_{1|I|} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ u_{d1} & \dots & u_{d|U|} & i_{d1} & \dots & i_{d|I|} \end{pmatrix}$$

Figure 1: Individual represented by the latent vector of users and the latent vector of items

In our experiments we use RMSE as the fitness function:

$$\mathrm{RMSE} = \sqrt{ \frac{ \sum_{i=1}^{|U|} \sum_{j=1}^{|I|} e_{ij}^2 }{ |U| \cdot |I| } } \qquad (3)$$
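A small sketch of evaluating one individual with the RMSE fitness of equation (3). The toy code below (names and sizes are our own illustration, not the paper's implementation) averages the squared errors of equation (2) over the cells that actually carry a rating:

```python
import numpy as np

def fitness_rmse(U, I, ratings):
    """RMSE fitness of the individual (U, I); `ratings` is a list of
    (i, j, r_ij) triples for the rated cells of R."""
    R_hat = U @ I.T  # the approximation R_hat = U x I^T of equation (1)
    sq_errors = [(r - R_hat[i, j]) ** 2 for i, j, r in ratings]  # e_ij^2
    return float(np.sqrt(np.mean(sq_errors)))

# A toy individual with latent width d = 1 that reproduces two ratings
# exactly, so its fitness (RMSE) is 0:
U = np.array([[1.0], [2.0]])   # |U| = 2 users
I = np.array([[1.0], [3.0]])   # |I| = 2 items
print(fitness_rmse(U, I, [(0, 0, 1.0), (1, 1, 6.0)]))  # -> 0.0
```

Lower fitness is better here, since the individual is a candidate minimizer of the prediction error.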
2.3 Island model

Island models are one of the approaches to parallelizing optimization algorithms. Their characterization is that there is no central master and the populations are distributed. An island model consists of parallel algorithms enriched by the sending and accepting of migrants. Migrants are individuals from the populations of other islands. The hope is that this propagation of genetic material can speed up convergence and help escape from the trap of local optima.

The parameters of an island model are the frequency of sending/receiving migrants and the way a migrant is chosen from the local population (remember that the parameters of the methods are fixed and were chosen after experiments on sample data). The optimal choice of these parameters depends on the type of optimization method and is a subject of study. These aspects of island models are studied in [7].

A specific type of island models, the heterogeneous ones, was used in an application to multi-criteria optimization by Martin Pilát and Roman Neruda [8]. They designed a new algorithm, MOGASOLS (Multiobjective Genetic Algorithm with Single Objective Local Search), combining a multi-criteria genetic algorithm (MOGA) with a single-criteria genetic algorithm (SOGA). It is proved that parallelization of the time-consuming MOGA method alone achieves worse results than running the MOGA and SOGA methods in parallel. The approach was tested with the NSGA-II [9] and IBEA [10] algorithms, and was further developed in [11].

The first who came up with heterogeneous island models (HeIM), the way we understand them, was Martin Pilát ([11]). In these models the islands may carry diverse computational methods, differing not only in parameters but also in the structure of the whole stochastic algorithm [11]. The choice of optimal methods for the given problem is the responsibility of a central planner, which replaces unsuccessful methods by more successful ones during the whole computation. In [11] the original Balcar's tool [12] was used.

In this paper this tool is used with homogeneous islands to find the optimal user and item latent factors. Different optimization techniques are then used to make a factor converge towards the optimal one. For illustration we give the formulas for stochastic gradient descent. Recalling e_{ij}^2, the next-generation values of the user and item factors equal

$$u'_{ik} = u_{ik} + \alpha \frac{\partial}{\partial u_{ik}} e_{ij}^2, \qquad i'_{jk} = i_{jk} + \alpha \frac{\partial}{\partial i_{jk}} e_{ij}^2 \qquad (4)$$

Note that in our implementation we do not consider a normalization factor; this is left for the future. Nevertheless, some normalization effect is obtained by the migration and synergy of the islands. The learning rate α is not fixed, just bounded from above. More on our system can be found in [12].

3 Methods, models and parameters

Island models were originally developed for the parallelization of evolutionary algorithms. In this paper we also use other stochastic optimization methods.

3.1 Stochastic optimization methods

Evolutionary algorithms try to balance the trade-off between exploration and exploitation. This property should help an evolutionary algorithm to escape from local extremes and simultaneously converge to the optimal solution. For this ability they pay with a higher time complexity, because of the parallel development of a bigger population. Because of this fact, on some types of problems the winners are other stochastic optimization methods; an example is the TSP solution by hill climbing with 2-opt ([13]). This is the main reason we consider several additional stochastic optimization methods (see Table 1).
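All these methods share the gradient step of equation (4), applied to a single sampled rating. A minimal sketch follows (our own toy code, written with the usual descent sign convention and, as in the paper, without a regularization term):

```python
import numpy as np

def sgd_step(U, I, i, j, r_ij, alpha):
    """One SGD step of equation (4) on the single rating r_ij.
    Moves the rows U[i] and I[j] against the gradient of e_ij^2,
    i.e. by alpha * 2 * e_ij times the other factor's row."""
    e_ij = r_ij - U[i] @ I[j]            # prediction error
    u_old = U[i].copy()
    U[i] += alpha * 2.0 * e_ij * I[j]    # descent direction for U[i]
    I[j] += alpha * 2.0 * e_ij * u_old   # descent direction for I[j]
    return abs(e_ij)

# Repeating the step on one rating drives its error toward zero:
U = np.array([[0.5, 0.5]])
I = np.array([[0.5, 0.5]])
errors = [sgd_step(U, I, 0, 0, 4.0, alpha=0.02) for _ in range(200)]
print(errors[0], errors[-1])  # the error shrinks monotonically here
```

The step size `alpha` and the toy sizes are illustrative only; in the paper α is not fixed, just bounded from above.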
All these methods use stochastic gradient descent. A general pointer to descriptions of stochastic optimization methods is [14].

Our implementation of the stochastic optimization methods was changed in two ways. First, we had to create operators enabling the methods to work with latent factor based individuals. Second, the methods were modified for a parallel run in island models with migration: they were enriched with the ability to cooperate. They receive individuals and enrich their local population with them. Please notice that only evolution and differential evolution have a bigger population; the remaining methods have a one-individual population, in which case enrichment means replacement. The methods provide their solution from the local population to the neighboring islands. Methods manipulate incoming individuals in concordance with the basic algorithm idea; e.g. the tabu search method inserts a newcomer into the tabu set.

An individual equals a solution. A solution is represented by the pair of latent factors of users and items (Figure 1). By multiplying these latent factors we get a matrix which represents our estimation of the ratings. The length of the first vector is the number of users and the length of the second vector equals the number of items. The size of the individual therefore depends on the size of the input problem. The width of the latent vector is an optional parameter.

3.2 Parameters and settings of the island model used in tests

Now we describe the island model, the environment in which the above methods will be parallelized. It is the environment which ensures the dissemination of genetic material. The main tool for this is migration. This migration does not mean only the exchange of the best solutions: we would like to spread across the system (between the islands) genetic material which has the potential to contribute to a further improvement of the population quality and to speed up the convergence. Migration is inspired by processes in nature ([15], [16]). Most island models exchange migrants in a synchronized way. In our system the exchange of migrants is not synchronized, so in fact our islands are more general.

Key parameters of island models, as nature shows ([17]), are the frequency of migration and the size of the local populations. The bigger the frequency of migration, the bigger the chance that the islands will help each other. On the other side, each hardware and software setup is limited by its data throughput. Matrix factorization needs migration of much bigger individuals than e.g. continuous optimization, where an individual is a point in an n-dimensional space. In this paper we therefore had to change the architecture of the model used in [11]. In [11] the input problems were TSP (traveling salesman problem of size cca 1000 cities), BP (bin packing, one-dimensional with 1000 items), CO (continuous optimization of a 10-dimensional function) and vertex cover (1000 vertices). A solution of preference learning by matrix factorization on island models is represented by much bigger individuals; a rough estimation of our state space is > #TSP^20.

Evolutionary algorithms use incoming individuals in groups and after a while. Hence, it is advantageous to store them in a front (queue). As the individuals are big and memory is limited, we had to limit the size of the fronts (see Table 2).

Table 1: Methods and parameters

Algorithms     | Tool                              | Parameters
HillClimbing   | SGD, random rating from each line | numberOfNeighbors=10
RandomSearch   | generation of latent vectors      |
Evolution      | uniform crossover + SGD           | popSize=10, mutationRate=0.9, crossRate=0.1, parentSelector=CompareTwoRandomSelectors
BruteForce     | SGD according to the I-th rating  |
TabuSearch     | SGD, random rating from each line | tabuModelSize=50
Sim.Annealing  | SGD, random rating from each line | temperature=10000, coolingRate=0.002
Diff.Evolution | differential crossover            | popSize=50, F=0.25

Table 2: Parameters of the island model

Parameter            | value
Number of iterations | 50, period = 60 seconds
Number of islands    | 4 (AMD Opteron 6376)
Neighbors of method  | 3 (distributed to everyone)
Migration frequency  | 5 seconds

3.3 Data

We used data from the movie recommendation domain in the experiments. The effectiveness of the parallelization has been verified on the datasets ml-100k (http://grouplens.org/datasets/movielens/100k/) and ml-1m (https://grouplens.org/datasets/movielens/1m/). The datasets are formed by movie evaluations (1–5 stars); for the trials we use the triples (user, movie, rating).

The training set consists of four fifths of the dataset; the rest is the test set. Counting the derivatives for SGD leverages the training set. The test set is used to evaluate the individuals.

3.4 Realization of the experiment

Our main idea is to increase the stochasticity of searching the state space (which is enormous for the movie data). The first level of stochasticity is enabled by the stochastic optimization method. The second level is enabled by several independent islands. The third level is attained by migration. Our system enables all of these. Moreover, all of these run in parallel, with mutually assisting (not competing) methods.

Figure 2: Architecture of the system (three islands carrying method agents, with a central Manager (planner) and a Monitor (statistics))

The use of this system lies in the implementation of the methods as agents, which can be developed as adaptive computing resources. The infrastructure of the system is made up of two central points: the Agent-manager, which manages the computation, and the Agent-monitor, which monitors the genetic material in the system.
In order to be able to obtain the best solution from such a computational resource, the computation must be continuously monitored (Figure 2). Solutions are represented by individuals that move unpredictably across the island model. We can never know when and where the best solution will appear (sometimes the best solution does not appear at the end, see Figure 3).

Our implementation uses the agent middleware Jade (http://jade.tilab.com/) to achieve a higher adaptivity of the methods (in our three levels of stochasticity). Here we were facing the main decision: whether to prefer higher adaptivity (based e.g. on Jade) or a better use of the effectiveness of specific hardware (based e.g. on C). From pure experimental curiosity we went along the agent based middleware framework direction.

The central brain of the multi-agent based system is the manager of migration, which directs the computation and measures the genetic material during the evolution.

The software allows multiple ways of monitoring. The monitor statistically processes the quality of the individuals. Another way of observing what happens in the system is to produce the pedigrees of individuals: the basic idea is to enrich every individual with a pedigree that contains information on the involvement of concrete islands in its creation.

For our experiments we used statistics computed from all monitored individuals in one iteration of the planner. We will present the results as a measure of the system, monitored along the planner iterations.

Our evolutionary methods (Table 1) use uniform crossover. In the mutation phase they apply stochastic gradient descent on a randomly chosen rating.

Differential evolution combines three latent vectors (individuals which are models/solutions): LV_1, which is a concatenation of U_1^T and I_1^T, and similarly LV_2 and LV_3. The result of the differential operator is the latent vector

$$LV_{new} = LV_1 + F \ast (LV_2 - LV_3)$$

where F is 0.25.
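The differential operator above can be sketched directly on flattened individuals (a toy illustration with made-up sizes, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(42)

# Each latent vector LV is one flattened individual: the concatenation
# of the user factors and the item factors of one solution.
dim = 2 * (4 + 5)  # d * (|U| + |I|) for a toy problem with d = 2
LV1, LV2, LV3 = (rng.normal(size=dim) for _ in range(3))

F = 0.25  # differential weight, as used in the paper

# LV_new = LV1 + F * (LV2 - LV3): the offspring keeps the shape of
# the parents, so it is again a valid pair of latent factors.
LV_new = LV1 + F * (LV2 - LV3)
print(LV_new.shape)
```

Because the operator is a plain linear combination, it applies to matrix-factor individuals exactly as it does to points in continuous optimization; only the dimension is much larger here.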
3.5 Test design and computing resources

Two pairs of tests were created, comparing single methods and island models and differing in the size of the input MovieLens dataset. The width of the latent vector was set to 10 (the best value after initial experiments on smaller samples). The tests were run with all the initial parameter settings (Table 3) 9 times; as our computations are nondeterministic, this makes a difference (see Figures 4 and 5). These are computationally demanding calculations.

Table 3: Parameters of the test

Parameter           | value
Number of runs      | 9
Computing nodes     | AMD Opteron 6376, 64 GB memory
Latent factor width | 10
Datasets            | ml-100k, ml-1m

Figure 3: Single methods on ML1m, median run (fitness RMSE vs. planner iterations; 1 iteration = 60 seconds)

4 Results

Our experiments validated two hypotheses: first, that evolution does not necessarily give the best solution, and second, that island models improve the results. We will present the results for the two datasets separately. We will show the best solution of 9 runs and the average of all runs (for the distribution see Figures 4 and 5).

Figure 4: Single methods vs. islands on ML100k, boxplot of results

On the smaller dataset the winner is always Evolution, and the islands always give a better solution.

On the bigger dataset the single-method winner is simulated annealing (in both minimum and average). With islands the winner is hill climbing (in both minimum and average). On the bigger data we cannot always observe the advantage brought by the parallelization by islands and migration.
This can be observed especially with simulated annealing. One possible explanation is that hill climbing does not risk going in a wrong direction as much as simulated annealing (sometimes) does. Bigger data bring a bigger state space and hence risk is necessary, but 50 iterations of the planner are probably not enough. In future research we will run island models with more iterations.

5 Conclusions

We have shown that island models are a good computing tool for recommender systems. Our experiments demonstrated the following. Island models brought a clear improvement on the smaller dataset. On the bigger data this also holds, except for simulated annealing. Moreover, on the bigger data evolution does not find the best solution.

Our implementation is publicly available on GitHub (https://github.com/sbalcar/distributedea) and hence enables the repeatability of our experiments (see [1], requirement (v)).

6 Future work

In our future work we plan to extend this approach to heterogeneous island models. We also plan to develop new planners specifically designed for the recommendation domain. We would like to consider also islands with statistical learning methods. A challenge is the extension to content based recommendation.
Table 4: Comparison of single methods and island models: dataset MFML100k, 9 runs

Methods               | Single-min | Islands-min | Single-average | Islands-average
BruteForce            | 0.98787115 | 0.94367838  | 0.99088728     | 0.94571058
DifferentialEvolution | 1.50105714 | 1.49096267  | 1.51771324     | 1.49957429
Evolution             | 0.88196787 | 0.87695296  | 0.88705747     | 0.87952888
HillClimbing          | 0.98047412 | 0.97400051  | 0.98207866     | 0.97824895
RandomSearch          | 1.60448414 | 1.58333447  | 1.61703977     | 1.60802325
SimulatedAnnealing    | 1.06110779 | 1.03108626  | 1.07797515     | 1.03806128
TabuSearch            | 0.98048607 | 0.97681064  | 0.98217126     | 0.97963446

Table 5: Comparison of single methods and island models: dataset MFML1m, 9 runs (in red there are violations of island improvement)

Methods               | Single-min | Islands-min | Single-average | Islands-average
BruteForce            | 1.22268511 | 1.18410586  | 1.23279212     | 1.2033589
DifferentialEvolution | 1.55227103 | 1.55057106  | 1.55780106     | 1.55422412
Evolution             | 1.10254503 | 1.07069968  | 1.13526207     | 1.09140823
HillClimbing          | 0.96832015 | 0.96727131  | 0.97065502     | 0.96840234
RandomSearch          | 1.65483811 | 1.66265835  | 1.66738744     | 1.6660426
SimulatedAnnealing    | 0.96655732 | 0.97118289  | 0.97048761     | 0.9729405
TabuSearch            | 0.96873215 | 0.96735123  | 0.97106628     | 0.96854622

Figure 5: Single methods vs. islands on ML1m, boxplot of results

References

[1] Tomáš Horváth and André C. P. L. F. de Carvalho. Evolutionary computing in recommender systems: a review of recent research. Natural Computing, 16(3):441–462, 2017.
[2] M. Rezaei and R. Boostani. Using genetic algorithm to enhance nonnegative matrix factorization initialization. In 2010 6th Iranian Conference on Machine Vision and Image Processing, pages 1–5, Oct 2010.
[3] Dariush Zandi Navgaran, Parham Moradi, and Fardin Akhlaghian. Evolutionary based matrix factorization method for collaborative filtering systems. In 2013 21st ICEE, pages 1–5, 2013.
[4] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
[5] L. Peska and P. Vojtas. Using implicit preference relations to improve recommender systems. Journal on Data Semantics, 6(1):15–30, 2017.
[6] John H. Holland. Adaptation in Natural and Artificial Systems. MIT Press, 1992.
[7] Darrell Whitley, Soraya Rana, and Robert B. Heckendorn. The island model genetic algorithm: On separability, population size and convergence. J. Comp. Inf. Tech., 7:33–47, 1998.
[8] Martin Pilát and Roman Neruda. Combining multiobjective and single-objective genetic algorithms in heterogeneous island model. In CEC 2010, pages 1–8, 2010.
[9] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan. A fast and elitist multiobjective genetic algorithm: NSGA-II. Trans. Evol. Comp, 6(2):182–197, 2002.
[10] Eckart Zitzler, Marco Laumanns, and Lothar Thiele. SPEA2: Improving the strength Pareto evolutionary algorithm. Technical report, ETH Zurich, 2001.
[11] S. Balcar and M. Pilát. Heterogeneous island model with re-planning of methods. In GECCO '18, accepted as poster, 2018.
[12] Štěpán Balcar. Heterogeneous island model with re-planning of methods (in Czech). Master thesis, Martin Pilát supervisor, 2017.
[13] G. A. Croes. A method for solving traveling-salesman problems. Operations Research, pages 791–812, 1958.
[14] R. K. Arora. Optimization: Algorithms and Applications. CRC Press, 2015.
[15] Jinliang Wang and Michael C. Whitlock. Estimating effective population size and migration rates from genetic samples over space and time. Genetics, 163(1):429–446, 2003.
[16] William Astle and David J. Balding. Population structure and cryptic relatedness in genetic association studies. Statist. Sci., 24(4):451–471, 2009.
[17] John Wakeley. The coalescent in an island model of population subdivision with variation among demes. Theoretical Population Biology, 59(2):133–144, 2001.

Acknowledgements. This work was supported by Ph.D. grant SVV2018-260451 under the supervision of Peter Vojtáš.