<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysing Compression Techniques for In-Memory Collaborative Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saúl Vargas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, University of Glasgow</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Following the recent trend of in-memory data processing, it is a usual practice to maintain collaborative filtering data in the main memory when generating recommendations in academic and industrial recommender systems. In this paper, we study the impact of integer compression techniques for in-memory collaborative filtering data in terms of space and time efficiency. Our results provide relevant observations about when and how to compress collaborative filtering data. First, we observe that, depending on the memory constraints, compression techniques may speed up or slow down the performance of state-of-the-art collaborative filtering algorithms. Second, after comparing different compression techniques, we find the Frame of Reference (FOR) technique to be the best option in terms of space and time efficiency under different memory constraints.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Collaborative Filtering</kwd>
        <kwd>Index Compression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        With the decrease of memory costs, having servers with
hundreds of GBs of main memory is nowadays an affordable
option [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. With such infrastructure, performing in-memory
data processing has become a common and feasible option
in both single- and multi-node environments. This trend can
be observed in different areas of computing systems such
as databases [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], search indices [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and recommendation
engines [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], more specifically in recommendation engines
relying on collaborative filtering (CF) techniques. Indeed,
keeping CF data in memory is a usual practice, particularly
with the publicly available datasets used for research purposes.
Real-world datasets, however, are usually several orders of
magnitude larger than the public ones, and thus efficient
representations of in-memory CF data may be needed. As
CF data is commonly represented as lists of numerical ids of
users and items, integer compression techniques [
        <xref ref-type="bibr" rid="ref1 ref3 ref4 ref6 ref9">1, 3, 4, 6,
9</xref>
        ] can be used to significantly reduce the amount of memory
required to represent them.
      </p>
      <p>
        In this work, we study the use of integer compression
techniques to compress CF data. Although there has been
prior work studying the benefits of compression techniques
for CF data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], that work focused on a scenario where the
data is mainly stored on disk. In that setting, data transfer
between disk and memory can be identified as the bottleneck
in the computation of recommendations and, consequently,
compression techniques consistently help in speeding up the
recommendation algorithms. In our fully in-memory setting,
depending on the available memory, compression techniques
may speed up computing time as well, but may also slow it
down. Furthermore, we explore a wider range of compression
techniques, finding that the Frame of Reference (FOR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
technique offers the best solution in terms of space and time
efficiency among the compared approaches.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. EXPERIMENTAL SETUP</title>
      <p>In order to analyse the effect of compression techniques
for CF data, we conducted a series of experiments with
two well-known datasets: the dataset from the Netflix prize,
containing 100 million ratings from 480,000 users to 17,770
films, and the Yahoo Music dataset, containing 717 million
ratings from 1.8 million users to 136,736 songs. In both
datasets, ratings are given on a scale between 1 and 5 stars.
These are, to our knowledge, two of the largest CF datasets
available for academic research purposes.</p>
      <p>
        Both datasets use numerical ids for identifying users and
items. In the case of the Netflix dataset, we map the original
user ids to consecutive ids determined by the numerical order of
the original ids. More elaborate id re-assignment techniques,
which may lead to further compression efficiency [
        <xref ref-type="bibr" rid="ref11 ref2">2, 11</xref>
        ], are
left for future work. The preferences of each user/item are
represented with a list of sorted ids of items/users and a list
of numerical ratings. The lists of ids are compressed using a
variety of compression techniques: fixed-length coding
(fix-len), where each id is coded using the minimum number
of bits required to store the largest id, δ coding [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
Elias-Fano (EF) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Rice [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ζ3 coding [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Frame of Reference
(FOR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Results with more sophisticated variations of FOR,
such as PFOR, are omitted as they do not differ significantly
from those of the simpler, original FOR. With the exception of
fixed-length and Elias-Fano, id arrays are stored with delta-gaps.
Rating values on the 1-5 scale are simply compressed with
fixed-length coding, that is, using 3 bits to represent each
rating. We use RankSys (http://ir-uam.github.io/RankSys/),
a recommender systems framework written in Java and, on top
of it, the implementation of FOR in the JavaFastPFOR package
(https://github.com/lemire/JavaFastPFOR) and the dsiutils
package (http://dsiutils.di.unimi.it/) for the rest of the
techniques.
      </p>
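      <p>To make the two simplest ingredients above concrete, the following sketch (a toy Python illustration of ours, not the RankSys or dsiutils code) shows delta-gaps over a sorted id list and fix-len packing with the bit width of the largest value:</p>

```python
def delta_gaps(ids):
    """Replace sorted ids by gaps; gaps are smaller, so they need fewer bits."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def prefix_sums(gaps):
    """Invert delta_gaps: cumulative sums recover the original ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

def fixlen_pack(values):
    """fix-len coding: every value is stored with the bit width of the largest."""
    width = max(max(values).bit_length(), 1)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed, width

def fixlen_unpack(packed, width, n):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(n)]

# A sorted list of item ids for one user:
ids = [3, 7, 12, 31, 32]
gaps = delta_gaps(ids)              # [3, 4, 5, 19, 1]
packed, width = fixlen_pack(gaps)   # width 5: the largest gap is 19
assert prefix_sums(fixlen_unpack(packed, width, len(gaps))) == ids
```

      <p>Note that fix-len over raw ids needs the bit width of the largest id, whereas coding the gaps lets the width shrink with the density of the list, which is the intuition behind the gap-based codes compared above.</p>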
      <p>In order to measure the performance of the different
compression techniques in terms of time efficiency with respect
to uncompressed representations of the CF data, we generate
recommendations for 1,000 users randomly selected in both
datasets with the user-based nearest neighbours algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
provided by RankSys. For the purpose of simulating different
memory constraints and their effect on the time efficiency
of the recommendations, we selected for both datasets three
different settings for heap size (via the -Xmx parameter in
Java): a heap size in which the uncompressed datasets fit
in memory without problem (4.8 GB for Netflix and 32 GB
for Yahoo Music), a heap size slightly higher than what is
required to keep the dataset in memory (2.4 GB and 16
GB) and, finally, a heap size slightly higher than what is
required to allocate the data with fix-len coding but where
the uncompressed data cannot be allocated (0.8 GB and
6 GB). Moreover, recommendations were generated in a
multi-threaded environment using 8 parallel threads. This
simulates a realistic high-demand environment where many
of the benefits of caching are lost and there is a high demand
for memory for auxiliary data structures for the CF algorithm.
      </p>
      <p>
Table 1: Memory usage in MB of the Netflix and Yahoo Music
datasets with different compression techniques for user and
item ids. Best results in bold.
none: 1,608 (Netflix), 11,486 (Yahoo Music)
fix-len: 506, 4,051
δ: 489, 2,871
EF: <bold>229</bold>, 2,190
Rice: 241, <bold>2,130</bold>
ζ3: 266, 2,396
FOR: 273, 2,354
      </p>
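      <p>As a point of reference for the results that follow, the FOR technique can be sketched as below (a minimal Python illustration of the idea in [6], not the JavaFastPFOR implementation, which additionally works on fixed-size blocks with word-aligned layouts): each block is stored as a base value plus bit-packed fixed-width offsets from that base.</p>

```python
def for_encode(block):
    """Frame of Reference: store a base (the block minimum) plus
    fixed-width offsets from that base, bit-packed together."""
    base = min(block)
    width = max(max(v - base for v in block).bit_length(), 1)
    packed = 0
    for i, v in enumerate(block):
        packed |= (v - base) << (i * width)
    return base, width, len(block), packed

def for_decode(base, width, n, packed):
    """Unpack the offsets and add the base back."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(n)]

# Values in a block may be large, but their spread is small:
block = [100000, 100003, 100007, 100012]
base, width, n, packed = for_encode(block)
assert (base, width) == (100000, 4)     # offsets 0, 3, 7, 12 fit in 4 bits
assert for_decode(base, width, n, packed) == block
```

      <p>In the setup above FOR is applied to the delta-gapped id lists, so decoding a list amounts to this base-plus-offset reconstruction followed by a prefix sum.</p>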
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTAL EVALUATION</title>
      <p>The results of the experimental setup previously described
are summarised in Table 1 and Table 2. The results in
terms of memory usage in Table 1 indicate similar trends for
both datasets: while a simple fix-len encoding already reduces
the size of the CF data in memory notably (to about a third
in both datasets), the rest of the compression techniques
are able to further reduce the memory usage (to below 20%
of the uncompressed size in most cases), with EF and Rice
being the most space-efficient among the compared
alternatives.</p>
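      <p>These proportions can be checked directly against the Table 1 figures (a quick back-of-the-envelope check using the reported memory sizes in MB):</p>

```python
# Memory usage in MB from Table 1 (uncompressed, fix-len, and the
# most space-efficient technique per dataset).
uncompressed = {"Netflix": 1608, "Yahoo Music": 11486}
fixlen = {"Netflix": 506, "Yahoo Music": 4051}
best = {"Netflix": ("EF", 229), "Yahoo Music": ("Rice", 2130)}

for ds in uncompressed:
    name, size = best[ds]
    print(ds,
          f"fix-len: {fixlen[ds] / uncompressed[ds]:.0%}",
          f"best ({name}): {size / uncompressed[ds]:.0%}")
# Netflix fix-len: 31% best (EF): 14%
# Yahoo Music fix-len: 35% best (Rice): 19%
```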
      <p>In terms of time efficiency, Table 2 illustrates the change
in performance when different memory constraints are
considered. Again, the observed trends are equivalent in both
datasets. Under the lightest memory constraints, the fastest
option is working with uncompressed data. However, under the
intermediate memory constraints, all the compression
techniques are faster than the uncompressed data. Finally, in the
setting with the heaviest memory constraints, the fix-len
option suffers heavily from the scarcity of available memory,
whereas the more space-efficient coding alternatives suffer a
milder time penalty. Among the compression approaches,
FOR stands out as the best one, being the fastest in every
setting and only slightly slower than the uncompressed option
in the setting with high memory availability.</p>
      <p>To better understand the slowness of the uncompressed
data and the simple fix-len coding under tight memory
constraints, we examined the performance characteristics of the
system in detail, and noticed two factors: the longer time spent
in the garbage collector and the inability to make full use of
the multi-threading capabilities of the system.</p>
      <p>
        On the one hand, these observations contrast
with prior work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in which CF data was primarily stored
in classic on-disk search indices. In that work, the use of
compression techniques always represented a speed-up in access
to CF data. As we observe, in the case of in-memory CF
data this is not necessarily so, provided
that the remaining memory is large enough to support the
auxiliary data structures and the variables required for the
computation of the algorithms. On the other hand, when
comparing the performance of the different compression
techniques, our results are in line with those of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for search
index compression, in which the FOR technique is also found
to be the best solution in terms of time efficiency.
      </p>
      <p>[Table 2: time efficiency of the none, fix-len, EF, Rice, ζ3 and FOR representations under the three heap-size settings; only the row labels are recoverable.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper we have conducted a study of the performance
of compression techniques for keeping CF data in the main
memory. We find that, as opposed to when the data is primarily
stored on disk, compression techniques in this case may
actually slow down the processing time when the remaining
memory available for the processing of CF algorithms is
large enough. Under more severe memory constraints, we
find that compression techniques are indeed able to speed
up the generation of recommendations. Finally, we find that
the FOR technique is the best compression approach, as its
space efficiency is on a par with the rest of the compared
alternatives while its time efficiency is clearly better than
theirs and close to that of using no compression.</p>
      <p>As part of future work, we envisage the use of hybrid
representations of CF data, in which the system may selectively
compress part of the data in order to maximise the
performance of the system under different memory constraints.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <article-title>Codes for the World Wide Web</article-title>
          .
          <source>Internet Mathematics</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Catena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>On inverted index compression for search engine efficiency</article-title>
          .
          <source>In ECIR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Efficient storage and retrieval by content and address of static files</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Universal codeword sets and representations of the integers</article-title>
          .
          <source>IEEE Trans. Inf. Theory</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Formoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cacheda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Carneiro</surname>
          </string-name>
          .
          <article-title>Using rating matrix compression techniques to speed up collaborative recommendations</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Shaft</surname>
          </string-name>
          .
          <article-title>Compressing relations and indexes</article-title>
          .
          <source>In ICDE</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          .
          <article-title>WTF: The who to follow service at Twitter</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Iacovou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suchak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bergstrom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>GroupLens: An open architecture for collaborative filtering of netnews</article-title>
          .
          <source>In CSCW</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rice</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Plaunt</surname>
          </string-name>
          .
          <article-title>Adaptive variable-length coding for efficient compression of spacecraft television data</article-title>
          .
          <source>Trans. Communication Technology</source>
          ,
          <volume>19</volume>
          (
          <issue>6</issue>
          ),
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rowstron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Donnelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>O'Shea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Douglas</surname>
          </string-name>
          .
          <article-title>Nobody ever got fired for using Hadoop on a cluster</article-title>
          .
          <source>In HotCDP</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Sorting out the document identi er assignment problem</article-title>
          .
          <source>In ECIR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>In-memory big data management and processing: A survey</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <volume>27</volume>
          (
          <issue>7</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>