=Paper=
{{Paper
|id=Vol-1441/recsys2015_poster2
|storemode=property
|title=Analysing Compression Techniques for In-Memory Collaborative Filtering
|pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster2.pdf
|volume=Vol-1441
|dblpUrl=https://dblp.org/rec/conf/recsys/VargasMO15
}}
==Analysing Compression Techniques for In-Memory Collaborative Filtering==
Analysing Compression Techniques for In-Memory Collaborative Filtering Saúl Vargas, Craig Macdonald and Iadh Ounis {firstname.lastname}@glasgow.ac.uk School of Computing Science, University of Glasgow ABSTRACT in the computation of recommendations and, consequently, Following the recent trend of in-memory data processing, it compression techniques consistently help in speeding up the is a usual practice to maintain collaborative filtering data recommendation algorithms. In our fully in-memory setting, in the main memory when generating recommendations in depending on the available memory, compression techniques academic and industrial recommender systems. In this paper, may speed up computing time as well, but may also slow it we study the impact of integer compression techniques for down. Furthermore, we explore a wider range of compression in-memory collaborative filtering data in terms of space and techniques, finding that the Frame of Reference (FOR) [6] time efficiency. Our results provide relevant observations technique offers the best solution in terms of space and time about when and how to compress collaborative filtering data. efficiency among the compared approaches. First, we observe that, depending on the memory constraints, compression techniques may speed up or slow down the per- 2. EXPERIMENTAL SETUP formance of state-of-the-art collaborative filtering algorithms. In order to analyse the effect of compression techniques Second, after comparing different compression techniques, we for CF data, we have conducted a series of experiments with find the Frame of Reference (FOR) technique to be the best two well known datasets: the dataset from the Netflix prize, option in terms of space and time efficiency under different containing 100 million ratings from 480,000 users to 17,770 memory constraints. films, and the Yahoo Music dataset, containing 717 million ratings from 1.8 million users to 136,736 songs. In both Categories and Subject Descriptors: H.3.3 [Informa- datasets, ratings are given in a scale between 1 to 5 stars. tion Storage & Retrieval]: Information Filtering These are, to our knowledge, two of the largest CF datasets Keywords: Recommender Systems, Collaborative Filtering, available for academic research purposes. Index Compression. Both datasets use numerical ids for identifying users and items. In the case of the Netflix dataset, we map the original 1. INTRODUCTION user ids consecutive ids determined by the numerical order of With the decrease of memory costs, having servers with the original ids. More elaborate id re-assignation techniques, hundreds of GBs of main memory is nowadays an affordable which may lead to further compression efficiency [2, 11], are option [10]. With such infrastructure, performing in-memory left for future work. The preferences of each user/item are data processing has become a common and feasible option represented with a list of sorted ids of items/users and a list in both single and multi-node environments. This trend can of numerical ratings. The lists of ids are compressed using a be observed in different areas of computing systems such variety of compression techniques: fixed-length coding (fix- as databases [12], search indices [2] and recommendation len), where each id is coded by using the minimum number engines [7], more specifically in recommendation engines of bits required to store the largest id, γ coding [4], Elias- relying on collaborative filtering (CF) techniques. Indeed, Fano (EF) [3], Rice [9], ζ3 coding [1] and Frame of Reference keeping CF data in memory is a usual practice, particularly in (FOR)1 [6]. With the exception of fixed-length and Elias- the publicly available datasets for research purposes. In real- Fano, id arrays are stored with delta-gaps. Rating values world datasets, however, CF data is usually several orders of in the 1-5 scale are simply compressed with fixed-length magnitude larger than the public datasets, and thus efficient coding, that is, using 3 bits to represent each rating. We use representations of in-memory CF data may be needed. As RankSys2 , a Recommender Systems framework written in CF data is commonly represented as lists of numerical ids of Java and, on top of it, we use the implementation of FOR in users and items, integer compression techniques [1, 3, 4, 6, the JavaFastPFOR3 package and the dsiutils4 package for 9] can be used to significantly reduce the amount of memory the rest of techniques. required for representing them. In order to measure the performance of the different com- In this work, we study the use of integer compression pression techniques in terms of time efficiency with respect techniques to compress CF data. Although there has been to uncompressed representations of the CF data, we generate prior work studying the benefits of compression techniques for CF data [5], that work focused on a scenario where the 1 Results with more sophisticated variations of FOR, such as data is mainly stored on disk. In that setting, data transfer PFOR, are omitted as they do not differ significantly from between disk and memory can be identified as a bottleneck those of the simpler, original FOR. 2 http://ir-uam.github.io/RankSys/ Copyright is held by the authors. 3 https://github.com/lemire/JavaFastPFOR RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria. 4 http://dsiutils.di.unimi.it/ Netflix Y Music Netflix Y Music Table 1: Memory us- 4.8 GB 2.4 GB 0.8 GB 32 GB 16 GB 6 GB none 1,608 11,486 age in MB of the Netflix fix-len 506 4,051 and Yahoo Music datasets none 43.96 109.43 - 100.74 267.66 - γ 298 2,871 fix-len 58.68 60.51 743.05 119.07 122.22 458.50 EF 249 2,190 with different compres- γ 63.03 64.00 71.13 131.64 128.27 131.49 Rice 241 2,130 sion techniques for user EF 64.51 67.71 74.20 134.44 130.74 137.54 ζ3 266 2,396 and item ids. Best results Rice 69.82 69.73 77.42 144.15 135.53 137.85 FOR 273 2,354 ζ3 63.91 66.65 74.81 129.71 145.08 142.80 in bold. FOR 48.79 50.43 55.36 111.33 106.80 108.92 recommendations for 1,000 users randomly selected in both datasets with the user-based nearest neighbours algorithm [8] Table 2: Execution time in seconds of the user-based provided by RankSys. For the purpose of simulating different nearest neighbours algorithm in the Netflix and Ya- memory constraints and their effect on the time efficiency hoo Music datasets with different compression tech- of the recommendations, we selected for both datasets three niques for user and item ids. Best results in bold. different settings for heap size (via the -Xmx parameter in that the remaining memory is large enough to support the Java): a heap size in which the uncompressed datasets fit auxiliary data structures and the variables required for the in memory without problem (4.8 GB for Netflix and 32 GB computation of the algorithms. On the other hand, when for Yahoo Music), a heap size slightly higher that what is comparing the performance of the different compression tech- required to keep the dataset in memory (2.4 GB and 16 niques, our results are in line with those of [2] for search GB) and, finally, a heap size slightly higher than what is index compression, in which the FOR technique is also found required to allocate the data with fix-len coding but where to be the best solution in terms of time efficiency. the uncompressed data cannot be allocated (0.8 GB and 6 GB). Moreover, recommendations were generated in a 4. CONCLUSIONS AND FUTURE WORK multi-threaded environment using 8 parallel threads. This In this paper we have conducted a study of the performance simulates a realistic high-demand environment where many of compression techniques for keeping CF data in the main of the benefits of caching are lost and there is a high demand memory. We find that compression techniques in this case, of memory for auxiliary data structures for the CF algorithm. as opposed to when the data is primarily stored on disk, may actually slow down the processing time when the remaining 3. EXPERIMENTAL EVALUATION memory available for the processing of CF algorithms is large enough. Under more severe memory constraints, we The results of the experimental setup previously described find that compression techniques are indeed able to speed are summarised in Table 1 and Table 2. The results in up the generation of recommendations. Finally, we find that terms of memory usage in Table 1 indicate similar trends for the FOR technique is the best compression approach as its both datasets: while a simple fix-len encoding is capable of space efficiency is at the same level of the rest of compared reducing notably the size of the CF data in memory (by one alternatives while its time efficiency is clearly better than third in both datasets), the rest of compression techniques these and close to that of using no compression. are able to further reduce the usage of memory (below 20% As part of future work, we envisage the usage of hybrid rep- in most cases), being EF and Rice the most space efficient resentations of CF data, in which the system may selectively among the compared alternatives. compress part of the data in order to maximise the perfor- In terms of time efficiency, Table 2 illustrates the change mance of the system under different memory constraints. in performance when different memory constraints are con- sidered. Again, the observed trends are equivalent in both 5. REFERENCES datasets. Under the lightest memory constraints, it can be ob- [1] P. Boldi and S. Vigna. Codes for the world wide web. Internet Mathematics, 2(4), 2005. served that the fastest option is working with uncompressed [2] M. Catena, C. Macdonald, and I. Ounis. On inverted index data. However, with the intermediate memory constraints, compression for search engine efficiency. In ECIR, 2014. all the compression techniques are faster than the uncom- [3] P. Elias. Efficient storage and retrieval by content and address pressed data. Finally, in the setting with the heaviest memory of static files. J. ACM, 21(2), 1974. [4] P. Elias. Universal codeword sets and representations of the constraints, the fix-len option heavily suffers the scarcity of integers. IEEE Trans. Inf. Theory, 21(2), 1975. available memory, whereas the rest of more space-efficient [5] V. Formoso, D. Fernández, F. Cacheda, and V. Carneiro. Using coding alternatives suffer a milder time penalty. Among the rating matrix compression techniques to speed up collaborative compression approaches, FOR stands out as the best one, recommendations. Inf. Ret., 16(6), 2013. [6] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing being the fastest approach in every setting and only slightly relations and indexes. In ICDE, 1998. slower than the uncompressed option in the setting with high [7] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. memory availability. WTF: The who to follow service at Twitter. In WWW, 2013. To understand better the slowness of the uncompressed [8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. Grouplens: An open architecture for collaborative filtering of data and the simple fix-len coding under tight memory con- netnews. In CSCW, 1994. straints, we observed in detail the performance characteristics [9] R. Rice and J. Plaunt. Adaptative variable-length coding for of the system, and noticed two factors: the higher time spent efficient compression of spacecraft television data. Trans. in the garbage collector and the inability to make full use of Communication Technology, 19(6), 1971. [10] A. Rowstron, D. Narayanan, A. Donnelly, G. O’Shea, and the multi-threading capabilities of the system. A. Douglas. Nobody ever got fired for using Hadoop on a On the one hand, these previous observations contrast cluster. In HotCDP, 2012. with prior work [5], in which CF data was primarily stored [11] F. Silvestri. Sorting out the document identifier assignment on classic disk search indices. In that work, the use of com- problem. In ECIR, 2007. [12] H. Zhang, G. Chen, B. Ooi, K. Tan, and M. Zhang. In-memory pression techniques always represented a speed up in access big data management and processing: A survey. IEEE TKDE, to CF data. As we observe, in the case of in-memory CF 27(7), 2015. data this situation does not necessarily happen, provided