=Paper= {{Paper |id=Vol-1441/recsys2015_poster2 |storemode=property |title=Analysing Compression Techniques for In-Memory Collaborative Filtering |pdfUrl=https://ceur-ws.org/Vol-1441/recsys2015_poster2.pdf |volume=Vol-1441 |dblpUrl=https://dblp.org/rec/conf/recsys/VargasMO15 }} ==Analysing Compression Techniques for In-Memory Collaborative Filtering== https://ceur-ws.org/Vol-1441/recsys2015_poster2.pdf
                            Analysing Compression Techniques
                            for In-Memory Collaborative Filtering

                                      Saúl Vargas, Craig Macdonald and Iadh Ounis
                                          {firstname.lastname}@glasgow.ac.uk
                                        School of Computing Science, University of Glasgow



ABSTRACT                                                                  in the computation of recommendations and, consequently,
Following the recent trend of in-memory data processing, it               compression techniques consistently help in speeding up the
is a usual practice to maintain collaborative filtering data              recommendation algorithms. In our fully in-memory setting,
in the main memory when generating recommendations in                     depending on the available memory, compression techniques
academic and industrial recommender systems. In this paper,               may speed up computing time as well, but may also slow it
we study the impact of integer compression techniques for                 down. Furthermore, we explore a wider range of compression
in-memory collaborative filtering data in terms of space and              techniques, finding that the Frame of Reference (FOR) [6]
time efficiency. Our results provide relevant observations                technique offers the best solution in terms of space and time
about when and how to compress collaborative filtering data.              efficiency among the compared approaches.
First, we observe that, depending on the memory constraints,
compression techniques may speed up or slow down the per-                 2.   EXPERIMENTAL SETUP
formance of state-of-the-art collaborative filtering algorithms.             In order to analyse the effect of compression techniques
Second, after comparing different compression techniques, we              for CF data, we have conducted a series of experiments with
find the Frame of Reference (FOR) technique to be the best                two well known datasets: the dataset from the Netflix prize,
option in terms of space and time efficiency under different              containing 100 million ratings from 480,000 users to 17,770
memory constraints.                                                       films, and the Yahoo Music dataset, containing 717 million
                                                                          ratings from 1.8 million users to 136,736 songs. In both
Categories and Subject Descriptors: H.3.3 [Informa-
                                                                          datasets, ratings are given in a scale between 1 to 5 stars.
tion Storage & Retrieval]: Information Filtering
                                                                          These are, to our knowledge, two of the largest CF datasets
Keywords: Recommender Systems, Collaborative Filtering,                   available for academic research purposes.
Index Compression.                                                           Both datasets use numerical ids for identifying users and
                                                                          items. In the case of the Netflix dataset, we map the original
1.    INTRODUCTION                                                        user ids consecutive ids determined by the numerical order of
   With the decrease of memory costs, having servers with                 the original ids. More elaborate id re-assignation techniques,
hundreds of GBs of main memory is nowadays an affordable                  which may lead to further compression efficiency [2, 11], are
option [10]. With such infrastructure, performing in-memory               left for future work. The preferences of each user/item are
data processing has become a common and feasible option                   represented with a list of sorted ids of items/users and a list
in both single and multi-node environments. This trend can                of numerical ratings. The lists of ids are compressed using a
be observed in different areas of computing systems such                  variety of compression techniques: fixed-length coding (fix-
as databases [12], search indices [2] and recommendation                  len), where each id is coded by using the minimum number
engines [7], more specifically in recommendation engines                  of bits required to store the largest id, γ coding [4], Elias-
relying on collaborative filtering (CF) techniques. Indeed,               Fano (EF) [3], Rice [9], ζ3 coding [1] and Frame of Reference
keeping CF data in memory is a usual practice, particularly in            (FOR)1 [6]. With the exception of fixed-length and Elias-
the publicly available datasets for research purposes. In real-           Fano, id arrays are stored with delta-gaps. Rating values
world datasets, however, CF data is usually several orders of             in the 1-5 scale are simply compressed with fixed-length
magnitude larger than the public datasets, and thus efficient             coding, that is, using 3 bits to represent each rating. We use
representations of in-memory CF data may be needed. As                    RankSys2 , a Recommender Systems framework written in
CF data is commonly represented as lists of numerical ids of              Java and, on top of it, we use the implementation of FOR in
users and items, integer compression techniques [1, 3, 4, 6,              the JavaFastPFOR3 package and the dsiutils4 package for
9] can be used to significantly reduce the amount of memory               the rest of techniques.
required for representing them.                                              In order to measure the performance of the different com-
   In this work, we study the use of integer compression                  pression techniques in terms of time efficiency with respect
techniques to compress CF data. Although there has been                   to uncompressed representations of the CF data, we generate
prior work studying the benefits of compression techniques
for CF data [5], that work focused on a scenario where the                1
                                                                            Results with more sophisticated variations of FOR, such as
data is mainly stored on disk. In that setting, data transfer             PFOR, are omitted as they do not differ significantly from
between disk and memory can be identified as a bottleneck                 those of the simpler, original FOR.
                                                                          2
                                                                            http://ir-uam.github.io/RankSys/
Copyright is held by the authors.                                         3
                                                                            https://github.com/lemire/JavaFastPFOR
RecSys 2015 Poster Proceedings, September 16-20, 2015, Vienna, Austria.   4
                                                                            http://dsiutils.di.unimi.it/
                                                                                         Netflix                   Y Music
           Netflix   Y Music    Table 1:        Memory us-
                                                                                 4.8 GB 2.4 GB 0.8 GB 32 GB 16 GB 6 GB
  none      1,608      11,486   age in MB of the Netflix
  fix-len     506       4,051   and Yahoo Music datasets               none      43.96   109.43      -    100.74 267.66   -
  γ           298       2,871                                          fix-len   58.68    60.51    743.05 119.07 122.22 458.50
  EF          249       2,190   with    different compres-             γ         63.03    64.00     71.13 131.64 128.27 131.49
  Rice        241       2,130   sion techniques for user               EF        64.51    67.71     74.20 134.44 130.74 137.54
  ζ3          266       2,396   and item ids. Best results             Rice      69.82    69.73     77.42 144.15 135.53 137.85
  FOR         273       2,354                                          ζ3        63.91    66.65     74.81 129.71 145.08 142.80
                                in bold.                               FOR       48.79    50.43     55.36 111.33 106.80 108.92
recommendations for 1,000 users randomly selected in both
datasets with the user-based nearest neighbours algorithm [8]     Table 2: Execution time in seconds of the user-based
provided by RankSys. For the purpose of simulating different      nearest neighbours algorithm in the Netflix and Ya-
memory constraints and their effect on the time efficiency        hoo Music datasets with different compression tech-
of the recommendations, we selected for both datasets three       niques for user and item ids. Best results in bold.
different settings for heap size (via the -Xmx parameter in       that the remaining memory is large enough to support the
Java): a heap size in which the uncompressed datasets fit         auxiliary data structures and the variables required for the
in memory without problem (4.8 GB for Netflix and 32 GB           computation of the algorithms. On the other hand, when
for Yahoo Music), a heap size slightly higher that what is        comparing the performance of the different compression tech-
required to keep the dataset in memory (2.4 GB and 16             niques, our results are in line with those of [2] for search
GB) and, finally, a heap size slightly higher than what is        index compression, in which the FOR technique is also found
required to allocate the data with fix-len coding but where       to be the best solution in terms of time efficiency.
the uncompressed data cannot be allocated (0.8 GB and
6 GB). Moreover, recommendations were generated in a
                                                                  4.     CONCLUSIONS AND FUTURE WORK
multi-threaded environment using 8 parallel threads. This            In this paper we have conducted a study of the performance
simulates a realistic high-demand environment where many          of compression techniques for keeping CF data in the main
of the benefits of caching are lost and there is a high demand    memory. We find that compression techniques in this case,
of memory for auxiliary data structures for the CF algorithm.     as opposed to when the data is primarily stored on disk, may
                                                                  actually slow down the processing time when the remaining
3.   EXPERIMENTAL EVALUATION                                      memory available for the processing of CF algorithms is
                                                                  large enough. Under more severe memory constraints, we
   The results of the experimental setup previously described
                                                                  find that compression techniques are indeed able to speed
are summarised in Table 1 and Table 2. The results in
                                                                  up the generation of recommendations. Finally, we find that
terms of memory usage in Table 1 indicate similar trends for
                                                                  the FOR technique is the best compression approach as its
both datasets: while a simple fix-len encoding is capable of
                                                                  space efficiency is at the same level of the rest of compared
reducing notably the size of the CF data in memory (by one
                                                                  alternatives while its time efficiency is clearly better than
third in both datasets), the rest of compression techniques
                                                                  these and close to that of using no compression.
are able to further reduce the usage of memory (below 20%
                                                                     As part of future work, we envisage the usage of hybrid rep-
in most cases), being EF and Rice the most space efficient
                                                                  resentations of CF data, in which the system may selectively
among the compared alternatives.
                                                                  compress part of the data in order to maximise the perfor-
   In terms of time efficiency, Table 2 illustrates the change
                                                                  mance of the system under different memory constraints.
in performance when different memory constraints are con-
sidered. Again, the observed trends are equivalent in both        5.     REFERENCES
datasets. Under the lightest memory constraints, it can be ob-     [1] P. Boldi and S. Vigna. Codes for the world wide web. Internet
                                                                       Mathematics, 2(4), 2005.
served that the fastest option is working with uncompressed        [2] M. Catena, C. Macdonald, and I. Ounis. On inverted index
data. However, with the intermediate memory constraints,               compression for search engine efficiency. In ECIR, 2014.
all the compression techniques are faster than the uncom-          [3] P. Elias. Efficient storage and retrieval by content and address
pressed data. Finally, in the setting with the heaviest memory         of static files. J. ACM, 21(2), 1974.
                                                                   [4] P. Elias. Universal codeword sets and representations of the
constraints, the fix-len option heavily suffers the scarcity of        integers. IEEE Trans. Inf. Theory, 21(2), 1975.
available memory, whereas the rest of more space-efficient         [5] V. Formoso, D. Fernández, F. Cacheda, and V. Carneiro. Using
coding alternatives suffer a milder time penalty. Among the            rating matrix compression techniques to speed up collaborative
compression approaches, FOR stands out as the best one,                recommendations. Inf. Ret., 16(6), 2013.
                                                                   [6] J. Goldstein, R. Ramakrishnan, and U. Shaft. Compressing
being the fastest approach in every setting and only slightly          relations and indexes. In ICDE, 1998.
slower than the uncompressed option in the setting with high       [7] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh.
memory availability.                                                   WTF: The who to follow service at Twitter. In WWW, 2013.
   To understand better the slowness of the uncompressed           [8] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl.
                                                                       Grouplens: An open architecture for collaborative filtering of
data and the simple fix-len coding under tight memory con-             netnews. In CSCW, 1994.
straints, we observed in detail the performance characteristics    [9] R. Rice and J. Plaunt. Adaptative variable-length coding for
of the system, and noticed two factors: the higher time spent          efficient compression of spacecraft television data. Trans.
in the garbage collector and the inability to make full use of         Communication Technology, 19(6), 1971.
                                                                  [10] A. Rowstron, D. Narayanan, A. Donnelly, G. O’Shea, and
the multi-threading capabilities of the system.                        A. Douglas. Nobody ever got fired for using Hadoop on a
   On the one hand, these previous observations contrast               cluster. In HotCDP, 2012.
with prior work [5], in which CF data was primarily stored        [11] F. Silvestri. Sorting out the document identifier assignment
on classic disk search indices. In that work, the use of com-          problem. In ECIR, 2007.
                                                                  [12] H. Zhang, G. Chen, B. Ooi, K. Tan, and M. Zhang. In-memory
pression techniques always represented a speed up in access            big data management and processing: A survey. IEEE TKDE,
to CF data. As we observe, in the case of in-memory CF                 27(7), 2015.
data this situation does not necessarily happen, provided