<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysing Compression Techniques for In-Memory Collaborative Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saúl Vargas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, University of Glasgow</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p>Following the recent trend of in-memory data processing, it is a usual practice to maintain collaborative filtering data in the main memory when generating recommendations in academic and industrial recommender systems. In this paper, we study the impact of integer compression techniques for in-memory collaborative filtering data in terms of space and time efficiency. Our results provide relevant observations about when and how to compress collaborative filtering data. First, we observe that, depending on the memory constraints, compression techniques may speed up or slow down the performance of state-of-the-art collaborative filtering algorithms. Second, after comparing different compression techniques, we find the Frame of Reference (FOR) technique to be the best option in terms of space and time efficiency under different memory constraints.</p>
      </abstract>
      <kwd-group>
        <kwd>Recommender Systems</kwd>
        <kwd>Collaborative Filtering</kwd>
        <kwd>Index Compression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        With the decrease of memory costs, having servers with
hundreds of GBs of main memory is nowadays an affordable
option [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. With such infrastructure, performing in-memory
data processing has become a common and feasible option
in both single- and multi-node environments. This trend can
be observed in different areas of computing systems such
as databases [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], search indices [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and recommendation
engines [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], more specifically in recommendation engines
relying on collaborative filtering (CF) techniques. Indeed,
keeping CF data in memory is a usual practice, particularly
with the publicly available datasets used for research purposes.
Real-world datasets, however, are usually several orders of
magnitude larger than the public ones, and thus efficient
representations of in-memory CF data may be needed. As
CF data is commonly represented as lists of numerical ids of
users and items, integer compression techniques [
        <xref ref-type="bibr" rid="ref1 ref3 ref4 ref6 ref9">1, 3, 4, 6,
9</xref>
        ] can be used to significantly reduce the amount of memory
required to represent them.
      </p>
      <p>
        In this work, we study the use of integer compression
techniques to compress CF data. Although there has been
prior work studying the benefits of compression techniques
for CF data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], that work focused on a scenario where the
data is mainly stored on disk. In that setting, data transfer
between disk and memory can be identified as the bottleneck
in the computation of recommendations and, consequently,
compression techniques consistently help in speeding up the
recommendation algorithms. In our fully in-memory setting,
depending on the available memory, compression techniques
may speed up computing time as well, but may also slow it
down. Furthermore, we explore a wider range of compression
techniques, finding that the Frame of Reference (FOR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
technique offers the best solution in terms of space and time
efficiency among the compared approaches.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. EXPERIMENTAL SETUP</title>
      <p>In order to analyse the effect of compression techniques
for CF data, we conducted a series of experiments with
two well-known datasets: the dataset from the Netflix prize,
containing 100 million ratings from 480,000 users to 17,770
films, and the Yahoo Music dataset, containing 717 million
ratings from 1.8 million users to 136,736 songs. In both
datasets, ratings are given on a scale between 1 and 5 stars.
These are, to our knowledge, two of the largest CF datasets
available for academic research purposes.</p>
      <p>
        Both datasets use numerical ids for identifying users and
items. In the case of the Netflix dataset, we map the original
user ids to consecutive ids determined by the numerical order of
the original ids. More elaborate id re-assignment techniques,
which may lead to further compression efficiency [
        <xref ref-type="bibr" rid="ref11 ref2">2, 11</xref>
        ], are
left for future work. The preferences of each user/item are
represented with a list of sorted ids of items/users and a list
of numerical ratings. The lists of ids are compressed using a
variety of compression techniques: fixed-length coding
(fix-len), where each id is coded using the minimum number
of bits required to store the largest id, δ coding [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ],
Elias-Fano (EF) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Rice [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], ζ3 coding [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Frame of Reference
(FOR) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Results with more sophisticated variations of FOR,
such as PFOR, are omitted as they do not differ significantly
from those of the simpler, original FOR. With the exception of
fixed-length and Elias-Fano, id arrays are stored with delta-gaps.
Rating values on the 1-5 scale are simply compressed with
fixed-length coding, that is, using 3 bits to represent each
rating. We use RankSys (http://ir-uam.github.io/RankSys/),
a recommender systems framework written in Java and, on top
of it, the implementation of FOR in the JavaFastPFOR package
(https://github.com/lemire/JavaFastPFOR) and the dsiutils
package (http://dsiutils.di.unimi.it/) for the rest of the
techniques.
      </p>
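      <p>To make the two simplest ingredients above concrete, the following sketch (a toy Python illustration of ours, not the RankSys or dsiutils code) shows delta-gaps over a sorted id list and fix-len packing with the bit width of the largest value:</p>

```python
def delta_gaps(ids):
    """Replace sorted ids by gaps; gaps are smaller, so they need fewer bits."""
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def prefix_sums(gaps):
    """Invert delta_gaps: cumulative sums recover the original ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

def fixlen_pack(values):
    """fix-len coding: every value is stored with the bit width of the largest."""
    width = max(max(values).bit_length(), 1)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * width)
    return packed, width

def fixlen_unpack(packed, width, n):
    mask = (1 << width) - 1
    return [(packed >> (i * width)) & mask for i in range(n)]

# A sorted list of item ids for one user:
ids = [3, 7, 12, 31, 32]
gaps = delta_gaps(ids)              # [3, 4, 5, 19, 1]
packed, width = fixlen_pack(gaps)   # width 5: the largest gap is 19
assert prefix_sums(fixlen_unpack(packed, width, len(gaps))) == ids
```

      <p>Note that fix-len over raw ids needs the bit width of the largest id, whereas coding the gaps lets the width shrink with the density of the list, which is the intuition behind the gap-based codes compared above.</p>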
      <p>In order to measure the performance of the different
compression techniques in terms of time efficiency with respect
to uncompressed representations of the CF data, we generate
recommendations for 1,000 users randomly selected in both
datasets with the user-based nearest neighbours algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
provided by RankSys. For the purpose of simulating different
memory constraints and their effect on the time efficiency
of the recommendations, we selected for both datasets three
different settings for heap size (via the -Xmx parameter in
Java): a heap size in which the uncompressed datasets fit
in memory without problem (4.8 GB for Netflix and 32 GB
for Yahoo Music), a heap size slightly higher than what is
required to keep the dataset in memory (2.4 GB and 16
GB) and, finally, a heap size slightly higher than what is
required to allocate the data with fix-len coding but where
the uncompressed data cannot be allocated (0.8 GB and
6 GB). Moreover, recommendations were generated in a
multi-threaded environment using 8 parallel threads. This
simulates a realistic high-demand environment where many
of the benefits of caching are lost and there is a high demand
for memory for auxiliary data structures for the CF algorithm.
      </p>
      <p>
Table 1: Memory usage in MB of the Netflix and Yahoo Music
datasets with different compression techniques for user and
item ids. Best results in bold.
none: 1,608 (Netflix), 11,486 (Yahoo Music)
fix-len: 506, 4,051
δ: 489, 2,871
EF: <bold>229</bold>, 2,190
Rice: 241, <bold>2,130</bold>
ζ3: 266, 2,396
FOR: 273, 2,354
      </p>
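      <p>As a point of reference for the results that follow, the FOR technique can be sketched as below (a minimal Python illustration of the idea in [6], not the JavaFastPFOR implementation, which additionally works on fixed-size blocks with word-aligned layouts): each block is stored as a base value plus bit-packed fixed-width offsets from that base.</p>

```python
def for_encode(block):
    """Frame of Reference: store a base (the block minimum) plus
    fixed-width offsets from that base, bit-packed together."""
    base = min(block)
    width = max(max(v - base for v in block).bit_length(), 1)
    packed = 0
    for i, v in enumerate(block):
        packed |= (v - base) << (i * width)
    return base, width, len(block), packed

def for_decode(base, width, n, packed):
    """Unpack the offsets and add the base back."""
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(n)]

# Values in a block may be large, but their spread is small:
block = [100000, 100003, 100007, 100012]
base, width, n, packed = for_encode(block)
assert (base, width) == (100000, 4)     # offsets 0, 3, 7, 12 fit in 4 bits
assert for_decode(base, width, n, packed) == block
```

      <p>In the setup above FOR is applied to the delta-gapped id lists, so decoding a list amounts to this base-plus-offset reconstruction followed by a prefix sum.</p>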
    </sec>
    <sec id="sec-3">
      <title>3. EXPERIMENTAL EVALUATION</title>
      <p>The results of the experimental setup previously described
are summarised in Table 1 and Table 2. The results in
terms of memory usage in Table 1 indicate similar trends for
both datasets: while a simple fix-len encoding already reduces
the size of the CF data in memory notably (to about a third
in both datasets), the rest of the compression techniques
are able to further reduce the memory usage (to below 20%
of the uncompressed size in most cases), with EF and Rice
being the most space-efficient among the compared
alternatives.</p>
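      <p>These proportions can be checked directly against the Table 1 figures (a quick back-of-the-envelope check using the reported memory sizes in MB):</p>

```python
# Memory usage in MB from Table 1 (uncompressed, fix-len, and the
# most space-efficient technique per dataset).
uncompressed = {"Netflix": 1608, "Yahoo Music": 11486}
fixlen = {"Netflix": 506, "Yahoo Music": 4051}
best = {"Netflix": ("EF", 229), "Yahoo Music": ("Rice", 2130)}

for ds in uncompressed:
    name, size = best[ds]
    print(ds,
          f"fix-len: {fixlen[ds] / uncompressed[ds]:.0%}",
          f"best ({name}): {size / uncompressed[ds]:.0%}")
# Netflix fix-len: 31% best (EF): 14%
# Yahoo Music fix-len: 35% best (Rice): 19%
```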
      <p>In terms of time efficiency, Table 2 illustrates the change
in performance when different memory constraints are
considered. Again, the observed trends are equivalent in both
datasets. Under the lightest memory constraints, the fastest
option is working with uncompressed data. However, under the
intermediate memory constraints, all the compression
techniques are faster than the uncompressed data. Finally, in the
setting with the heaviest memory constraints, the fix-len
option suffers heavily from the scarcity of available memory,
whereas the more space-efficient coding alternatives suffer a
milder time penalty. Among the compression approaches,
FOR stands out as the best one, being the fastest in every
setting and only slightly slower than the uncompressed option
in the setting with high memory availability.</p>
      <p>To better understand the slowness of the uncompressed
data and the simple fix-len coding under tight memory
constraints, we examined the performance characteristics of the
system in detail, and noticed two factors: the longer time spent
in the garbage collector and the inability to make full use of
the multi-threading capabilities of the system.</p>
      <p>
        On the one hand, these observations contrast
with prior work [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in which CF data was primarily stored
in classic on-disk search indices. In that work, the use of
compression techniques always represented a speed-up in access
to CF data. As we observe, in the case of in-memory CF
data this is not necessarily so, provided
that the remaining memory is large enough to support the
auxiliary data structures and the variables required for the
computation of the algorithms. On the other hand, when
comparing the performance of the different compression
techniques, our results are in line with those of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for search
index compression, in which the FOR technique is also found
to be the best solution in terms of time efficiency.
      </p>
      <p>[Table 2: time efficiency of the none, fix-len, EF, Rice, ζ3 and FOR representations under the three heap-size settings; only the row labels are recoverable.]</p>
    </sec>
    <sec id="sec-4">
      <title>4. CONCLUSIONS AND FUTURE WORK</title>
      <p>In this paper we have conducted a study of the performance
of compression techniques for keeping CF data in the main
memory. We find that, as opposed to when the data is primarily
stored on disk, compression techniques in this case may
actually slow down the processing time when the remaining
memory available for the processing of CF algorithms is
large enough. Under more severe memory constraints, we
find that compression techniques are indeed able to speed
up the generation of recommendations. Finally, we find that
the FOR technique is the best compression approach, as its
space efficiency is on a par with the rest of the compared
alternatives while its time efficiency is clearly better than
theirs and close to that of using no compression.</p>
      <p>As part of future work, we envisage the use of hybrid
representations of CF data, in which the system may selectively
compress part of the data in order to maximise the
performance of the system under different memory constraints.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Boldi</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <article-title>Codes for the World Wide Web</article-title>
          .
          <source>Internet Mathematics</source>
          ,
          <volume>2</volume>
          (
          <issue>4</issue>
          ),
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Catena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <article-title>On inverted index compression for search engine efficiency</article-title>
          .
          <source>In ECIR</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Efficient storage and retrieval by content and address of static files</article-title>
          .
          <source>J. ACM</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <year>1974</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Elias</surname>
          </string-name>
          .
          <article-title>Universal codeword sets and representations of the integers</article-title>
          .
          <source>IEEE Trans. Inf. Theory</source>
          ,
          <volume>21</volume>
          (
          <issue>2</issue>
          ),
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Formoso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cacheda</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Carneiro</surname>
          </string-name>
          .
          <article-title>Using rating matrix compression techniques to speed up collaborative recommendations</article-title>
          .
          <source>Inf. Retr.</source>
          ,
          <volume>16</volume>
          (
          <issue>6</issue>
          ),
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Goldstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>U.</given-names>
            <surname>Shaft</surname>
          </string-name>
          .
          <article-title>Compressing relations and indexes</article-title>
          .
          <source>In ICDE</source>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Zadeh</surname>
          </string-name>
          .
          <article-title>WTF: The who to follow service at Twitter</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P.</given-names>
            <surname>Resnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Iacovou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suchak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bergstrom</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Riedl</surname>
          </string-name>
          .
          <article-title>GroupLens: An open architecture for collaborative filtering of netnews</article-title>
          .
          <source>In CSCW</source>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rice</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Plaunt</surname>
          </string-name>
          .
          <article-title>Adaptive variable-length coding for efficient compression of spacecraft television data</article-title>
          .
          <source>Trans. Communication Technology</source>
          ,
          <volume>19</volume>
          (
          <issue>6</issue>
          ),
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rowstron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Narayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Donnelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>O'Shea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Douglas</surname>
          </string-name>
          .
          <article-title>Nobody ever got fired for using Hadoop on a cluster</article-title>
          .
          <source>In HotCDP</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <article-title>Sorting out the document identi er assignment problem</article-title>
          .
          <source>In ECIR</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <article-title>In-memory big data management and processing: A survey</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <volume>27</volume>
          (
          <issue>7</issue>
          ),
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>