Monolith: Real Time Recommendation System With
Collisionless Embedding Table
Zhuoran Liu1 , Leqi Zou1 , Xuan Zou1 , Caihua Wang1 , Biao Zhang1 , Da Tang1 , Bolin Zhu2,† ,
Yijie Zhu1 , Peng Wu1 , Ke Wang1 and Youlong Cheng1,‡
1 Bytedance Inc.
2 Fudan University


Abstract

Building a scalable and real-time recommendation system is vital for many businesses driven by time-sensitive customer feedback, such as short-video ranking or online ads. Despite the ubiquitous adoption of production-scale deep learning frameworks like TensorFlow or PyTorch, these general-purpose frameworks fall short of business demands in recommendation scenarios for various reasons: on one hand, tweaking systems based on static parameters and dense computations for recommendation with dynamic and sparse features is detrimental to model quality; on the other hand, such frameworks are designed with the batch-training stage and the serving stage completely separated, preventing the model from interacting with customer feedback in real-time. These issues led us to reexamine traditional approaches and explore radically different design choices. In this paper, we present Monolith, a system tailored for online training. Our design has been driven by observations of our application workloads and production environment that reflect a marked departure from other recommendation systems. Our contributions are manifold: first, we crafted a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint; second, we provide a production-ready online training architecture with high fault tolerance; finally, we proved that system reliability can be traded off for real-time learning. Monolith has successfully landed in the BytePlus Recommend product.


ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA
† Work done during internship at Bytedance Inc.
‡ Corresponding author: youlong.cheng@bytedance.com (Y. Cheng)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The past decade witnessed a boom of businesses powered by recommendation techniques. In pursuit of a better customer experience, delivering personalized content to each individual user as a real-time response is a common goal of these business applications. To this end, information from a user's latest interactions is often used as the primary input for training a model, as it best depicts a user's portrait and predicts the user's interest and future behaviors.

Deep learning has been dominating recommendation models [1, 2, 3, 4, 5, 6], as the gigantic amount of user data is a natural fit for massively data-driven neural models. However, efforts to leverage the power of deep learning in industry-level recommendation systems constantly run into problems arising from the unique characteristics of data derived from real-world user behavior. These data are drastically different from those used in conventional deep learning problems like language modeling or computer vision in two aspects:

    1. The features are mostly sparse, categorical and dynamically changing;
    2. The underlying distribution of training data is non-stationary, a.k.a. Concept Drift [7].

Such differences have posed unique challenges to researchers and engineers working on recommendation systems.

1.1. Sparsity and Dynamism

The data for recommendation mostly contain sparse categorical features, some of which appear with low frequency. The common practice of mapping them to a high-dimensional embedding space gives rise to a series of issues:

    • Unlike language models, where the number of word-pieces is limited, the number of users and ranking items is orders of magnitude larger. Such an enormous embedding table would hardly fit into single host memory;
    • Worse still, the size of the embedding table is expected to grow over time as more users and items are admitted, while frameworks like [8, 9] use fixed-size dense variables to represent the embedding table.

In practice, many systems adopt low-collision hashing [1, 10] as a way to reduce memory footprint and to allow the set of IDs to grow. This relies on an over-idealistic assumption that IDs in the embedding table are distributed evenly in frequency and that collisions are harmless to model quality. Unfortunately, this is rarely true for a real-world recommendation system, where a small group of users or items have significantly more occurrences. With the organic growth of the embedding table, the chance of hash key collisions increases and leads to deterioration of model quality [10].
Figure 1: Monolith Online Training Architecture. (Diagram. Batch training stage: training workers read historical batch data, look up feature embeddings on the training PS, run the model forward and backward pass, and apply gradient updates to the training PS. Online training stage: workers consume online streaming data of features and user reactions and update the training PS; parameters are periodically synchronized from the training PS to the serving PS, which performs embedding lookups and the model forward pass for user requests and returns ranking results, while user actions flow back into the data stream.)



Therefore it is a natural demand for production-scale recommendation systems to have the capacity to capture as many features as possible in their parameters, and also the capability to elastically adjust the number of users and items they book-keep.

1.2. Non-stationary Distribution

Visual and linguistic patterns barely develop on a time scale of centuries, while the same user interested in one topic could shift their zeal every next minute. As a result, the underlying distribution of user data is non-stationary, a phenomenon commonly referred to as Concept Drift [7].

Intuitively, information from a more recent history contributes more effectively to predicting changes in a user's behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real-time as possible to reflect a user's latest interest.

In light of these distinctions and in observation of issues that arise from our production, we designed Monolith, a large-scale recommendation system that addresses these pain points. We did extensive experiments to verify and iterate our design in the production environment. Monolith is able to:

    1. Provide full expressive power for sparse features by designing a collisionless hash table and a dynamic feature eviction mechanism;
    2. Loop serving feedback back to training in real-time with online training.

Empowered by these architectural capacities, Monolith consistently outperforms systems that adopt hashing tricks with collisions while using roughly similar memory, and achieves state-of-the-art online serving AUC without overly burdening our servers' computation power.

The rest of the paper is organized as follows. We first elaborate the design details of how Monolith tackles the existing challenges with a collisionless hash table and realtime training in Section 2. Experiments and results are presented in Section 3, along with production-tested conclusions and a discussion of the trade-offs between time-sensitivity, reliability and model quality. Section 4 summarizes related work and compares it with Monolith. Section 5 concludes this work.
2. Design

The overall architecture of Monolith generally follows TensorFlow's distributed Worker-ParameterServer setting (Figure 2). In a Worker-PS architecture, machines are assigned different roles: Worker machines are responsible for performing computations as defined by the graph, and PS machines store parameters and update them according to gradients computed by Workers.

Figure 2: Worker-PS Architecture. (Diagram: a Client and a Master coordinate multiple Worker machines and multiple Parameter Server machines.)

In recommendation models, parameters are categorized into two sets: dense and sparse. Dense parameters are the weights/variables in a deep neural network, and sparse parameters refer to the embedding tables that correspond to sparse features. In our design, both dense and sparse parameters are part of the TensorFlow Graph and are stored on parameter servers.

Similar to TensorFlow's Variable for dense parameters, we designed a set of highly efficient, collisionless, and flexible HashTable operations for sparse parameters. As a complement to the TensorFlow limitation that arises from the separation of training and inference, Monolith's elastically scalable online training is designed to efficiently synchronize parameters from the training PS to the online serving PS within short intervals, with model robustness guaranteed by a fault tolerance mechanism.

2.1. Hash Table

A first principle in our design of sparse parameter representation is to avoid cramming information from different IDs into the same fixed-size embedding. Simulating a dynamic-size embedding table with an out-of-the-box TensorFlow Variable inevitably leads to ID collisions, which worsen as new IDs arrive and the table grows. Therefore, instead of building upon Variable, we developed a new key-value HashTable for our sparse parameters.

Our HashTable utilizes a Cuckoo Hashmap [11] under the hood, which supports inserting new keys without colliding with existing ones. Cuckoo Hashing achieves worst-case O(1) time complexity for lookups and deletions, and an expected amortized O(1) time for insertions. As illustrated in Figure 3, it maintains two tables T0 and T1 with different hash functions h0(x) and h1(x), and an element is stored in exactly one of them. When trying to insert an element A into T0, it first attempts to place A at h0(A); if h0(A) is occupied by another element B, it evicts B from T0 and tries to insert B into T1 with the same logic. This process is repeated until all elements stabilize, or a rehash happens when the insertion runs into a cycle.

Figure 3: Cuckoo HashMap. (Diagram: two tables T0 and T1 with hash functions h0 and h1; inserting A at h0(A) evicts the current occupant, which is re-inserted into the other table, and so on.)
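To make the eviction loop above concrete, here is a minimal, self-contained Python sketch of cuckoo insertion with two tables and two hash functions. It is only an illustration of the technique: the class name, the use of Python's built-in hash as a stand-in for h0/h1, and the grow-and-rehash policy are our own assumptions, not Monolith's implementation.

    class CuckooTable:
        """Toy cuckoo hash map: two arrays, two hash functions, kick-out insertion."""

        def __init__(self, capacity=8, max_kicks=32):
            self.capacity = capacity
            self.max_kicks = max_kicks                    # give up and rehash after this many evictions
            self.tables = [[None] * capacity, [None] * capacity]

        def _slot(self, which, key):
            # Salting the built-in hash with the table index gives two distinct hash functions.
            return hash((which, key)) % self.capacity

        def get(self, key):
            # Worst-case O(1): a key can only live in one of two slots.
            for which in (0, 1):
                entry = self.tables[which][self._slot(which, key)]
                if entry is not None and entry[0] == key:
                    return entry[1]
            return None

        def put(self, key, value):
            item = (key, value)
            which = 0
            for _ in range(self.max_kicks):
                slot = self._slot(which, item[0])
                occupant = self.tables[which][slot]
                self.tables[which][slot] = item           # place item, possibly evicting the occupant
                if occupant is None or occupant[0] == item[0]:
                    return
                item, which = occupant, 1 - which         # re-insert the evicted entry into the other table
            self._grow_and_rehash(item)                   # insertion ran into a cycle

        def _grow_and_rehash(self, pending):
            entries = [e for table in self.tables for e in table if e is not None] + [pending]
            self.capacity *= 2
            self.tables = [[None] * self.capacity, [None] * self.capacity]
            for k, v in entries:
                self.put(k, v)

In Monolith the values would be embedding vectors keyed by feature IDs; because every key occupies exactly one slot, no two IDs ever share an embedding.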
Memory footprint reduction is also an important consideration in our design. A naive approach of inserting every new ID into the HashTable would deplete memory quickly. Observation of real production models leads to two conclusions:

    1. IDs that appear only a handful of times have limited contribution to improving model quality. An important observation is that IDs are long-tail distributed, where popular IDs may occur millions of times while unpopular ones appear no more than ten times. Embeddings corresponding to these infrequent IDs are underfit due to lack of training data, and the model cannot make a good estimation based on them. In the end these IDs are unlikely to affect the result, so model quality does not suffer from the removal of these low-occurrence IDs;
    2. Stale IDs from a distant history seldom contribute to the current model, as many of them are never visited again. This is possibly due to a user that is no longer active, or a short-video that is out-of-date. Storing embeddings for these IDs does not help the model in any way, but drains our PS memory in vain.

Based on these observations, we designed several feature ID filtering heuristics for a more memory-efficient implementation of HashTable:

    1. IDs are filtered before they are admitted into embedding tables. We have two filtering methods: first, we filter IDs by their number of occurrences before they are inserted as keys, where the occurrence threshold is a tunable hyperparameter that varies for each model; in addition, we utilize a probabilistic filter which helps further reduce memory usage;
    2. IDs are timed and set to expire after being inactive for a predefined period of time. The expire time is also tunable for each embedding table, to allow for distinguishing features with different sensitivity to historical information.

In our implementation, HashTable is realized as a TensorFlow resource operation. Similar to Variable, look-ups and updates are also implemented as native TensorFlow operations for easier integration and better compatibility.
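As a rough illustration of the two heuristics, the sketch below wraps an embedding dictionary with occurrence-based admission and inactivity-based expiry. The class name, parameter names and default thresholds are hypothetical; Monolith implements this logic inside its TensorFlow HashTable operations rather than in Python.

    import time

    class FilteredEmbeddingTable:
        """Sketch: ID admission by occurrence count plus expiry of inactive IDs."""

        def __init__(self, dim, occurrence_threshold=5, expire_after_s=30 * 24 * 3600):
            self.dim = dim
            self.occurrence_threshold = occurrence_threshold   # tunable per model
            self.expire_after_s = expire_after_s               # tunable per embedding table
            self.counts = {}        # occurrences of IDs that have not been admitted yet
            self.embeddings = {}    # admitted ID -> embedding vector (collisionless key-value store)
            self.last_seen = {}     # admitted ID -> timestamp of last access

        def lookup(self, feature_id, now=None):
            now = time.time() if now is None else now
            if feature_id in self.embeddings:
                self.last_seen[feature_id] = now
                return self.embeddings[feature_id]
            # Not admitted yet: count this occurrence, admit once the threshold is reached.
            self.counts[feature_id] = self.counts.get(feature_id, 0) + 1
            if self.counts[feature_id] >= self.occurrence_threshold:
                del self.counts[feature_id]
                self.embeddings[feature_id] = [0.0] * self.dim   # freshly initialized embedding
                self.last_seen[feature_id] = now
                return self.embeddings[feature_id]
            return [0.0] * self.dim   # serve a default embedding until the ID is admitted

        def evict_expired(self, now=None):
            now = time.time() if now is None else now
            for feature_id, seen in list(self.last_seen.items()):
                if now - seen > self.expire_after_s:
                    self.embeddings.pop(feature_id, None)
                    self.last_seen.pop(feature_id, None)

The probabilistic filter mentioned above could be layered on top of this by admitting a previously unseen ID only with some probability, instead of (or in addition to) the hard count threshold.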
Figure 4: Streaming Engine. The information feedback loop from [User → Model Server → Training Worker → Model Server → User] takes a long time when it goes through the Batch Training path, while Online Training closes the loop almost instantly. (Diagram: user actions such as click, like and convert are logged to a Log Kafka queue and generated features to a Feature Kafka queue; a Flink joiner job produces a Training Example Kafka queue, which is consumed directly for online training and dumped to HDFS as batch training data; parameters are synchronized from the Training PS to the Serving PS.)


2.2. Online Training

In Monolith, training is divided into two stages (Figure 1):

    1. Batch training stage. This stage works as an ordinary TensorFlow training loop: in each training step, a training worker reads one mini-batch of training examples from storage, requests parameters from the PS, computes a forward and a backward pass, and finally pushes the updated parameters to the training PS. Slightly different from other common deep learning tasks, we only train on our dataset for one pass. Batch training is useful for training on historical data when we modify our model architecture and retrain the model;
    2. Online training stage. After a model is deployed to online serving, training does not stop but enters the online training stage. Instead of reading mini-batch examples from storage, a training worker consumes realtime data on-the-fly and updates the training PS. The training PS periodically synchronizes its parameters to the serving PS, which takes effect on the user side immediately. This enables our model to interactively adapt itself according to a user's feedback in real-time.

2.2.1. Streaming Engine

Monolith is built with the capability of seamlessly switching between batch training and online training. This is enabled by our design of the streaming engine, as illustrated in Figure 4.

In our design, we use one Kafka [12] queue to log user actions (e.g., clicking on an item or liking an item) and another Kafka queue for features. At the core of the engine is a Flink [13] streaming job for online feature joining. The online joiner concatenates features with labels from user actions and produces training examples, which are then written to a Kafka queue. This queue of training examples is consumed by both online training and batch training:

    • For online training, the training worker directly reads data from the Kafka queue;
    • For batch training, a data dumping job first dumps the data to HDFS [14]; after the data in HDFS has accumulated to a certain amount, a training worker retrieves it from HDFS and performs batch training.

Updated parameters in the training PS are pushed to the serving PS according to the parameter synchronization schedule.
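The split between the two consumption paths can be pictured with a small sketch. Everything here is schematic: example_stream stands in for a Kafka consumer over the training-example queue, and dump_dir for the location the dumping job writes to on HDFS; neither name comes from the paper.

    import glob
    import os

    def run_training_worker(mode, train_step, example_stream=None, dump_dir=None):
        """Consume training examples either from the stream (online) or from the dump (batch)."""
        if mode == "online":
            # Online training: examples are read from the queue as soon as the joiner emits them.
            for example in example_stream:
                train_step([example])
        elif mode == "batch":
            # Batch training: a dumping job has accumulated the same examples as files.
            for path in sorted(glob.glob(os.path.join(dump_dir, "*.records"))):
                with open(path, "rb") as f:
                    train_step(f.read().splitlines())
        else:
            raise ValueError(f"unknown mode: {mode!r}")

Because both paths consume the same joined examples, switching a model from batch training to online training does not change the training code, only where the worker reads from.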
Figure 5: Online Joiner. (Diagram: user actions from the Log Kafka queue are joined with features from the Feature Kafka queue; features are first looked up in an in-memory cache and read from an on-disk KV-store on a cache miss; joined examples pass through negative sampling and are written to the Training Example Kafka queue.)



2.2.2. Online Joiner

In real-world applications, user action logs and features are streamed into the online joiner (Figure 5) with no guarantee on time order. Therefore we use a unique key for each request so that user actions and features can be correctly paired up.

The lag of user actions can also be a problem. For example, a user may take a few days to decide whether to buy an item they were presented days ago. This is a challenge for the joiner, because keeping all features in cache would simply not fit in memory. In our system, an on-disk key-value storage is utilized to store features that have been waiting for more than a certain time period. When a user action log arrives, the joiner first looks up the in-memory cache, and then looks up the key-value storage in case of a cache miss.
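A minimal sketch of that lookup order is given below; mem_cache and disk_kv are any dict-like stores keyed by the per-request unique key, and all names are illustrative rather than Monolith's actual interfaces.

    def join_user_action(action, mem_cache, disk_kv):
        """Pair one user-action log entry with the features of the request that produced it."""
        key = action["request_key"]                 # unique key shared by the action and its features
        features = mem_cache.get(key)               # fast path: recent requests are still in memory
        if features is None:
            features = disk_kv.get(key)             # slow path: older features were moved to disk
        if features is None:
            return None                             # features not arrived yet (or already expired)
        label = 1 if action["is_positive"] else 0
        return {"features": features, "label": label}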
Another problem that arises in real-world applications is that the distribution of negative and positive examples is highly uneven: the number of the former can be orders of magnitude higher than the latter. To prevent positive examples from being overwhelmed by negative ones, a common strategy is negative sampling. This certainly changes the underlying distribution of the trained model, tweaking it towards a higher probability of making positive predictions. As a remedy, we apply log odds correction [15] during serving, making sure that the online model is an unbiased estimator of the original distribution.
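The paper does not spell out the correction formula; one standard form, assuming each negative example is kept with probability w during sampling (an assumption on our part), follows from comparing the odds before and after sampling:

    odds_sampled(x) = P(y=1 | x) / (w · P(y=0 | x)) = odds_true(x) / w,

so at serving time the model's logit is shifted back by log(w):

    logit_true(x) = logit_sampled(x) + log(w).

Since w < 1, the correction lowers the predicted probability, undoing the bias toward positive predictions that the sampling introduced.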
2.2.3. Parameter Synchronization

During online training, the Monolith training cluster keeps receiving data from the online serving module and updates parameters on the training PS. A crucial step in enabling the online serving PS to benefit from these newly trained parameters is the synchronization of the updated model parameters. In the production environment, we encounter several challenges:

    • Models on the online serving PS must not stop serving while being updated. Our models in production are usually several terabytes in size, so replacing all parameters takes a while. It would be intolerable to stop an online PS from serving the model during the replacement process; updates must be made on-the-fly;
    • Transferring a multi-terabyte model in its entirety from the training PS to the online serving PS would put huge pressure on both the network bandwidth and the memory on the PS, since it requires double the model size in memory to accept the newly arriving model.

For online training to scale up to the size of our business scenario, we designed an incremental, on-the-fly, periodic parameter synchronization mechanism in Monolith, based on several noticeable characteristics of our models:

    1. Sparse parameters dominate the size of recommendation models;
    2. Within a short time window, only a small subset of IDs gets trained and has their embeddings updated;
    3. Dense variables move much more slowly than sparse embeddings. This is because in momentum-based optimizers, the accumulation of momentum for dense variables is magnified by the gigantic size of the recommendation training data, while only a few sparse embeddings receive updates in a single data batch.

(1) and (2) allow us to exploit the sparsity of updates across all feature IDs. In Monolith, we maintain a hash set of touched keys, representing IDs whose embeddings have been trained since the last parameter synchronization. We push the subset of sparse parameters whose keys are in the touched-keys set from the training PS to the online serving PS at a minute-level interval. This relatively small pack of incremental parameter updates is lightweight for network transmission and does not cause a sharp memory spike during synchronization.
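The touched-keys bookkeeping can be sketched in a few lines. The class below is illustrative only (the names and the dict-like training/serving tables are our assumptions); in Monolith this logic lives inside the parameter servers and the synchronization service.

    class IncrementalSparseSync:
        """Sketch of minute-level incremental sync: push only embeddings touched since the last sync."""

        def __init__(self, training_table, serving_table):
            self.training_table = training_table    # dict-like: feature ID -> embedding (training PS)
            self.serving_table = serving_table      # dict-like: feature ID -> embedding (serving PS)
            self.touched_keys = set()               # IDs trained since the last synchronization

        def on_embedding_update(self, feature_id):
            # Called whenever a gradient is applied to a sparse embedding.
            self.touched_keys.add(feature_id)

        def sync(self):
            # Swap out the touched set first so training can keep recording new updates.
            keys, self.touched_keys = self.touched_keys, set()
            for feature_id in keys:
                # Replace entries in place: small payload, no full-model transfer,
                # and the serving table never stops serving.
                self.serving_table[feature_id] = self.training_table[feature_id]
            return len(keys)

Run at a minute-level interval, the payload is bounded by the number of touched IDs rather than by the full multi-terabyte table.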
We also exploit (3) to further reduce network I/O and memory usage by setting a more aggressive sync schedule for sparse parameters, while updating dense parameters less frequently. This can leave us in a situation where the dense parameters we serve are a relatively stale version compared to the sparse part. However, such inconsistency can be tolerated for the reason mentioned in (3), as no conspicuous loss has been observed.
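Continuing the sketch above, the two schedules might be driven as follows. The minute-level sparse interval comes from 2.2.3 and the daily dense interval from the schedule described later in Section 3.2.2; the scheduling loop itself is our illustration, not Monolith's code.

    import time

    def run_sync_loop(sync_sparse, sync_dense, sparse_interval_s=60, dense_interval_s=24 * 3600):
        """Aggressive minute-level sync for sparse parameters, much rarer sync for dense ones."""
        last_dense = time.time()
        while True:
            time.sleep(sparse_interval_s)
            sync_sparse()                          # touched-key incremental push (small payload)
            if time.time() - last_dense >= dense_interval_s:
                sync_dense()                       # full dense-variable push, tolerating staleness
                last_dense = time.time()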
2.3. Fault Tolerance

As a system in production, Monolith is designed with the ability to recover a PS in case it fails. A common choice for fault tolerance is to snapshot the state of a model periodically, and to recover from the latest snapshot when a PS failure is detected. The choice of snapshot frequency has two major impacts:

    1. Model quality. Intuitively, model quality suffers less from the loss of recent history with increased snapshot frequency.
    2. Computation overhead. Snapshotting a multi-terabyte model is not free. It incurs large chunks of memory copy and disk I/O.

As a trade-off between model quality and computation overhead, Monolith snapshots all training PS every day. Though a PS will lose one day's worth of updates in case of a failure, we discovered through our experiments that the performance degradation is tolerable. We analyze the effect of PS reliability in the next section.

3. Evaluation

For a better understanding of the benefits and trade-offs brought about by our proposed design, we conducted several experiments at production scale, as well as an A/B test with live serving traffic, to evaluate and verify Monolith from different aspects. We aim to answer the following questions with our experiments:

    1. How much can we benefit from a collisionless HashTable?
    2. How important is realtime online training?
    3. Is Monolith's design of parameter synchronization robust enough in a large-scale production scenario?

In this section, we first present our experimental settings and then discuss the results and our findings in detail.

Figure 6: DeepFM model architecture. (Diagram: sparse IDs are mapped to embeddings, which feed an FM component and an MLP; their outputs are combined into the final prediction.)

3.1. Experimental Setup

3.1.1. Embedding Table

As described in Section 2.1, embedding tables in Monolith are implemented as collisionless HashTables. To prove the necessity of avoiding collisions in embedding tables and to quantify the gains from our collisionless implementation, we performed two groups of experiments, on the MovieLens dataset and on our internal production dataset respectively:

    1. MovieLens ml-25m dataset [16]. This is a standard public dataset for movie ratings, containing 25 million ratings that involve approximately 162,000 users and 62,000 movies.
        • Preprocessing of labels. The original labels are ratings from 0.5 to 5.0, while in production our tasks mostly receive binary signals from users. To better simulate our production models, we convert the scale labels to binary labels by treating scores ≥ 3.5 as positive samples and the rest as negative samples.
        • Model and metrics. We implemented a standard DeepFM [17] model, a commonly used model architecture for recommendation problems. It consists of an FM component and a dense component (Figure 6). Predictions are evaluated by AUC [18], as this is the major measurement for real models.
        • Embedding collisions. This dataset contains approximately 160K user IDs and 60K movie IDs. To compare with the collisionless embedding table implementation, we performed another group of experiments where IDs are first preprocessed with MD5 hashing and then mapped to a smaller ID space. As a result, some IDs share their embedding with others. Table 1 shows detailed statistics of user and movie IDs before and after hashing.

                          User IDs    Movie IDs
    # Before Hashing        162541        59047
     # After Hashing        149970        57361
      Collision rate         7.73%        2.86%

Table 1
Statistics of IDs Before and After Hashing.
    2. Internal recommendation dataset. We also performed experiments on a recommendation model in a production environment. This model generally follows a multi-tower architecture, with each tower responsible for learning to predict a specialized kind of user behavior.
        • Each model has around 1000 embedding tables, and the distribution of embedding table sizes is very uneven;
        • The original ID space of the embedding tables is 2^48. In our baseline, we applied a hashing trick by decomposition to curb the size of the embedding tables. To be more specific, we use two smaller embedding tables instead of one gigantic table to generate a unique embedding for each ID by vector combination:

              ID_r = ID % 2^24
              ID_q = ID ÷ 2^24
              E = E_r + E_q,

          where E_r and E_q are the embeddings corresponding to ID_r and ID_q. This effectively reduces the embedding table size from 2^48 to 2^25 rows;
        • This model is serving in real production, and the performance of this experiment is measured by online AUC with real serving traffic.
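For clarity, the baseline's decomposition can be written out directly; the helper names and the indexable-table arguments below are stand-ins for the two 2^24-row embedding tables, not code from the paper.

    def decompose_id(feature_id, bits=24):
        """Split an ID from the 2^48 space into the two 24-bit indices used by the hashed baseline."""
        id_r = feature_id % (1 << bits)     # ID_r = ID % 2^24  (low 24 bits)
        id_q = feature_id >> bits           # ID_q = ID ÷ 2^24  (high 24 bits)
        return id_r, id_q

    def baseline_embedding(feature_id, table_r, table_q):
        """E = E_r + E_q: combine the two small-table embeddings into one vector.

        table_r and table_q are any indexable embedding tables with 2^24 rows each,
        so the baseline stores 2 x 2^24 = 2^25 rows instead of 2^48.
        """
        id_r, id_q = decompose_id(feature_id)
        return table_r[id_r] + table_q[id_q]

Every ID maps to a distinct (ID_r, ID_q) pair, but IDs that share either half also share part of their representation; this loss of expressiveness is exactly what the collisionless HashTable avoids.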
3.1.2. Online Training

During online training, we update our online serving PS with the latest set of parameters at minute-level intervals. We designed two groups of experiments to verify model quality and system robustness.

    1. Update frequency. To investigate the necessity of minute-level update frequency, we conducted experiments that synchronize parameters from the training model to the prediction model at different intervals.
       The dataset we use is the Criteo Display Ads Challenge dataset (https://www.kaggle.com/competitions/criteo-display-ad-challenge/data), a large-scale standard dataset for benchmarking CTR models. It contains 7 days of chronologically ordered data recording features and click actions. For this experiment, we use a standard DeepFM [17] model as described in Figure 6.
       To simulate online training, we did the following preprocessing of the dataset. We take the 7 days of data and split them into two parts: 5 days of data for batch training, and 2 days for online training. We further split the 2 days of data into N shards chronologically. Online training is simulated by Algorithm 1.
       In this way, we simulate synchronizing trained parameters to the online serving PS with an interval determined by the number of data shards. We experimented with N = 10, 50, 100, which roughly correspond to update intervals of 5 hr, 1 hr, and 30 min.
    2. Live experiment. In addition, we also performed a live experiment with real serving traffic to further demonstrate the importance of online training in a real-world application. This A/B experiment compares online training to batch training on one of our Ads models in production.
Algorithm 1 Simulated Online Training
 1: Input: D_batch                              /* Data for batch training. */
 2: Input: D_online_i, i = 1…N                  /* Data for online training, split into N shards. */
 3: θ_train ← Train(D_batch, θ_train)           /* Batch training. */
    /* Online training. */
 4: for i = 1…N do
 5:     θ_serve ← θ_train                       /* Sync training parameters to serving model. */
 6:     AUC_i ← Evaluate(θ_serve, D_online_i)   /* Evaluate online prediction on new data. */
 7:     θ_train ← Train(D_online_i, θ_train)    /* Train with new data. */
 8: end for
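Algorithm 1 translates almost line for line into Python. In the sketch below, train and evaluate are placeholders for the real training and evaluation routines, and initial_params for the parameters produced by model initialization; none of these names come from the paper.

    def simulated_online_training(batch_data, online_shards, initial_params, train, evaluate):
        """Transcription of Algorithm 1.

        train(params, data) returns updated parameters; evaluate(params, data) returns an AUC.
        online_shards must be ordered chronologically.
        """
        theta_train = train(initial_params, batch_data)        # line 3: batch training
        auc_per_shard = []
        for shard in online_shards:                             # lines 4-8: online training
            theta_serve = theta_train                           # line 5: sync training params to serving model
            auc_per_shard.append(evaluate(theta_serve, shard))  # line 6: evaluate on new data
            theta_train = train(theta_train, shard)             # line 7: then train with the new data
        return auc_per_shard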



3.2. Results and Analysis

3.2.1. The Effect of Embedding Collision

Results from the MovieLens dataset and from the internal recommendation dataset both show that embedding collisions jeopardize model quality.

    1. Models with a collisionless HashTable consistently outperform those with collisions. This conclusion holds true regardless of:
        • Increase of the number of training epochs. As shown in Figure 7, the model with a collisionless embedding table has higher AUC from the first epoch and converges at a higher value;
        • Change of distribution with the passage of time due to Concept Drift. As shown in Figure 8, models with a collisionless embedding table are also robust as time passes and the user/item context changes.
    2. Data sparsity caused by a collisionless embedding table does not lead to model overfitting. As shown in Figure 7, a model with a collisionless embedding table does not overfit after it converges.

Figure 7: Effect of Embedding Collision on DeepFM, MovieLens. (Line plot: test AUC over training epochs 1-10 for the collision-free hash table versus the hash table with collisions.)
Figure 8: Effect of Embedding Collision on a Recommendation Model in Production. We measure the performance of this recommendation model by online serving AUC, which fluctuates across days due to concept drift. (Line plot: online serving AUC over roughly 12 days for the hash table with collisions versus the collisionless hash table.)

3.2.2. Online Training: Trading-off Reliability for Realtime

We discovered that a higher parameter synchronization frequency is always conducive to improving online serving AUC, and that online serving models are more tolerant to the loss of a few shards of PS than we expected.

    1. The effect of parameter synchronization frequency.
       In our online streaming training experiment (1) with the Criteo Display Ads Challenge dataset, model quality consistently improves with the parameter synchronization frequency, as is evident from two perspectives:
        • Models with online training perform better than models without. Figures 9a, 9b and 9c compare the AUC of online training models evaluated on the following shard of data with that of batch training models evaluated on each shard of data;
        • Models with a smaller parameter synchronization interval perform better than those with a larger interval. Figure 10 and Table 2 compare the online serving AUC of models with sync intervals of 5 hr, 1 hr, and 30 min respectively.
       The live A/B experiment between online training and batch training on an Ads model in production also shows a significant bump in online serving AUC (Table 3).
Figure 9: Online training vs. batch training on the Criteo dataset. Blue lines: AUC of models with online training; yellow lines: AUC of batch training models evaluated against streaming data. (Panels: (a) online training with 5 hr sync interval; (b) online training with 1 hr sync interval; (c) online training with 30 min sync interval; each panel plots AUC over roughly 50 hours.)


                                    Sync Interval                Average AUC (online)                           Average AUC (batch)
                                           5 hr                           79.66 ± 0.020                                  79.42 ± 0.026
                                           1 hr                           79.78 ± 0.005                                  79.44 ± 0.030
                                          30 min                          79.80 ± 0.008                                  79.43 ± 0.025

Table 2
Average AUC comparison for DeepFM model on Criteo dataset.



       Inspired by this observation, we synchronize the sparse parameters of our production models to the serving PS as frequently as possible (currently at minute level), to the extent that the computation overhead and system reliability can endure. Recall from 2.2.3 that dense variables require less frequent updates, so we update them at day level. By doing so, we bring our computation overhead down to a very low level. Suppose 100,000 IDs get updated in a minute and the embedding dimension is 1024 (so each float32 embedding is about 4 KB); the total size of data that needs to be transferred is 4 KB × 100,000 ≈ 400 MB per minute. For dense parameters, since they are synchronized daily, we schedule the synchronization when traffic is lowest (e.g., midnight).
Day                     1        2        3        4        5        6        7
AUC Improvement (%)     14.443   16.871   17.068   14.028   18.081   16.404   15.202

Table 3
Improvement of Online Training Over Batch Training from Live A/B Experiment on an Ads Model.



2. The Effect of PS Reliability.
With minute-level parameter synchronization, we initially expected that a more frequent snapshot of the training PS would be needed to match the real-time updates. To our surprise, we enlarged the snapshot interval to 1 day and still observed nearly no loss of model quality.
   Finding the right trade-off between model quality and computation overhead is difficult for personalized ranking systems, since users are extremely sensitive to recommendation quality. Traditionally, large-scale systems tend to set a frequent snapshot schedule for their models, which sacrifices computation resources in exchange for a minimized loss in model quality. We also explored this trade-off and, to our surprise, found that model quality is more robust than expected. With a 0.01% failure rate per PS machine per day, we find that a model from the previous day works embarrassingly well. This is explained by the following calculation: suppose a model's parameters are sharded across 1,000 PS, and they snapshot every day. Given the 0.01% failure rate, one of them goes down every 10 days, and we lose all updates on that PS for 1 day. Assuming a DAU of 15 million and an even distribution of user IDs across PS, we lose 1 day's feedback from 15,000 users every 10 days. This is acceptable because (a) for sparse features, which are user-specific, this is equivalent to losing feedback from a tiny fraction (0.01%) of DAU; and (b) for dense variables, since they are updated slowly as discussed in Section 2.2.3, losing 1 day's updates out of 1,000 PS is negligible.
   Based on the above observation and calculation (reproduced in the sketch below), we radically lowered our snapshot frequency and thereby saved quite a bit of computation overhead.
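
Both back-of-the-envelope estimates above (the roughly 400 MB/min synchronization payload and the 0.01%-of-DAU feedback loss) are easy to reproduce. The short Python sketch below is illustrative only; every constant is the example value quoted in the text, not an actual Monolith configuration parameter.

    # Back-of-the-envelope checks for the two estimates quoted above.
    # All constants are the illustrative values from the text, not a real
    # Monolith configuration.

    # 1. Minute-level synchronization payload for sparse parameters.
    touched_ids_per_minute = 100_000   # embedding IDs updated within one minute
    embedding_dim = 1024               # floats per embedding row
    bytes_per_float = 4                # float32
    row_bytes = embedding_dim * bytes_per_float            # 4 KB per ID
    payload_mb = touched_ids_per_minute * row_bytes / 1e6  # ~400 MB per minute
    print(f"sparse sync payload ~ {payload_mb:.0f} MB/min")

    # 2. Expected feedback loss with daily snapshots and imperfect PS machines.
    num_ps = 1_000               # parameter-server shards
    daily_failure_rate = 1e-4    # 0.01% per machine per day
    dau = 15_000_000             # daily active users

    failures_per_day = num_ps * daily_failure_rate  # 0.1, i.e. one failure per 10 days
    users_per_ps = dau / num_ps                      # 15,000 users per shard
    # A failed shard loses one day of updates for its users, so in expectation:
    lost_users_per_day = failures_per_day * users_per_ps
    print(f"expected loss ~ {lost_users_per_day:.0f} users/day "
          f"({lost_users_per_day / dau:.2%} of DAU)")

This matches the figures used in the discussion: roughly 400 MB of embedding traffic per minute, and an expected loss of about 1,500 users' feedback per day, i.e. 0.01% of DAU.
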
[Figure 10 plot omitted: AUC vs. hours for sync intervals of 5 hr, 1 hr, and 30 min.]
Figure 10: Comparison of different sync intervals for online training.


4. Related Work

Ever since some of the earliest successful applications of deep learning to industry-level recommendation systems [1, 2], researchers and engineers have been employing various techniques to ameliorate the issues mentioned in Section 1.
   To tackle the issue of sparse feature representation, [1, 10] use a fixed-size embedding table with the hashing trick. There are also attempts at improving hashing to reduce collisions [19, 10]. Other works directly utilize native key-value hash tables to allow dynamic growth of the table size [3, 5, 20, 4]. These implementations build upon TensorFlow but rely on either specially designed software mechanisms [3, 21, 20] or hardware [4] to access and manage their hash tables. Compared to these solutions, Monolith's hash table is yet another native TensorFlow operation. It is developer friendly and has higher cross-platform interoperability, which is suitable for ToB scenarios. An organic and tight integration with TensorFlow also enables easier optimization of computation performance.
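
To make the comparison above concrete, the toy sketch below shows what a dynamically growing, collisionless ID-to-embedding lookup can look like when expressed with stock TensorFlow 2.x primitives (tf.lookup.experimental.MutableHashTable plus an ordinary variable). It is only an illustration of the developer-facing idea under simplifying assumptions (a fixed capacity, no eviction, no frequency filtering or expiration); it is not Monolith's Cuckoo-HashMap implementation.

    import tensorflow as tf

    # Toy illustration: a growing, collisionless ID -> embedding lookup built
    # from stock TensorFlow ops. Not Monolith's dedicated Cuckoo-HashMap op.
    EMB_DIM = 8
    CAPACITY = 1_000_000  # illustrative upper bound on distinct IDs

    # Maps each raw 64-bit feature ID to its own embedding row, without collisions.
    id_to_row = tf.lookup.experimental.MutableHashTable(
        key_dtype=tf.int64, value_dtype=tf.int64, default_value=-1)
    embeddings = tf.Variable(tf.random.normal([CAPACITY, EMB_DIM]))
    next_row = tf.Variable(0, dtype=tf.int64)

    def lookup(ids):
        """Admit unseen IDs, then gather their (trainable) embedding rows."""
        rows = id_to_row.lookup(ids)
        new_ids, _ = tf.unique(tf.boolean_mask(ids, rows < 0))
        n_new = tf.size(new_ids, out_type=tf.int64)
        id_to_row.insert(new_ids, next_row + tf.range(n_new, dtype=tf.int64))
        next_row.assign_add(n_new)
        rows = id_to_row.lookup(ids)        # now every ID has a private row
        return tf.gather(embeddings, rows)  # gradients flow into `embeddings`

    vecs = lookup(tf.constant([42, 7, 42, 123456789], dtype=tf.int64))
    print(vecs.shape)  # (4, 8)

Because the whole lookup stays inside the TensorFlow graph, the same op can be used for training and serving without extra glue code, which is the interoperability property the comparison above refers to; the real system replaces this toy table with its collisionless Cuckoo HashMap and adds expirable embeddings and frequency filtering on top.
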
   Bridging the gap between training and serving and alleviating Concept Drift [7] is another topic of interest. To support online updates and avoid memory issues, both [5] and [3] designed feature-eviction mechanisms to flexibly adjust the size of embedding tables. Both [5] and [21] support some form of online training, where learned parameters are synced to serving at a relatively short interval compared to traditional batch training, with fault-tolerance mechanisms. Monolith takes a similar approach to elastically admitting and evicting features, while having a more lightweight parameter synchronization mechanism to guarantee model quality.


5. Conclusion

In this work, we reviewed several of the most important challenges for industrial-level recommendation systems and presented our production system, Monolith, which addresses them and achieves the best performance compared to existing solutions.
   We proved that a collisionless embedding table is essential for model quality, and demonstrated that our implementation of a Cuckoo-HashMap-based embedding table is both memory efficient and helpful for improving online serving metrics.
   We also proved that real-time serving is crucial in recommendation systems, and that the parameter synchronization interval should be as short as possible for ultimate model performance. Our solution for online real-time serving in Monolith has a delicately designed parameter synchronization and fault-tolerance mechanism: in our parameter synchronization algorithm, we showed that consistency of versions across different parts of the parameters can be traded off to reduce network bandwidth consumption; in our fault-tolerance design, we demonstrated that our strategy of trading off PS reliability for real-timeness is a robust solution.
   To conclude, Monolith succeeded in providing a general solution for production-scale recommendation systems.


Acknowledgements

Hanzhi Zhou provided useful suggestions on the revision of this paper.


References

[1] P. Covington, J. K. Adams, E. Sargin, Deep neural networks for youtube recommendations, Proceedings of the 10th ACM Conference on Recommender Systems (2016).
[2] U. Gupta, X. Wang, M. Naumov, C.-J. Wu, B. Reagen, D. M. Brooks, B. Cottel, K. M. Hazelwood, B. Jia, H.-H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, X. Zhang, The architectural implications of facebook's dnn-based personalized recommendation, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2020) 488–501.
[3] M. Xie, K. Ren, Y. Lu, G. Yang, Q. Xu, B. Wu, J. Lin, H. Ao, W. Xu, J. Shu, Kraken: Memory-efficient continual learning for large-scale real-time recommendations, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (2020) 1–17.
[4] W. Zhao, J. Zhang, D. Xie, Y. Qian, R. Jia, P. Li, Aibox: Ctr prediction model training on a single node, Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019).
[5] B. Jiang, C. Deng, H. Yi, Z. Hu, G. Zhou, Y. Zheng, S. Huang, X. Guo, D. Wang, Y. Song, L. Zhao, Z. Wang, P. Sun, Y. Zhang, D. Zhang, J. Li, J. Xu, X. Zhu, K. Gai, Xdl: An industrial deep learning framework for high-dimensional sparse data, Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data (2019).
[6] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. B. Aradhye, G. Anderson, G. S. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide & deep learning for recommender systems, Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (2016).
[7] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46 (2014) 1–37.
[8] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zhang, Tensorflow: A system for large-scale machine learning, ArXiv abs/1605.08695 (2016).
[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: NeurIPS, 2019.
[10] T. Bredillet, Core modeling at instagram, 2019. URL: https://instagram-engineering.com/core-modeling-at-instagram-a51e0158aa48.
[11] R. Pagh, F. F. Rodler, Cuckoo hashing, in: ESA, 2001.
[12] J. Kreps, Kafka: A distributed messaging system for log processing, 2011.
[13] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink™: Stream and batch processing in a single engine, IEEE Data Eng. Bull. 38 (2015) 28–38.
[14] K. V. Shvachko, H. Kuang, S. R. Radia, R. J. Chansler, The hadoop distributed file system, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010) 1–10.
[15] H. Wang, A. Zhang, C. Wang, Nonuniform negative sampling and log odds correction with rare events data, in: Advances in Neural Information Processing Systems, 2021.
[16] F. M. Harper, J. A. Konstan, The movielens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015) 19:1–19:19.
[17] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: A factorization-machine based neural network for ctr prediction, in: IJCAI, 2017.
[18] A. P. Bradley, The use of the area under the roc curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (1997) 1145–1159.
[19] A. Egg, Online learning for recommendations at grubhub, Fifteenth ACM Conference on Recommender Systems (2021).
[20] Meituan, Distributed training optimization for tensorflow in recommender systems (in Chinese), 2021. URL: https://tech.meituan.com/2021/12/09/meituan-tensorflow-in-recommender-systems.html.
[21] X. Lian, B. Yuan, X. Zhu, Y. Wang, Y. He, H. Wu, L. Sun, H. Lyu, C. Liu, X. Dong, Y. Liao, M. Luo, C. Zhang, J. Xie, H. Li, L. Chen, R. Huang, J. Lin, C. Shu, X.-B. Qiu, Z. Liu, D. Kong, L. Yuan, H. bo Yu, S. Yang, C. Zhang, J. Liu, Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters, ArXiv abs/2111.05897 (2021).