Monolith: Real Time Recommendation System With Collisionless Embedding Table

Zhuoran Liu¹, Leqi Zou¹, Xuan Zou¹, Caihua Wang¹, Biao Zhang¹, Da Tang¹, Bolin Zhu²,†, Yijie Zhu¹, Peng Wu¹, Ke Wang¹ and Youlong Cheng¹,‡

¹Bytedance Inc.   ²Fudan University

ORSUM@ACM RecSys 2022: 5th Workshop on Online Recommender Systems and User Modeling, jointly with the 16th ACM Conference on Recommender Systems, September 23rd, 2022, Seattle, WA, USA
†Work done during an internship at Bytedance Inc.
‡Corresponding author: youlong.cheng@bytedance.com (Y. Cheng)

Abstract

Building a scalable, real-time recommendation system is vital for many businesses driven by time-sensitive customer feedback, such as short-video ranking or online ads. Despite the ubiquitous adoption of production-scale deep learning frameworks like TensorFlow and PyTorch, these general-purpose frameworks fall short of business demands in recommendation scenarios for several reasons: on one hand, tweaking systems based on static parameters and dense computations for recommendation with dynamic and sparse features is detrimental to model quality; on the other hand, such frameworks are designed with the batch-training stage and the serving stage completely separated, preventing the model from interacting with customer feedback in real time. These issues led us to reexamine traditional approaches and explore radically different design choices. In this paper we present Monolith, a system tailored for online training. Our design has been driven by observations of our application workloads and production environment, and reflects a marked departure from other recommendation systems. Our contributions are manifold: first, we crafted a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint; second, we provide a production-ready online training architecture with high fault tolerance; finally, we prove that system reliability can be traded off for real-time learning. Monolith has successfully landed in the BytePlus Recommend product.

1. Introduction

The past decade witnessed a boom of businesses powered by recommendation techniques. In pursuit of a better customer experience, delivering personalized content to each individual user as a real-time response is a common goal of these business applications. To this end, information from a user's latest interactions is often used as the primary input for training a model, as it best depicts a user's portrait and supports predictions of the user's interests and future behaviors.

Deep learning has been dominating recommendation models [1, 2, 3, 4, 5, 6], as the gigantic amount of user data is a natural fit for massively data-driven neural models. However, efforts to leverage the power of deep learning in industry-level recommendation systems constantly run into problems arising from the unique characteristics of data derived from real-world user behavior. These data are drastically different from those used in conventional deep learning problems like language modeling or computer vision in two aspects:

1. The features are mostly sparse, categorical and dynamically changing;
2. The underlying distribution of training data is non-stationary, a.k.a. Concept Drift [7].

Such differences have posed unique challenges to researchers and engineers working on recommendation systems.

1.1. Sparsity and Dynamism

The data for recommendation mostly contain sparse categorical features, some of which appear with low frequency. The common practice of mapping them to a high-dimensional embedding space gives rise to a series of issues:

• Unlike language models, where the number of word-pieces is limited, the numbers of users and ranking items are orders of magnitude larger. Such an enormous embedding table would hardly fit into single host memory;
• Worse still, the size of the embedding table is expected to grow over time as more users and items are admitted, while frameworks like [8, 9] use fixed-size dense variables to represent embedding tables.

In practice, many systems adopt low-collision hashing [1, 10] as a way to reduce the memory footprint and to allow the set of IDs to grow. This relies on the over-idealistic assumptions that IDs in the embedding table are distributed evenly in frequency and that collisions are harmless to model quality. Unfortunately, this is rarely true for a real-world recommendation system, where a small group of users or items have significantly more occurrences. With the organic growth of the embedding table, the chance of hash key collisions increases and leads to deterioration of model quality [10].

It is therefore a natural demand for production-scale recommendation systems to have the capacity to capture as many features as possible in their parameters, and also the capability of elastically adjusting the number of users and items they book-keep.

Figure 1: Monolith Online Training Architecture.

1.2. Non-stationary Distribution

Visual and linguistic patterns barely change on a time scale of centuries, while the same user interested in one topic could shift their zeal every next minute. As a result, the underlying distribution of user data is non-stationary, a phenomenon commonly referred to as Concept Drift [7].

Intuitively, information from a more recent history contributes more effectively to predicting changes in a user's behavior. To mitigate the effect of Concept Drift, serving models need to be updated from new user feedback as close to real time as possible to reflect a user's latest interest.

In light of these distinctions, and in observation of issues that arise from our production, we designed Monolith, a large-scale recommendation system that addresses these pain points. We did extensive experiments to verify and iterate our design in the production environment. Monolith is able to

1. Provide full expressive power for sparse features by designing a collisionless hash table and a dynamic feature eviction mechanism;
2. Loop serving feedback back to training in real time with online training.

Empowered by these architectural capabilities, Monolith consistently outperforms systems that adopt hash tricks with collisions at roughly similar memory usage, and achieves state-of-the-art online serving AUC without overly burdening our servers' computation power.
The rest of the paper is organized as follows. We first elaborate the design details of how Monolith tackles the existing challenges with a collisionless hash table and real-time training in Section 2. Experiments and results are presented in Section 3, along with production-tested conclusions and some discussion of the trade-offs between time-sensitivity, reliability and model quality. Section 4 summarizes related work and compares it with Monolith. Section 5 concludes this work.

2. Design

The overall architecture of Monolith generally follows TensorFlow's distributed Worker-ParameterServer setting (Figure 2). In a Worker-PS architecture, machines are assigned different roles: Worker machines are responsible for performing computations as defined by the graph, and PS machines store parameters and update them according to gradients computed by Workers.

Figure 2: Worker-PS Architecture.

In recommendation models, parameters are categorized into two sets: dense and sparse. Dense parameters are the weights/variables of the deep neural network, while sparse parameters refer to the embedding tables that correspond to sparse features. In our design, both dense and sparse parameters are part of the TensorFlow graph and are stored on parameter servers.

Similar to TensorFlow's Variable for dense parameters, we designed a set of highly efficient, collisionless and flexible HashTable operations for sparse parameters. As a complement to the TensorFlow limitation that arises from the separation of training and inference, Monolith's elastically scalable online training is designed to efficiently synchronize parameters from the training PS to the online serving PS within short intervals, with model robustness guaranteed by a fault tolerance mechanism.

2.1. Hash Table

A first principle in our design of the sparse parameter representation is to avoid cramming information from different IDs into the same fixed-size embedding. Simulating a dynamically sized embedding table with an out-of-the-box TensorFlow Variable inevitably leads to ID collisions, which get worse as new IDs arrive and the table grows. Therefore, instead of building upon Variable, we developed a new key-value HashTable for our sparse parameters.

Our HashTable utilizes a Cuckoo Hashmap [11] under the hood, which supports inserting new keys without colliding with existing ones. Cuckoo Hashing achieves worst-case O(1) time complexity for lookups and deletions, and an expected amortized O(1) time for insertions. As illustrated in Figure 3, it maintains two tables T0, T1 with different hash functions h0(x), h1(x), and an element is stored in one of them. When trying to insert an element A into T0, it first attempts to place A at h0(A); if h0(A) is occupied by another element B, it evicts B from T0 and tries to insert B into T1 with the same logic. This process is repeated until all elements stabilize, or a rehash happens when the insertion runs into a cycle.

Figure 3: Cuckoo HashMap.
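For illustration, a minimal Python sketch of cuckoo insertion over two tables is given below. This is not Monolith's production implementation (which is realized as a native TensorFlow operation); the table size, hash functions and rehash policy here are assumptions chosen only to make the eviction loop concrete.

# Illustrative sketch of cuckoo-hashing insertion with two tables T0, T1.
# Not Monolith's actual implementation; sizes and hash choices are assumptions.

class CuckooTable:
    def __init__(self, capacity=8, max_displacements=32):
        self.capacity = capacity
        self.max_displacements = max_displacements
        self.tables = [[None] * capacity, [None] * capacity]  # T0, T1

    def _hash(self, which, key):
        # Two independent hash functions h0, h1 (illustrative choices).
        salt = (0x9E3779B9, 0x85EBCA6B)[which]
        return hash((key, salt)) % self.capacity

    def lookup(self, key):
        for which in (0, 1):
            slot = self.tables[which][self._hash(which, key)]
            if slot is not None and slot[0] == key:
                return slot[1]
        return None

    def insert(self, key, value):
        entry, which = (key, value), 0
        for _ in range(self.max_displacements):
            idx = self._hash(which, entry[0])
            victim = self.tables[which][idx]
            self.tables[which][idx] = entry
            if victim is None or victim[0] == entry[0]:
                return                        # placed (or replaced) without further eviction
            entry, which = victim, 1 - which  # re-insert the evicted entry into the other table
        self._rehash_and_insert(entry)        # displacement cycle detected: grow and rehash

    def _rehash_and_insert(self, entry):
        old = [e for t in self.tables for e in t if e is not None] + [entry]
        self.capacity *= 2
        self.tables = [[None] * self.capacity, [None] * self.capacity]
        for k, v in old:
            self.insert(k, v)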
Memory footprint reduction is also an important consideration in our design. A naive approach of inserting every new ID into the HashTable would deplete memory quickly. Observation of real production models leads to two conclusions:

1. IDs that appear only a handful of times contribute little to model quality. An important observation is that IDs are long-tail distributed: popular IDs may occur millions of times, while unpopular ones appear no more than ten times. Embeddings corresponding to these infrequent IDs are underfit due to the lack of training data, and the model cannot make good estimations based on them. Since these IDs are unlikely to affect the result, model quality does not suffer from the removal of IDs with low occurrence;
2. Stale IDs from a distant history seldom contribute to the current model, as many of them are never visited again. This could be due to a user who is no longer active, or a short-video that is out of date. Storing embeddings for these IDs does not help the model in any way, but drains PS memory in vain.

Based on these observations, we designed several feature-ID filtering heuristics for a more memory-efficient implementation of HashTable (see the sketch after this list):

1. IDs are filtered before they are admitted into embedding tables. We have two filtering methods: first, we filter IDs by their occurrence count before they are inserted as keys, where the occurrence threshold is a tunable hyperparameter that varies per model; in addition, we utilize a probabilistic filter, which helps further reduce memory usage;
2. IDs are timed and set to expire after being inactive for a predefined period of time. The expiry time is also tunable per embedding table, to allow distinguishing features with different sensitivity to historical information.
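A minimal sketch of how such admission and expiry heuristics might be wired together is shown below. The threshold values, the probabilistic-admission rule, and the bookkeeping structures are illustrative assumptions for exposition, not Monolith's production logic.

import random
import time

# Illustrative sketch of ID admission and expiration heuristics.
# Thresholds and data structures are assumptions, not Monolith's production code.

class FilteredEmbeddingTable:
    def __init__(self, occurrence_threshold=5, admit_probability=1.0,
                 expire_after_seconds=30 * 24 * 3600):
        self.occurrence_threshold = occurrence_threshold  # tunable per model
        self.admit_probability = admit_probability        # optional probabilistic filter
        self.expire_after_seconds = expire_after_seconds  # tunable per embedding table
        self.occurrences = {}   # counts for IDs not yet admitted
        self.embeddings = {}    # admitted ID -> embedding
        self.last_seen = {}     # admitted ID -> last access timestamp

    def lookup_or_admit(self, feature_id, init_fn, now=None):
        now = now if now is not None else time.time()
        if feature_id in self.embeddings:
            self.last_seen[feature_id] = now
            return self.embeddings[feature_id]
        # Admission: require enough occurrences, then (optionally) admit probabilistically.
        count = self.occurrences.get(feature_id, 0) + 1
        self.occurrences[feature_id] = count
        if count >= self.occurrence_threshold and random.random() < self.admit_probability:
            self.embeddings[feature_id] = init_fn()
            self.last_seen[feature_id] = now
            del self.occurrences[feature_id]
            return self.embeddings[feature_id]
        return None  # not admitted yet: caller falls back to a default embedding

    def evict_expired(self, now=None):
        now = now if now is not None else time.time()
        stale = [fid for fid, t in self.last_seen.items()
                 if now - t > self.expire_after_seconds]
        for fid in stale:
            self.embeddings.pop(fid, None)
            self.last_seen.pop(fid, None)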
In our implementation, the HashTable is realized as a TensorFlow resource operation. As with Variable, look-ups and updates are also implemented as native TensorFlow operations for easier integration and better compatibility.

2.2. Online Training

In Monolith, training is divided into two stages (Figure 1):

1. Batch training stage. This stage works as an ordinary TensorFlow training loop: in each training step, a training worker reads one mini-batch of training examples from storage, requests parameters from the PS, computes a forward and a backward pass, and finally pushes the updated parameters to the training PS. Slightly different from other common deep learning tasks, we train on our dataset for only one pass. Batch training is useful for training on historical data when we modify the model architecture and retrain the model;
2. Online training stage. After a model is deployed to online serving, training does not stop but enters the online training stage. Instead of reading mini-batch examples from storage, a training worker consumes real-time data on the fly and updates the training PS. The training PS periodically synchronizes its parameters to the serving PS, which takes effect on the user side immediately. This enables our model to interactively adapt itself according to a user's feedback in real time.

2.2.1. Streaming Engine

Monolith is built with the capability of seamlessly switching between batch training and online training. This is enabled by the design of our streaming engine, illustrated in Figure 4.

Figure 4: Streaming Engine. The information feedback loop from [User → Model Server → Training Worker → Model Server → User] takes a long time along the Batch Training path, while Online Training closes the loop much more quickly.

In our design, we use one Kafka [12] queue to log the actions of users (e.g., clicking on or liking an item) and another Kafka queue for features. At the core of the engine is a Flink [13] streaming job, the online joiner. The online joiner concatenates features with labels from user actions and produces training examples, which are then written to a Kafka queue. The queue of training examples is consumed by both online training and batch training:

• For online training, the training worker directly reads data from the Kafka queue;
• For batch training, a data-dumping job first dumps the data to HDFS [14]; after the data accumulated in HDFS reaches a certain amount, the training worker retrieves it from HDFS and performs batch training.

Updated parameters in the training PS are pushed to the serving PS according to the parameter synchronization schedule.

2.2.2. Online Joiner

Figure 5: Online Joiner.

In real-world applications, user action logs and features are streamed into the online joiner (Figure 5) without any guarantee of time order. Therefore, we use a unique key for each request so that user actions and features can be correctly paired up.

The lag of user actions can also be a problem. For example, a user may take a few days before deciding to buy an item they were presented days ago. This is a challenge for the joiner: if all features were kept in cache, they would simply not fit in memory. In our system, an on-disk key-value store is utilized to hold features that have been waiting longer than a certain time period. When a user action log arrives, it first looks up the in-memory cache, and then looks up the key-value store in case of a cache miss.

Another problem arising in real-world applications is that the distribution of negative and positive examples is highly uneven: the number of the former can be orders of magnitude higher than that of the latter. To prevent positive examples from being overwhelmed by negative ones, a common strategy is negative sampling. This certainly changes the underlying distribution learned by the trained model, tweaking it towards a higher probability of making positive predictions. As a remedy, we apply log-odds correction [15] during serving, making sure that the online model is an unbiased estimator of the original distribution.
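The exact correction from [15] is not spelled out here; assuming, purely for illustration, that negatives are kept i.i.d. with probability r, a standard log-odds correction shifts the serving-time logit by log r, as in the sketch below.

import math

# Sketch of a standard log-odds correction for uniform negative down-sampling.
# The formula is an assumption for illustration and is not quoted from the paper.

def corrected_probability(trained_logit: float, neg_sampling_rate: float) -> float:
    """Map a logit learned on negative-down-sampled data back to the original scale.

    Down-sampling negatives at rate r inflates the odds by a factor of 1/r, so the
    unbiased log-odds are the trained log-odds plus log(r).
    """
    corrected_logit = trained_logit + math.log(neg_sampling_rate)
    return 1.0 / (1.0 + math.exp(-corrected_logit))

# Example: a model trained with 10% of negatives kept predicts logit 0.0 (p = 0.5);
# after correction, the served probability is roughly 0.09.
print(corrected_probability(0.0, neg_sampling_rate=0.1))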
2.2.3. Parameter Synchronization

During online training, the Monolith training cluster keeps receiving data from the online serving module and keeps updating parameters on the training PS. A crucial step in enabling the online serving PS to benefit from these newly trained parameters is the synchronization of updated model parameters. In the production environment, we face several challenges:

• Models on the online serving PS must not stop serving while being updated. Our models in production are usually several terabytes in size, so replacing all parameters takes a while. It would be intolerable to stop an online PS from serving the model during the replacement process, so updates must be made on the fly;
• Transferring a multi-terabyte model in its entirety from the training PS to the online serving PS would put huge pressure on both network bandwidth and PS memory, since it requires twice the model size in memory to accept the newly arriving model.

For online training to scale up to the size of our business scenario, we designed an incremental, on-the-fly, periodic parameter synchronization mechanism in Monolith, based on several noticeable characteristics of our models:

1. Sparse parameters dominate the size of recommendation models;
2. Within a short time window, only a small subset of IDs gets trained and has its embeddings updated;
3. Dense variables move much more slowly than sparse embeddings. This is because in momentum-based optimizers, the accumulation of momentum for dense variables is magnified by the gigantic size of the recommendation training data, while only a few sparse embeddings receive updates in a single data batch.

(1) and (2) allow us to exploit the sparsity of updates across all feature IDs. In Monolith, we maintain a hash set of touched keys, representing the IDs whose embeddings have been trained since the last parameter synchronization. We push the subset of sparse parameters whose keys are in the touched-keys set from the training PS to the online serving PS at a minute-level interval. This relatively small pack of incremental parameter updates is lightweight for network transmission and does not cause a sharp memory spike during synchronization.

We also exploit (3) to further reduce network I/O and memory usage by setting a more aggressive sync schedule for sparse parameters, while updating dense parameters less frequently. This can lead to a situation where the dense parameters we serve are a relatively stale version compared to the sparse part. However, such inconsistency can be tolerated for the reason mentioned in (3), as no conspicuous loss has been observed.
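A minimal sketch of this touched-keys bookkeeping is shown below. The PS interfaces (read_embeddings, push_sparse, push_dense) and the interval constants are hypothetical placeholders, not Monolith's actual API; an external scheduler is assumed to invoke sync_sparse and sync_dense at the stated intervals.

import threading

# Illustrative sketch of incremental sync via a set of "touched" keys.

class IncrementalSynchronizer:
    def __init__(self, training_ps, serving_ps,
                 sparse_interval_s=60, dense_interval_s=24 * 3600):
        self.training_ps = training_ps
        self.serving_ps = serving_ps
        self.sparse_interval_s = sparse_interval_s   # aggressive schedule for embeddings
        self.dense_interval_s = dense_interval_s     # dense variables move slowly
        self.touched_keys = set()
        self.lock = threading.Lock()

    def on_sparse_update(self, feature_ids):
        # Called by the training loop after each step: record which embeddings changed.
        with self.lock:
            self.touched_keys.update(feature_ids)

    def sync_sparse(self):
        # Push only the embeddings touched since the last sync, then reset the set.
        with self.lock:
            keys, self.touched_keys = self.touched_keys, set()
        embeddings = self.training_ps.read_embeddings(keys)
        self.serving_ps.push_sparse(embeddings)      # small incremental payload

    def sync_dense(self):
        # Full dense-variable sync, scheduled far less often (e.g. once a day at low traffic).
        self.serving_ps.push_dense(self.training_ps.read_dense_variables())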
2.3. Fault Tolerance

As a system in production, Monolith is designed with the ability to recover a PS in case it fails. A common choice for fault tolerance is to snapshot the state of the model periodically and recover from the latest snapshot when a PS failure is detected. The choice of snapshot frequency has two major impacts:

1. Model quality. Intuitively, model quality suffers less from the loss of recent history as snapshot frequency increases.
2. Computation overhead. Snapshotting a multi-terabyte model is not free. It incurs large amounts of memory copying and disk I/O.

As a trade-off between model quality and computation overhead, Monolith snapshots all training PS every day. Though a PS will lose one day's worth of updates in case of a failure, we found through our experiments that the resulting performance degradation is tolerable. We analyze the effect of PS reliability in the next section.

3. Evaluation

For a better understanding of the benefits and trade-offs brought about by our proposed design, we conducted several experiments at production scale, as well as A/B tests with live serving traffic, to evaluate and verify Monolith from different aspects. We aim to answer the following questions with our experiments:

1. How much do we benefit from a collisionless HashTable?
2. How important is real-time online training?
3. Is Monolith's design of parameter synchronization robust enough in a large-scale production scenario?

In this section, we first present our experimental settings and then discuss the results and our findings in detail.

3.1. Experimental Setup

3.1.1. Embedding Table

As described in Section 2.1, embedding tables in Monolith are implemented as collisionless HashTables. To prove the necessity of avoiding collisions in embedding tables and to quantify the gains from our collisionless implementation, we performed two groups of experiments, on the MovieLens dataset and on our internal production dataset respectively:

1. MovieLens ml-25m dataset [16]. This is a standard public dataset for movie ratings, containing 25 million ratings involving approximately 162,000 users and 62,000 movies.
   • Preprocessing of labels. The original labels are ratings from 0.5 to 5.0, while in production our tasks mostly receive binary signals from users. To better simulate our production models, we convert the rating labels to binary labels by treating scores ≥ 3.5 as positive samples and the rest as negative samples.
   • Model and metrics. We implemented a standard DeepFM [17] model, a commonly used architecture for recommendation problems. It consists of an FM component and a dense component (Figure 6). Predictions are evaluated by AUC [18], as this is the major measurement for real models.
   • Embedding collisions. This dataset contains approximately 160K user IDs and 60K movie IDs. To compare with the collisionless embedding table implementation, we performed another group of experiments where IDs are preprocessed with MD5 hashing and then mapped to a smaller ID space, so that some IDs share their embedding with others. Table 1 shows detailed statistics of user and movie IDs before and after hashing.

Figure 6: DeepFM model architecture.

Table 1: Statistics of IDs Before and After Hashing.

                     User IDs    Movie IDs
  # Before Hashing   162541      59047
  # After Hashing    149970      57361
  Collision rate     7.73%       2.86%

2. Internal recommendation dataset. We also performed experiments on a recommendation model in a production environment. This model generally follows a multi-tower architecture, with each tower responsible for learning to predict a specialized kind of user behavior.
   • Each model has around 1000 embedding tables, and the distribution of embedding table sizes is very uneven;
   • The original ID space of the embedding tables was 2^48. In our baseline, we applied a decomposition hashing trick to curb the size of the embedding tables. To be more specific, we use two smaller embedding tables instead of a gigantic one and generate a unique embedding for each ID by vector combination:

       ID_r = ID mod 2^24
       ID_q = floor(ID / 2^24)
       E = E_r + E_q,

     where E_r and E_q are the embeddings corresponding to ID_r and ID_q. This effectively reduces the embedding table size from 2^48 to 2^25 (see the sketch after this list);
   • This model is serving in real production, and the performance in this experiment is measured by online AUC with real serving traffic.
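To make the baseline concrete, the sketch below shows the decomposition trick with deliberately small tables; the real baseline uses two 2^24-entry tables over a 2^48 ID space, and the table size and embedding dimension here are illustrative assumptions.

import numpy as np

# Sketch of the baseline decomposition trick: an ID is split into a quotient and a
# remainder, each indexing a smaller table, and the two rows are summed.
# Sizes are scaled down purely for illustration (the paper's baseline uses 2^24).

BUCKETS = 2 ** 8
DIM = 8
rng = np.random.default_rng(0)
table_r = rng.normal(scale=0.01, size=(BUCKETS, DIM)).astype(np.float32)
table_q = rng.normal(scale=0.01, size=(BUCKETS, DIM)).astype(np.float32)

def decomposed_embedding(feature_id: int) -> np.ndarray:
    id_r = feature_id % BUCKETS    # remainder indexes the first table
    id_q = feature_id // BUCKETS   # quotient indexes the second table
    return table_r[id_r] + table_q[id_q]

# Each ID in [0, BUCKETS**2) maps to a distinct (id_q, id_r) pair, so the combined
# vector is unique, but different IDs still share the underlying rows -- unlike a
# collisionless HashTable, where every ID owns its own embedding.
print(decomposed_embedding(12345).shape)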
3.1.2. Online Training

During online training, we update our online serving PS with the latest set of parameters at minute-level intervals. We designed two groups of experiments to verify model quality and system robustness.

1. Update frequency. To investigate the necessity of minute-level update frequency, we conducted experiments that synchronize parameters from the training model to the prediction model at different intervals.
   The dataset we use is the Criteo Display Ads Challenge dataset¹, a large-scale standard dataset for benchmarking CTR models. It contains 7 days of chronologically ordered data recording features and click actions. For this experiment, we use a standard DeepFM [17] model as described in Figure 6.
   To simulate online training, we did the following preprocessing of the dataset. We take the 7 days of data and split them into two parts: 5 days of data for batch training and 2 days for online training. We further split the 2 days of data into N shards chronologically. Online training is then simulated by Algorithm 1.
   As such, we simulate synchronizing trained parameters to the online serving PS at an interval determined by the number of data shards. We experimented with N = 10, 50, 100, which roughly correspond to update intervals of 5 hr, 1 hr and 30 min.
2. Live experiment. In addition, we performed a live experiment with real serving traffic to further demonstrate the importance of online training in a real-world application. This A/B experiment compares online training to batch training on one of our Ads models in production.

¹ https://www.kaggle.com/competitions/criteo-display-ad-challenge/data

Algorithm 1: Simulated Online Training.
1: Input: D_batch                 /* Data for batch training. */
2: Input: D_online_{i=1...N}      /* Data for online training, split into N shards. */
3: θ_train ← Train(D_batch, θ_train)            /* Batch training. */
   /* Online training. */
4: for i = 1 ... N do
5:     θ_serve ← θ_train                        /* Sync training parameters to serving model. */
6:     AUC_i ← Evaluate(θ_serve, D_online_i)    /* Evaluate online prediction on new data. */
7:     θ_train ← Train(D_online_i, θ_train)     /* Train with new data. */
8: end for
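For readers who prefer code, the following Python sketch mirrors Algorithm 1; the train, evaluate and copy_params helpers are hypothetical stand-ins for the actual training and serving machinery.

# Python sketch of Algorithm 1 (simulated online training).

def simulated_online_training(batch_data, online_shards, train, evaluate, copy_params):
    """Batch-train once, then alternate sync -> evaluate -> train over the N shards."""
    theta_train = train(batch_data, params=None)            # batch training on historical data
    auc_per_shard = []
    for shard in online_shards:                              # shards are chronologically ordered
        theta_serve = copy_params(theta_train)               # sync training params to serving model
        auc_per_shard.append(evaluate(theta_serve, shard))   # online prediction on unseen data
        theta_train = train(shard, params=theta_train)       # then train on that shard
    return auc_per_shard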
3.2. Results and Analysis

3.2.1. The Effect of Embedding Collision

Results on both the MovieLens dataset and the internal recommendation dataset show that embedding collisions jeopardize model quality.

1. Models with a collisionless HashTable consistently outperform those with collisions. This conclusion holds true regardless of
   • the number of training epochs. As shown in Figure 7, the model with a collisionless embedding table has higher AUC from the first epoch and converges to a higher value;
   • the change of distribution over time due to Concept Drift. As shown in Figure 8, models with a collisionless embedding table are also robust as time passes and the user/item context changes.
2. Data sparsity caused by a collisionless embedding table does not lead to overfitting. As shown in Figure 7, a model with a collisionless embedding table does not overfit after it converges.

Figure 7: Effect of Embedding Collision on DeepFM, MovieLens (test AUC over 10 training epochs, collision-free hash vs. hash with collision).

Figure 8: Effect of Embedding Collision on a Recommendation Model in Production. We measure the performance of this recommendation model by online serving AUC, which fluctuates across days due to concept drift.

3.2.2. Online Training: Trading Off Reliability for Realtime

We discovered that a higher parameter synchronization frequency is always conducive to improving online serving AUC, and that online serving models are more tolerant to the loss of a few PS shards than we expected.

1. The Effect of Parameter Synchronization Frequency. In our online streaming training experiment (1) with the Criteo Display Ads Challenge dataset, model quality consistently improves as the parameter synchronization frequency increases, as is evident from two comparisons:
   • Models with online training perform better than models without. Figures 9a, 9b and 9c compare the AUC of online training models evaluated on the following shard of data versus batch training models evaluated on each shard of data;
   • Models with a smaller parameter synchronization interval perform better than those with a larger interval. Figure 10 and Table 2 compare online serving AUC for models with sync intervals of 5 hr, 1 hr and 30 min respectively.

Figure 9: Online training vs. batch training on the Criteo dataset ((a) 5 hr, (b) 1 hr, (c) 30 min sync interval). Blue lines: AUC of models with online training; yellow lines: AUC of batch training models evaluated against streaming data.

Figure 10: Comparison of different sync intervals for online training.

Table 2: Average AUC comparison for the DeepFM model on the Criteo dataset.

  Sync Interval   Average AUC (online)   Average AUC (batch)
  5 hr            79.66 ± 0.020          79.42 ± 0.026
  1 hr            79.78 ± 0.005          79.44 ± 0.030
  30 min          79.80 ± 0.008          79.43 ± 0.025

The live A/B experiment between online training and batch training on an Ads model in production also shows a significant bump in online serving AUC (Table 3).

Table 3: Improvement of Online Training over Batch Training in a Live A/B Experiment on an Ads Model.

  Day                 1       2       3       4       5       6       7
  AUC Improvement %   14.443  16.871  17.068  14.028  18.081  16.404  15.202

Inspired by this observation, we synchronize sparse parameters to the serving PS of our production models as frequently as possible (currently at minute level), to the extent that the computation overhead and system reliability can endure. Recalling that dense variables require less frequent updates, as discussed in Section 2.2.3, we update them at day level. Doing so brings our computation overhead down to a very low level. Suppose 100,000 IDs get updated in a minute and the embedding dimension is 1024; the total amount of data that needs to be transferred is 4 KB × 100,000 ≈ 400 MB per minute. For dense parameters, since they are synchronized daily, we schedule the synchronization when traffic is lowest (e.g., midnight).
2. The Effect of PS Reliability. With minute-level parameter synchronization, we initially expected to need more frequent snapshots of the training PS to match the real-time updates. To our surprise, we enlarged the snapshot interval to 1 day and still observed nearly no loss of model quality.
   Finding the right trade-off between model quality and computation overhead is difficult for personalized ranking systems, since users are extremely sensitive to recommendation quality. Traditionally, large-scale systems tend to set a frequent snapshot schedule for their models, sacrificing computation resources in exchange for minimal loss in model quality. We did quite some exploration in this regard and, to our surprise, model quality is more robust than expected. With a 0.01% daily failure rate per PS machine, we find that a model from the previous day works embarrassingly well. This is explained by the following calculation: suppose a model's parameters are sharded across 1000 PS and snapshotted every day. Given a 0.01% failure rate, one of them goes down every 10 days and we lose all updates on that PS for 1 day. Assuming a DAU of 15 million and an even distribution of user IDs across PS, we lose 1 day's feedback from 15,000 users every 10 days. This is acceptable because (a) for user-specific sparse features, this amounts on average to losing one day's feedback from about 0.01% of DAU; (b) for dense variables, since they are updated slowly as discussed in Section 2.2.3, losing 1 day's updates on 1 out of 1000 PS is negligible.
   Based on the above observation and calculation, we radically lowered our snapshot frequency and thereby saved considerable computation overhead.
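For concreteness, the back-of-envelope estimate above can be reproduced with the numbers stated in the text (0.01% per-PS daily failure rate, 1000 PS shards, 15 million DAU, daily snapshots); the snippet below is purely illustrative arithmetic.

ps_count = 1000
daily_failure_rate = 1e-4            # 0.01% per PS per day
dau = 15_000_000

expected_failures_per_day = ps_count * daily_failure_rate   # 0.1, i.e. one failure every ~10 days
days_between_failures = 1 / expected_failures_per_day       # ~10 days
users_per_ps = dau / ps_count                                # ~15,000 users' sparse features per shard
amortized_dau_fraction = users_per_ps / days_between_failures / dau  # ~0.01% of DAU per day

print(f"one PS failure every ~{days_between_failures:.0f} days, "
      f"losing a day of updates for ~{users_per_ps:.0f} users "
      f"(~{amortized_dau_fraction:.2%} of DAU per day)")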
4. Related Work

Ever since the earliest successful applications of deep learning to industry-level recommendation systems [1, 2], researchers and engineers have employed various techniques to ameliorate the issues mentioned in Section 1.

To tackle the issue of sparse feature representation, [1, 10] use fixed-size embedding tables with the hash trick. There have also been attempts to improve hashing to reduce collisions [19, 10]. Other works directly utilize native key-value hash tables to allow dynamic growth of table size [3, 5, 20, 4]. These implementations build upon TensorFlow but rely either on specially designed software mechanisms [3, 21, 20] or on hardware [4] to access and manage their hash tables. Compared to these solutions, Monolith's hash table is yet another native TensorFlow operation. It is developer friendly and has higher cross-platform interoperability, which is suitable for ToB scenarios. An organic and tight integration with TensorFlow also enables easier optimization of computation performance.

Bridging the gap between training and serving and alleviating Concept Drift [7] is another topic of interest. To support online updates and avoid memory issues, both [5] and [3] designed feature eviction mechanisms that flexibly adjust the size of embedding tables. Both [5] and [21] support some form of online training, where learned parameters are synced to serving at a relatively short interval compared to traditional batch training, with fault tolerance mechanisms. Monolith takes a similar approach to elastically admitting and evicting features, while it has a more lightweight parameter synchronization mechanism to guarantee model quality.

5. Conclusion

In this work, we reviewed several of the most important challenges for industrial-level recommendation systems and presented our system in production, Monolith, to address them, achieving the best performance compared to existing solutions.

We proved that a collisionless embedding table is essential for model quality, and demonstrated that our Cuckoo HashMap based embedding table implementation is both memory efficient and helpful for improving online serving metrics.

We also proved that real-time serving is crucial in recommendation systems, and that the parameter synchronization interval should be as short as possible for ultimate model performance. Our solution for online real-time serving in Monolith has a delicately designed parameter synchronization and fault tolerance mechanism: in our parameter synchronization algorithm, we showed that version consistency across different parts of the parameters can be traded off for reduced network bandwidth consumption; in our fault tolerance design, we demonstrated that our strategy of trading off PS reliability for real-timeness is a robust solution.

To conclude, Monolith succeeded in providing a general solution for production-scale recommendation systems.

Acknowledgements

Hanzhi Zhou provided useful suggestions on the revision of this paper.

References

[1] P. Covington, J. K. Adams, E. Sargin, Deep neural networks for YouTube recommendations, Proceedings of the 10th ACM Conference on Recommender Systems (2016).
[2] U. Gupta, X. Wang, M. Naumov, C.-J. Wu, B. Reagen, D. M. Brooks, B. Cottel, K. M. Hazelwood, B. Jia, H.-H. S. Lee, A. Malevich, D. Mudigere, M. Smelyanskiy, L. Xiong, X. Zhang, The architectural implications of Facebook's DNN-based personalized recommendation, 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA) (2020) 488–501.
[3] M. Xie, K. Ren, Y. Lu, G. Yang, Q. Xu, B. Wu, J. Lin, H. Ao, W. Xu, J. Shu, Kraken: Memory-efficient continual learning for large-scale real-time recommendations, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis (2020) 1–17.
[4] W. Zhao, J. Zhang, D. Xie, Y. Qian, R. Jia, P. Li, AIBox: CTR prediction model training on a single node, Proceedings of the 28th ACM International Conference on Information and Knowledge Management (2019).
[5] B. Jiang, C. Deng, H. Yi, Z. Hu, G. Zhou, Y. Zheng, S. Huang, X. Guo, D. Wang, Y. Song, L. Zhao, Z. Wang, P. Sun, Y. Zhang, D. Zhang, J. Li, J. Xu, X. Zhu, K. Gai, XDL: an industrial deep learning framework for high-dimensional sparse data, Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data (2019).
[6] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. B. Aradhye, G. Anderson, G. S. Corrado, W. Chai, M. Ispir, R. Anil, Z. Haque, L. Hong, V. Jain, X. Liu, H. Shah, Wide & deep learning for recommender systems, Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (2016).
[7] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, A. Bouchachia, A survey on concept drift adaptation, ACM Computing Surveys (CSUR) 46 (2014) 1–37.
[8] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zhang, TensorFlow: A system for large-scale machine learning, ArXiv abs/1605.08695 (2016).
[9] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, in: NeurIPS, 2019.
[10] T. Bredillet, Core modeling at Instagram, 2019. URL: https://instagram-engineering.com/core-modeling-at-instagram-a51e0158aa48.
[11] R. Pagh, F. F. Rodler, Cuckoo hashing, in: ESA, 2001.
[12] J. Kreps, Kafka: a distributed messaging system for log processing, 2011.
[13] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, K. Tzoumas, Apache Flink™: Stream and batch processing in a single engine, IEEE Data Eng. Bull. 38 (2015) 28–38.
[14] K. V. Shvachko, H. Kuang, S. R. Radia, R. J. Chansler, The Hadoop distributed file system, 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST) (2010) 1–10.
[15] H. Wang, A. Zhang, C. Wang, Nonuniform negative sampling and log odds correction with rare events data, in: Advances in Neural Information Processing Systems, 2021.
[16] F. M. Harper, J. A. Konstan, The MovieLens datasets: History and context, ACM Trans. Interact. Intell. Syst. 5 (2015) 19:1–19:19.
[17] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, DeepFM: A factorization-machine based neural network for CTR prediction, in: IJCAI, 2017.
[18] A. P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (1997) 1145–1159.
[19] A. Egg, Online learning for recommendations at Grubhub, Fifteenth ACM Conference on Recommender Systems (2021).
[20] Meituan, Distributed training optimization for TensorFlow in recommender systems (in Chinese), 2021. URL: https://tech.meituan.com/2021/12/09/meituan-tensorflow-in-recommender-systems.html.
[21] X. Lian, B. Yuan, X. Zhu, Y. Wang, Y. He, H. Wu, L. Sun, H. Lyu, C. Liu, X. Dong, Y. Liao, M. Luo, C. Zhang, J. Xie, H. Li, L. Chen, R. Huang, J. Lin, C. Shu, X.-B. Qiu, Z. Liu, D. Kong, L. Yuan, H. Yu, S. Yang, C. Zhang, J. Liu, Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters, ArXiv abs/2111.05897 (2021).