=Paper= {{Paper |id=Vol-3837/paper1 |storemode=property |title=A Bag of Tricks for Scaling CPU-based Deep FFMs to more than 300m Predictions per Second |pdfUrl=https://ceur-ws.org/Vol-3837/paper_02_ceur_paper.pdf |volume=Vol-3837 |authors=Blaž Škrlj,Benjamin Ben-Shalom,Grega Gašperšič,Adi Schwartz,Ramzi Hoseisi,Naama Ziporin,Davorin Kopič,Andraž Tori |dblpUrl=https://dblp.org/rec/conf/adkdd/SkrljBGSHZKT24 }} ==A Bag of Tricks for Scaling CPU-based Deep FFMs to more than 300m Predictions per Second== https://ceur-ws.org/Vol-3837/paper_02_ceur_paper.pdf
A Bag of Tricks for Scaling CPU-based Deep FFMs to more than 300m Predictions per Second
Blaž Škrlj¹,*, Benjamin Ben-Shalom¹, Grega Gašperšič¹, Adi Schwartz¹, Ramzi Hoseisi¹, Naama Ziporin¹, Davorin Kopič¹ and Andraž Tori¹
¹ Outbrain Inc.


                                             Abstract
                                             Field-aware Factorization Machines (FFMs) have emerged as a powerful model for click-through rate prediction, particularly excelling
                                             in capturing complex feature interactions. In this work, we present an in-depth analysis of our in-house, Rust-based Deep FFM
                                             implementation, and detail its deployment on a CPU-only, multi-data-center scale. We overview key optimizations devised for both
                                             training and inference, demonstrated by previously unpublished benchmark results in efficient model search and online training. Further,
                                             we detail an in-house weight quantization that resulted in more than an order of magnitude reduction in bandwidth footprint related to
                                             weight transfers across data-centres. We disclose the engine and associated techniques under an open-source license to contribute to
                                             the broader machine learning community. This paper showcases one of the first successful CPU-only deployments of Deep FFMs at
                                             such scale, marking a significant stride in practical, low-footprint click-through rate prediction methodologies.

                                             Keywords
                                             Data Stream Mining, Factorization Machines, Online Learning, Scalable Machine Learning



[Figure 1 diagram; blocks: AutoML Model search, Incremental (Online) Model training, Path to production, Model Transfer and storage, Model serving]
Figure 1: Overview of the key topics discussed in this paper. Performance optimizations that span model search (AutoML), online model training, storage, transfer and serving are discussed.

1. Introduction
Design and development of machine learning approaches for the domain of recommendation systems revolves around the interplay between scalability and approximation capability of classification and regression algorithms. Currently, many deployed recommendation engines rely on factorization machine-based approaches; this is mostly due to good trade-offs when it comes to scalability, maintainability and data scientists' involvement in building such models. Even though contemporary recommenders started to increasingly rely on language model-based techniques [1], utilizing factorization machines remains the de facto solution for large-scale "screening" of candidates that are to be served. Such candidates range from unseen items (online stores), to movie recommendations, to ads [2, 3]. Scalability of factorization machines enables creation of real-time systems that handle hundreds of millions of requests in a predictable and maintainable manner. In recent years, two main branches of methods have emerged. Approaches based on frameworks such as TensorFlow [4] and PyTorch [5] enabled construction of highly expressive architectures that often require specialized hardware for efficient productization [6, 7, 8, 9]. CPU-only, single-instance, single-pass alternatives are fewer, and revolve around highly optimized C++ or Rust-based approaches that exploit consumer hardware as much as possible. The latter is the main focus of this paper (overview in Figure 1).

2. Fwumious Wabbit (FW) - an overview
We proceed with a discussion of Fwumious Wabbit (FW), an in-house, Rust-based factorization machine-based system currently used in production for large-scale recommendation¹.

2.1. Origins of FW and Vowpal Wabbit (VW)
FW derives from Vowpal Wabbit (VW) [10], a high-performance, scalable open-source ML system recognized for its efficiency on large datasets². While VW primarily uses logistic regression for tasks like click-through rate prediction, it lacks readily available advanced extensions found in the domain of factorization machines. One of the more expressive variations of factorization machines is the Field-aware Factorization Machine (FFM), described in detail in the works of Juan et al. [11, 12]. Building on this foundation, we enhanced the FFM architecture by integrating elements of deep learning: specifically, a multi-layer perceptron (MLP)-like structure is used in conjunction with the traditional FFM (and logistic regression) components. The architecture's computational complexity, a notable challenge, contributes to its rarity in existing benchmarks. When implemented in standard frameworks like TensorFlow, the architecture struggles to scale effectively for practical use.
   Despite these challenges, our deep learning-extended FFM method demonstrated significant performance gains over other tested algorithms in internal assessments. However, scaling this method was not straightforward. It was

AdKDD Workshop 2024
* Corresponding author.
Email: bskrlj@outbrain.com (B. Škrlj)
ORCID: 0000-0002-9916-8756 (B. Škrlj)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073
¹ The engine with main implementations discussed in this paper is freely available at https://github.com/outbrain/fwumious_wabbit.
² https://vowpalwabbit.org/
[Figure 2 diagram; blocks: LR, FFM, Embeddings, DiagMask, MergeNormLayer, DNN, Activations]
Figure 2: Architecture of implemented CPU-based DeepFFMs. Main blocks are the neural network (gray), logistic (yellow) and FFM (red) ones.

only through invoking BLAS [13] that we achieved critical performance enhancements, allowing for practical full-scale deployment³. An overview of the architecture is shown in Figure 2. Key parts of the architecture are

$$\mathrm{lr}(w, x) = \sum_{j=1}^{n} w_j \cdot x_j + b; \qquad \mathrm{ffm}(w, x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (w_{j_1,f_2} \cdot w_{j_2,f_1}) \cdot x_{j_1} x_{j_2}.$$

Neural part (matrix form),

$$\mathrm{ffnn}(W_{1,2,\dots,n}, X) = a_n(\dots a_2(a_1(X \cdot W_1) \cdot W_2) \dots) \cdot W_n,$$

takes as input both FFM and LR's outputs, i.e.

$$\mathrm{dffm}(W_{1,2,\dots,n}, w_b, w_c, x) = \mathrm{ffnn}(W_{1,2,\dots,n}, \mathrm{MergeNormLayer}(\mathrm{lr}(w_b, x), \mathrm{DiagMask}(\mathrm{ffm}(w_c, x)))).$$

Here, MergeNormLayer represents the operator that combines outputs of the FFM and LR parts and applies normalization. Further, DiagMask represents a diagonal mask of the FFM space, halving the number of combinations that require downstream processing⁴.

2.2. Criteo, Avazu and KDD2012 - a benchmark and stability analysis
Even though we evaluated FW extensively on internal data sets (and online, in A/B tests), where it showed consistent dominance, results on published data sets such as Criteo are also of relevance for dissemination of engines' behavior and overall performance. In this section we overview a benchmark we conducted to assess general behavior of VW and FW. We also implemented DCNv2 [14, 15], a TensorFlow-based strong baseline⁵. For the considered data sets (Criteo⁶, Avazu⁷ and KDD2012⁸), a log transform of continuous features was applied and no additional data pruning (rare values etc.) was conducted (as is done in our system)⁹. The hyperparameters considered include power of t, learning rates for different types of blocks (ffm, lr), and regularization amount (L2 norm, VW). For DCNv2 we considered different learning rates, cross layer numbers, dropout rates and beta parameters. Results of the benchmark are summarized in Figure 3. For each data set, the algorithms considered are visualized as AUC scores computed in a rolling window of 30k instances¹⁰.
   The trace in each plot represents the average performance (95% CI), and light-gray regions represent model evaluations that were out-of-distribution; this aspect is particularly relevant for understanding stability of different approaches and their sensitivity to hyperparameter configurations. For example, we observed that adding deep layers to VW models in most cases resulted in worse performance. Carefully tuned VW hyperparameters yielded sufficient performance; however, they indicate potentially cumbersome model search (when considering new use cases/data) in practice. Similar behavior was observed for DCNv2. The dotted black lines represent the overall best single-window performance, and performance on a given data set's test set¹¹. Overall, initial phases of learning revealed VW's capability to adapt with less data, whereas the DeepFFMs dominate after enough data is seen by the engines. DCNv2 showed superior performance on Criteo, yet not on other data sets (all features considered). The benchmark demonstrates that progressively more complex architectures tend to result in better modeling capabilities, and with them, better AUCs in this benchmark. In terms of runtime, on the same hardware, the Criteo data set could be processed on average in 32min by VW, and 31min by FW (linear model vs. DeepFFM). Deep VW variations took substantially longer, around 65min on average (batch size of 2k). This result indicates that FW enables more powerful models within the same time bounds for training. The DCNv2 (CPU) baseline was 30%-50% slower compared to DeepFFM runs. These statistics were obtained based on tens of thousands of runs that represented different algorithm configurations (both hyperparameters and field specifications). Being CPU-based, the described approaches enable seamless scaling to commodity hardware, resulting in lower training and inference costs in practice.

3. FW in practice: Service Architecture overview
This section aims to facilitate understanding of subsequently discussed optimizations that were put in place to enable scaling of Deep FFMs. The implemented FW contains both training and inference logic. The training logic is relevant for incrementally training more than a hundred models, online, every n minutes (depends on the model). Training jobs are separate deployments that automatically query for relevant chunks of data, download them, update the model based on existing weights and send the weights to the serving layer. The serving layer reconstructs the final inference weights on the fly via a patching mechanism discussed in Section 6, and exposes the weights as part of the serving service that handles millions of requests with new data. Based on the effect of predictions, data is streamed back to the system as training

³ https://github.com/outbrain/fwumious_wabbit/blob/main/src/block_neural.rs
⁴ See https://github.com/outbrain/fwumious_wabbit/blob/main/src/regressor.rs for more details.
⁵ Unique hash was assigned to each value for this baseline for ease of implementation.
⁶ https://www.kaggle.com/c/criteo-display-ad-challenge
⁷ https://www.kaggle.com/c/avazu-ctr-prediction/data
⁸ https://www.kaggle.com/c/kddcup2012-track2
⁹ Such minimal pre-processing is within reach of a regular production system.
¹⁰ RIG and Log-loss scores are aligned with AUC-based results, hence only these are reported for readability purposes.
¹¹ For KDD, we took the last 2m instances to capture apparent variability in data better; other data sets are split as reported in their origin publications.
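
As a rough illustration of the blocks defined in Section 2.1 (lr, ffm, ffnn and their combination dffm), the following pure-Python sketch wires them together for a single example. It is not the FW implementation, which lives in the Rust sources referenced in the footnotes above. Several details are assumptions made only for this sketch: the layout `v[j][f]` for latent vectors, the helper names, standardization as the normalization inside `merge_norm`, ReLU as the activation, and the final sigmoid. Iterating only over pairs with j1 < j2 stands in for DiagMask.

```python
import math
from itertools import combinations


def lr(w, b, x):
    """Logistic-regression block: sum_j w_j * x_j + b (before any sigmoid)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def ffm_pairs(v, fields, x):
    """Field-aware interactions, one value per feature pair with j1 < j2.

    v[j][f] is the latent vector feature j uses against field f (assumed
    layout). Visiting only j1 < j2 plays the role of DiagMask: the diagonal
    and the symmetric lower triangle are never materialized.
    """
    out = []
    for j1, j2 in combinations(range(len(x)), 2):
        dot = sum(a * c for a, c in zip(v[j1][fields[j2]], v[j2][fields[j1]]))
        out.append(dot * x[j1] * x[j2])
    return out


def merge_norm(lr_out, ffm_out, eps=1e-6):
    """Concatenate LR and FFM outputs, then standardize (assumed normalization)."""
    merged = [lr_out] + ffm_out
    mean = sum(merged) / len(merged)
    var = sum((m - mean) ** 2 for m in merged) / len(merged)
    return [(m - mean) / math.sqrt(var + eps) for m in merged]


def ffnn(weights, h):
    """MLP over row-major matrices: ReLU after every layer except the last."""
    for i, W in enumerate(weights):
        h = [sum(hi * W[r][c] for r, hi in enumerate(h)) for c in range(len(W[0]))]
        if i < len(weights) - 1:
            h = [max(z, 0.0) for z in h]
    return h


def dffm(weights, w_b, b, v, fields, x):
    """DeepFFM forward pass: ffnn(MergeNormLayer(lr, DiagMask(ffm)))."""
    merged = merge_norm(lr(w_b, b, x), ffm_pairs(v, fields, x))
    logit = ffnn(weights, merged)[0]
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid output is an assumption


# Toy example: 3 features, each in its own field, latent dimension 2.
x = [1.0, 1.0, 1.0]
fields = [0, 1, 2]
v = [[[0.1 * (j + 1), 0.2] for _ in range(3)] for j in range(3)]
weights = [[[0.1, -0.1], [0.2, 0.1], [-0.1, 0.3], [0.05, 0.0]], [[0.5], [-0.5]]]
p = dffm(weights, w_b=[0.1, -0.2, 0.3], b=0.0, v=v, fields=fields, x=x)
print(p)
```

With n features the masked FFM block emits n(n-1)/2 values, so the first MLP layer in the toy example has 1 + 3 = 4 inputs.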
[Figure 3 plots; one row of panels per data set; panels per row: VW-linear, VW-mlp, FW-DeepFFM, FW-FFM, DCNv2; y-axis: AUC (ws=30k), ticks 0.5 to 0.8; x-axis: #inst. (×10^7)]

Figure 3: Visualization of overall performance of different algorithms (single-pass) across different benchmark data sets (top-down: Criteo, Avazu, kddcup2012). Visualizations show traces of all trained models (per engine).
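
The windowed AUC traces of Figure 3 can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the evaluation code used for the benchmark: the function names are hypothetical, and the use of consecutive, non-overlapping windows is an assumption, since the text only states that AUC is computed in a rolling window of 30k instances.

```python
def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation; ties share average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)  # labels are expected to be 0/1 integers
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None  # AUC is undefined when a window holds a single class
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def windowed_auc(labels, scores, window=30_000):
    """One AUC value per consecutive window of `window` instances."""
    return [auc(labels[s:s + window], scores[s:s + window])
            for s in range(0, len(labels), window)]


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(windowed_auc([0, 1, 0, 1], [0.2, 0.9, 0.8, 0.1], window=2))
```

Returning `None` for single-class windows keeps such windows visible in the trace instead of silently dropping them, which matters for sparse click data.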



Table 1
Stability analysis and overall performance. Rows with max test set performance highlighted (marked with * here).

Avazu (window=30k)
algo          avg      median   max      std      min      test
VW-linear     0.6832   0.7016   0.8200   0.0668   0.4664   0.7596
VW-mlp        0.6755   0.6984   0.8200   0.0748   0.4664   0.7596
FW-DeepFFM    0.7648   0.7654   0.8507   0.0243   0.4764   0.7916 *
FW-FFM        0.7524   0.7524   0.8234   0.0227   0.4816   0.7693
DCNv2         0.7750   0.7745   0.8326   0.0202   0.5005   0.7763

Criteo (window=30k)
algo          avg      median   max      std      min      test
VW-linear     0.7340   0.7460   0.8219   0.0556   0.4768   0.7920
VW-mlp        0.7247   0.7425   0.8211   0.0670   0.4768   0.7920
FW-DeepFFM    0.7655   0.7689   0.8053   0.0179   0.4796   0.7803
FW-FFM        0.7578   0.7621   0.8020   0.0198   0.4682   0.7742
DCNv2         0.8042   0.8052   0.8370   0.0118   0.4958   0.8085 *

KDDCup2012 (window=30k)
algo          avg      median   max      std      min      test
VW-linear     0.6333   0.6419   0.8336   0.0807   0.3430   0.7688
VW-mlp        0.6309   0.6402   0.8336   0.0869   0.3759   0.7688
FW-DeepFFM    0.7323   0.7400   0.8781   0.0414   0.3687   0.7967 *
FW-FFM        0.7228   0.7318   0.8382   0.0391   0.3651   0.7641
DCNv2         0.7589   0.7610   0.8718   0.0301   0.4792   0.7734

data (a feedback loop). The training jobs are Python-based services that interact with the binary via process invocations. Serving binds the inference capabilities with the serving (Java) service directly via a foreign function interface (ffi)¹². The architecture enables separation of concerns: training jobs are separate from inference jobs, albeit at the cost of needing to send the updated weight data between services; this is one of the key performance bottlenecks that was addressed in this work. An overview of the scope of this paper is shown in Figure 1.

4. Model training improvements
We next discuss main improvements implemented at the level of training jobs and offline research.

4.1. Speeding up model warm-up phase
Model warm-up corresponds to a phase in model training where the model starts with past data, and "catches up" with present data as fast as possible. We identified efficient data pre-fetching as a crucial optimization for speeding up this process. By implementing async learning cycles, multiple rounds of "future" data can be downloaded upfront, making sure the learning engine has a constant influx of data. Data pre-fetch in practice results in up to 4x faster pre-warming. Within the cloud environment where the jobs are deployed, we can control machine "taints", i.e. signatures that determine their hardware profile. Pre-warm jobs have dedicated taints, which in practice results in machines that are newer and stronger.

4.2. Hogwild-based training
An optimization that significantly improved model pre-warm time is the previously reported Hogwild-based model training [16], implemented also for the Fwumious framework (as part of this work). Here, weight overlaps/overrides are allowed as the trade-off for multi-threaded updates. By tuning Hogwild capacity to tainted machines, we observed multi-fold speedups in model warm-up. In practice, the times for bigger models went from multiple weeks to days, and in most cases around a day of training (to catch up). Weight degradation due to Hogwild was A/B tested and does not appear to cause any noticeable RPM drops. A summary of Hogwild-based training compared to control (no such training) is shown in Table 2. Hogwild has also shown substantial benefits when utilized during online training (e.g., every 5min), and enabled scaling to models that are 100% bigger. To the best of our knowledge, this is one of the first demonstrations of consistent Hogwild-based training improvements for Deep FFMs.

¹² https://github.com/outbrain/fwumious_wabbit/blob/main/src/lib.rs
Table 2
Impact of Hogwild-based training.
   Implementation         Warmup time (same period)   Online training (same period)
   FW-deepFFM-control     8d                          20m
   FW-deepFFM-hogwild     23h (48 threads)            4m (4 threads)
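
To make the Hogwild update pattern concrete, here is a minimal sketch of lock-free SGD in the sense of [16], written in Python rather than FW's Rust. Everything in it is an illustrative assumption: the synthetic sparse regression task, the function names (`make_sparse_data`, `hogwild_sgd`, `mse`) and the hyperparameters are hypothetical and are not taken from FW. Because of Python's global interpreter lock the sketch shows only the unsynchronized update pattern, not the speedups reported in Table 2.

```python
import random
import threading


def make_sparse_data(n_examples=2000, n_features=50, active=3, seed=0):
    """Synthetic sparse regression data: each example activates a few binary features."""
    rng = random.Random(seed)
    true_w = [rng.uniform(-1.0, 1.0) for _ in range(n_features)]
    data = []
    for _ in range(n_examples):
        idx = rng.sample(range(n_features), active)
        data.append((idx, sum(true_w[i] for i in idx)))
    return data


def mse(w, data):
    """Mean squared error of the linear model w on the data."""
    return sum((sum(w[i] for i in idx) - y) ** 2 for idx, y in data) / len(data)


def hogwild_sgd(data, n_features, n_threads=4, epochs=20, lr=0.1):
    """Run SGD from several threads on one shared weight list, without any lock."""
    w = [0.0] * n_features  # shared weights, deliberately unprotected

    def worker(shard, seed):
        rng = random.Random(seed)
        for _ in range(epochs):
            rng.shuffle(shard)
            for idx, y in shard:
                err = sum(w[i] for i in idx) - y  # may read slightly stale values
                for i in idx:
                    w[i] -= lr * err  # unsynchronized update of a few coordinates

    threads = [
        threading.Thread(target=worker, args=(data[t::n_threads], t))
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w


data = make_sparse_data()
w = hogwild_sgd(data, n_features=50)
print(f"MSE before: {mse([0.0] * 50, data):.4f}, after: {mse(w, data):.6f}")
```

Each example touches only a few coordinates, so concurrent updates rarely collide; this sparsity is the property that makes the occasional lost update tolerable.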


4.3. Sparse weight updates
The next optimization concerns how gradients are accounted for during model optimization itself.
We observed that the deep layers, although a minority of the parameters compared to the FFM part,
take up a considerable amount of time during optimization. To remedy this, we identified an
optimization opportunity that combines the activation function used in most models,
$f(x) = \max(x, 0)$, with the specific implementation of FW. Because zero global gradient scenarios
can be identified upfront, prior to updating any weights, whole branches of computation can be
skipped with no impact on learning. Training speed improved across the board by 30% for most
models, and by up to 3.5x for deeper ones (see Table 3). We observed that at most two hidden layers
were feasible for production, hence speedups beyond the observed 30% were not attainable in
practice. This optimization is possible due to the nature of ReLU: the activation maps negative
inputs to zero, so the gradient through the affected units is also zero, which identifies the
compute branches that can be skipped during updates.

5. Model serving improvements
We continue with an overview of optimizations for CPU-based model inference, starting with context
caching. Each request can be separated into context and candidates. For all candidates in a request
the context is the same, even though the features of the recommended content differ; part of the
feature space is therefore constant within each candidate batch. To exploit this property, a
dedicated serving-level caching scheme was put in place. FW first does an additional pass over the
context part only, where it identifies and caches frequent parts of the context. On subsequent
candidate passes it reuses this information on the fly instead of recalculating it for each
context-candidate pair. The deployment impact of context caching is shown in Figure 4 (footnote 13).

(SIMD) instruction-aware forward pass. Another optimization particular to inference is proper
exploitation of SIMD intrinsics. These hardware instruction-level optimizations needed to be
implemented carefully because the serving hardware is not homogeneous: on-the-fly instruction
detection, and subsequent selection of the appropriate binary, had to be put in place. SIMD
intrinsics were successfully used to speed up the forward pass (inference) with no loss in RPM
performance, and resulted in a consistent 20% speedup for all serving (footnote 14). A real-life
example of deployed SIMD-based FW vs. the control (no SIMD) is shown in Figure 5. Up to 25% faster
inference (and with it lower resource utilization) was observed.

Figure 4: Impact of context caching on inference time.
Figure 5: Relative impact of SIMD-enabled (blue, after drop) vs. SIMD-disabled (purple) FW in production (inference).

13
     https://github.com/outbrain/fwumious_wabbit/blob/main/src/radix_tree.rs

6. Storage and transfer optimization
As discussed in previous sections, training and serving jobs are separated. This separation of
concerns, albeit easier to maintain, has a major drawback: weights must be sent across the network.
Model weights need to be constantly updated, which incurs substantial bandwidth costs. For example,
hundreds of live models that take up to 10 GB of memory (per update) are constantly transferred
across the network, resulting in a substantial bandwidth overhead to ensure low-latency online
serving.

Model patching. The first improvement we implemented is the concept of model patching. This process
is inspired by the application of software patches (in general), albeit tailored to the internal
structure of FW's weights. Each trained model consists of the model weights and the optimizer's
weights. The latter are not required for actual inference, which immediately reduces the required
space by half. Further, each subsequent inference weights update (inference weights can be multiple
GB) first computes a model diff, i.e., the byte-level difference between old and new weights. This
is possible due to the consistent memory-level structure of weight files. The diffs are compressed,
sent to the serving layer, unpacked and applied to the previous weights file to obtain the new set
of (inference) weights. This process takes tens of seconds, but further reduces the network
footprint several-fold (less than a GB of updates per model after patching Deep FFMs).
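
The byte-level patching idea of this section, including the relative offsets and compact integers it describes, can be sketched as follows. This is a minimal illustration under stated assumptions, not the FW patcher: the LEB128-style varint stands in for the unspecified custom integer type, `zlib` stands in for the unspecified compression step, both files are assumed to have equal length, and all function names are hypothetical.

```python
import zlib


def encode_varint(n):
    """Encode a non-negative int in 7-bit groups (small values take one byte)."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more groups follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint starting at buf[pos]; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def make_patch(old, new):
    """Record (relative offset, new byte) for every changed byte, then compress."""
    assert len(old) == len(new), "sketch assumes a fixed weight-file layout"
    body = bytearray()
    last = 0
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            body += encode_varint(i - last)  # distance from the previous change
            body.append(b)
            last = i
    return zlib.compress(bytes(body))


def apply_patch(old, patch):
    """Rebuild the new weight bytes from the old bytes and a patch."""
    buf = zlib.decompress(patch)
    out = bytearray(old)
    pos = idx = 0
    while pos < len(buf):
        gap, pos = decode_varint(buf, pos)
        idx += gap
        out[idx] = buf[pos]
        pos += 1
    return bytes(out)


old = bytes(range(256)) * 64          # stand-in for a previous weights file
changed = bytearray(old)
for position in (0, 5, 4000, 16383):  # a few bytes change between updates
    changed[position] ^= 0xFF
new = bytes(changed)

patch = make_patch(old, new)
print(len(new), "bytes in full file,", len(patch), "bytes in patch")
```

Storing gaps instead of absolute positions keeps most integers small, which is what lets a variable-length encoding pay off.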
Table 3
Speedups observed due to sparse weight updates.
   #Hidden layers              1      2      3      4
   Speedup (sparse updates)    1.3x   1.8x   2.4x   3.5x

14
     https://github.com/outbrain/fwumious_wabbit/blob/main/src/block_ffm.rs

First, instead of storing absolute indices of bytes that
change, relative locations are stored, resulting in a considerable storage saving. Next, the small
integers denoting these differences are stored as a custom integer type: instead of storing whole
ints, compressed versions are stored (small ints benefit the most), leading to further improvements
(footnote 15). As the patcher works at the level of bytes, we also successfully tested it for
internal TensorFlow-based flows (reduced bandwidth for sending models).

Weight quantization. Inspired by recent weight quantization advancements in the field of large
language models [17, 18], we implemented a variation of 16-bit weight quantization that, when
combined with the byte-level patching mechanism, offered considerable bandwidth and model storage
improvements. The quantization algorithm was designed to account for the following use-case
specific properties. First, by ensuring consistently small weight patches, the quantization ensures
consistently smaller network load. Second, the quantization and dequantization procedures must be
fast, as they need to happen within a designated time window after each training round (the
procedure has tens of seconds at most at its disposal for the full weight space). Finally, the
algorithm needs to be able to dynamically select viable weight ranges, as we observed considerable
variation in weight update sizes based on, e.g., time of day (traffic amount).

The final version of the algorithm can be summarized as follows. For each online model update
(e.g., 5 min window), weights are first traversed to obtain the minimum and maximum weight values.
These statistics are required to dynamically determine the range of relevant weight bins, as the
number of possible values for a 16-bit representation is small (around 65k). Let
$W = \{w_1, w_2, \dots, w_n \mid w_i \in \mathbb{R}\}$ denote the set of all ($n$) weights and
$b_{\max}$ denote the number of possible weight buckets. Once the minimum and maximum are obtained,
the bucket size is computed as

$$\text{bucket}_s = \frac{\max(W).\text{round}(\alpha) - \min(W).\text{round}(\beta)}{b_{\max}}.$$

Note that the maximum and minimum are rounded to $\alpha$ and $\beta$ decimals, respectively. This
consideration stems from empirical results indicating that full-precision bounds result in less
stable patch sizes (footnote 16). When constraining the minimum and maximum to a certain precision,
behavior stabilized whilst preserving performance and online behavior. In the second pass, weights
are quantized: for each weight, its 16-bit representation is computed and stored. This results in
computing

$$\left(\frac{w_i - \min(W)}{\text{bucket}_s}\right).\text{round}().\text{castTo16b}().\text{convertToBytes}(),$$

i.e., a set of bytes that represent a certain weight bucket. Bytes are stored in FW weight format
and re-used during inference. An important detail concerns the metadata required to perform this
type of quantization: the original weights file is enriched with a header that contains the bucket
size and the weight minimum. These two properties are sufficient for efficient weight
reconstruction when and where relevant (footnote 17). Results on a representative CTR model are
shown in Table 4. Metrics of interest are the time to produce a patch and the final patch/weight
update size. Patching and quantization result in up to 30x smaller model updates.

Table 4
Impact of model quantization on the global production CTR model.
   Weight processing               Avg. time spent   Update file size
   no processing (baseline)        /                 100%
   fw-quantization                 2s                50%
   fw-patcher                      45s               30±5%
   fw-patcher + fw-quantization    8s                3±2%

Note that weight patching and quantization on their own already at least halve the size of weights
that are used in serving and production. Further, by combining the two approaches, we observed a
non-linear improvement in patch sizes: around 10x smaller updates are regularly produced. The
quantized patches-based model showed small lifts in an online A/B test against a control with no
quantization applied, considerably reducing the required network bandwidth with a small positive
business impact (+0.15% RPM). The speedup in a real-life production system due to the compound
effect of quantization and patching can be observed in Figure 6. The rightmost part of the plot
represents the total time spent patching and computing quantized weights.

Figure 6: Speedup observed when jointly using quantization and model patching (as opposed to just patching).

7. Conclusions and open problems
In this paper, we presented a collection of implementation details for scaling CPU-based DeepFFMs
to operate at a multi-data-center scale, capable of handling hundreds of millions of predictions
per second. We delved into both the offline and online components of our system. In the offline
phase, we covered the complete workflow, including model architecture, enhancements to system
warm-up processes, and bandwidth optimization strategies. Within the online phase, we described two
novel modifications to the inference layer that have yielded significant speed improvements. Our
main algorithms, concepts, and performance benchmarks were discussed in detail, and open-source
implementations of key components were made freely available. The implementation is extensible to
other FFM-based variants. As further work, on the inference side, implementing quantization
techniques could accelerate the forward pass by using integer-based operations [19]. Improved
weight sharing and memory mapping could offer training improvements.
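
As an illustration of the two-pass quantization procedure from Section 6, the sketch below computes the bucket size from rounded bounds, stores 16-bit codes behind a small header, and reconstructs approximate weights from that header. It is written under several assumptions that the text does not settle: the rounded minimum is used as the offset, codes falling outside the rounded range are clamped, $b_{\max}$ is taken as 65535, the header is two little-endian doubles, and a zero range falls back to a bucket size of 1.0. The function names are hypothetical and the layout is not the FW weight format.

```python
import struct

B_MAX = 65535  # assumed number of buckets; also the largest unsigned 16-bit code


def quantize(weights, alpha=2, beta=2, b_max=B_MAX):
    """Return header (bucket size, minimum) followed by one 16-bit code per weight."""
    # First pass: bounds, rounded to alpha / beta decimals.
    w_max = round(max(weights), alpha)
    w_min = round(min(weights), beta)
    bucket = (w_max - w_min) / b_max
    if bucket == 0.0:
        bucket = 1.0  # assumed guard for a degenerate (constant) weight range
    # Second pass: map each weight to its bucket index, clamped to 16 bits.
    codes = [min(b_max, max(0, round((w - w_min) / bucket))) for w in weights]
    header = struct.pack("<dd", bucket, w_min)
    return header + struct.pack(f"<{len(codes)}H", *codes)


def dequantize(blob):
    """Rebuild approximate weights from the header and the 16-bit codes."""
    bucket, w_min = struct.unpack_from("<dd", blob, 0)
    n = (len(blob) - 16) // 2
    codes = struct.unpack_from(f"<{n}H", blob, 16)
    return [w_min + c * bucket for c in codes]


weights = [-0.731, -0.2, 0.0, 0.05, 0.4999, 0.912]
blob = quantize(weights)
restored = dequantize(blob)
errors = [abs(a - b) for a, b in zip(weights, restored)]
print(len(blob), "bytes;", "max reconstruction error:", max(errors))
```

With two-decimal bounds, weights inside the rounded range are recovered to within half a bucket, while the extremes can be clipped by up to the rounding step. Rounding the bounds also means that small drifts in the extremes between updates often leave the bucket grid unchanged, so many codes can stay byte-identical.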
15
   https://github.com/outbrain/fwumious_wabbit/blob/main/weight_patcher
16
   (quantization output tended to fluctuate more)
17
   https://github.com/outbrain/fwumious_wabbit/blob/main/src/quantization.rs

References
 [1] J. Zhang, K. Bao, Y. Zhang, W. Wang, F. Feng, X. He, Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation, in: Proceedings of the 17th ACM Conference on Recommender Systems, 2023, pp. 993–999.
 [2] S. Zhang, Y. Tay, L. Yao, A. Sun, C. Zhang, Deep learning for recommender systems, in: Recommender Systems Handbook, Springer, 2021, pp. 173–210.
 [3] Y. Deldjoo, M. Schedl, P. Cremonesi, G. Pasi, Recommender systems leveraging multimedia content, ACM Computing Surveys (CSUR) 53 (2020) 1–38.
 [4] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
 [5] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems 32, Curran Associates, Inc., 2019, pp. 8024–8035. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
 [6] W. Song, C. Shi, Z. Xiao, Z. Duan, Y. Xu, M. Zhang, J. Tang, Autoint: Automatic feature interaction learning via self-attentive neural networks, in: Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 1161–1170.
 [7] J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, G. Sun, xdeepfm: Combining explicit and implicit feature interactions for recommender systems, in: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, 2018, pp. 1754–1763.
 [8] H.-T. Cheng, L. Koc, J. Harmsen, T. Shaked, T. Chandra, H. Aradhye, G. Anderson, G. Corrado, W. Chai, M. Ispir, et al., Wide & deep learning for recommender systems, in: Proceedings of the 1st workshop on deep learning for recommender systems, 2016, pp. 7–10.
 [9] H. Guo, R. Tang, Y. Ye, Z. Li, X. He, Deepfm: a factorization-machine based neural network for ctr prediction, arXiv preprint arXiv:1703.04247 (2017).
[10] A. Bietti, A. Agarwal, J. Langford, A contextual bandit bake-off, arXiv:1802.04064v3 [stat.ML], 2018. URL: https://www.microsoft.com/en-us/research/publication/a-contextual-bandit-bake-off-2/.
[11] Y. Juan, D. Lefortier, O. Chapelle, Field-aware factorization machines in a real-world online advertising system, in: Proceedings of the 26th International Conference on World Wide Web Companion, 2017, pp. 680–688.
[12] Y. Juan, Y. Zhuang, W.-S. Chin, C.-J. Lin, Field-aware factorization machines for ctr prediction, in: Proceedings of the 10th ACM conference on recommender systems, 2016, pp. 43–50.
[13] L. S. Blackford, A. Petitet, R. Pozo, K. Remington, R. C. Whaley, J. Demmel, J. Dongarra, I. Duff, S. Hammarling, G. Henry, et al., An updated set of basic linear algebra subprograms (blas), ACM Transactions on Mathematical Software 28 (2002) 135–151.
[14] R. Wang, R. Shivanna, D. Cheng, S. Jain, D. Lin, L. Hong, E. Chi, Dcn v2: Improved deep & cross network and practical lessons for web-scale learning to rank systems, in: Proceedings of the web conference 2021, 2021, pp. 1785–1797.
[15] W. Shen, Deepctr: Easy-to-use, modular and extendible package of deep-learning based ctr models, https://github.com/shenweichen/deepctr, 2017.
[16] B. Recht, C. Re, S. Wright, F. Niu, Hogwild!: A lock-free approach to parallelizing stochastic gradient descent, Advances in neural information processing systems 24 (2011).
[17] B. Rokh, A. Azarpeyvand, A. Khanteymoori, A comprehensive survey on model quantization for deep neural networks, arXiv preprint arXiv:2205.07877 (2022).
[18] H. Bai, L. Hou, L. Shang, X. Jiang, I. King, M. R. Lyu, Towards efficient post-training quantization of pre-trained language models, Advances in Neural Information Processing Systems 35 (2022) 1405–1418.
[19] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, D. Kalenichenko, Quantization and training of neural networks for efficient integer-arithmetic-only inference, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713.