<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Bag of Tricks for Scaling CPU-based Deep FFMs to more than 300m Predictions per Second</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Blaž</forename><surname>Škrlj</surname></persName>
							<email>bskrlj@outbrain.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benjamin</forename><surname>Ben-Shalom</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Grega</forename><surname>Gašperšič</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Adi</forename><surname>Schwartz</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ramzi</forename><surname>Hoseisi</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Naama</forename><surname>Ziporin</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Davorin</forename><surname>Kopič</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andraž</forename><surname>Tori</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Outbrain Inc</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">A Bag of Tricks for Scaling CPU-based Deep FFMs to more than 300m Predictions per Second</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">AE9E68D4E395D39FC081679EA024905C</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T20:14+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Data Stream Mining</term>
					<term>Factorization Machines</term>
					<term>Online Learning</term>
					<term>Scalable Machine Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Field-aware Factorization Machines (FFMs) have emerged as a powerful model for click-through rate prediction, particularly excelling in capturing complex feature interactions. In this work, we present an in-depth analysis of our in-house, Rust-based Deep FFM implementation, and detail its deployment on a CPU-only, multi-data-center scale. We overview key optimizations devised for both training and inference, demonstrated by previously unpublished benchmark results in efficient model search and online training. Further, we detail an in-house weight quantization that resulted in more than an order of magnitude reduction in bandwidth footprint related to weight transfers across data-centres. We disclose the engine and associated techniques under an open-source license to contribute to the broader machine learning community. This paper showcases one of the first successful CPU-only deployments of Deep FFMs at such scale, marking a significant stride in practical, low-footprint click-through rate prediction methodologies.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The design and development of machine learning approaches for recommendation systems revolves around the interplay between scalability and the approximation capability of classification and regression algorithms. Currently, many deployed recommendation engines rely on factorization machine-based approaches, mostly because of their good trade-offs between scalability, maintainability and the involvement required of data scientists to build such models. Even though contemporary recommenders increasingly rely on language model-based techniques [1], factorization machines remain the de facto solution for large-scale "screening" of candidates that are to be served. Such candidates range from unseen items (online stores) to movie recommendations and ads [2, 3]. The scalability of factorization machines enables real-time systems that handle hundreds of millions of requests in a predictable and maintainable manner. In recent years, two main branches of methods have emerged. Approaches based on frameworks such as TensorFlow [4] and PyTorch [5] enable the construction of highly expressive architectures that often require specialized hardware for efficient productization [6, 7, 8, 9]. CPU-only, single-instance, single-pass alternatives are fewer, and revolve around highly optimized C++ or Rust-based approaches that exploit consumer hardware as much as possible. The latter is the main focus of this paper (overview in Figure <ref type="figure">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Fwumious Wabbit (FW): an overview</head><p>We proceed with a discussion of Fwumious Wabbit (FW), an in-house, Rust-based factorization machine-based system currently used in production for large-scale recommendation <ref type="bibr" target="#b0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Origins of FW and Vowpal Wabbit (VW)</head><p>FW derives from Vowpal Wabbit (VW) [10], a high-performance, scalable open-source ML system recognized for its efficiency on large datasets <ref type="bibr" target="#b1">2</ref>. While VW primarily uses logistic regression for tasks like click-through rate prediction, it lacks readily available advanced extensions found in the domain of factorization machines. One of the more expressive variations of factorization machines is the Field-aware Factorization Machine (FFM), described in detail in the works of Juan et al. [11, 12]. Building on this foundation, we enhanced the FFM architecture by integrating elements of deep learning: specifically, a multi-layer perceptron (MLP)-like structure used in conjunction with the traditional FFM (and logistic regression) components. The computational complexity of this architecture is a notable challenge and contributes to its rarity in existing benchmarks. When implemented in standard frameworks like TensorFlow, the architecture struggles to scale effectively for practical use. Despite these challenges, our deep learning-extended FFM method demonstrated significant performance gains over other tested algorithms in internal assessments. Scaling this method was, however, not straightforward. Only by invoking BLAS [13] did we achieve the critical performance enhancements that allowed for practical full-scale deployment <ref type="bibr" target="#b2">3</ref>. An overview of the architecture is shown in Figure <ref type="figure" target="#fig_0">2</ref>. Key parts of the architecture are</p><formula xml:id="formula_0">\mathrm{lr}(w, x) = \sum_{j}^{n} w_j \cdot x_j + b; \qquad \mathrm{ffm}(w, x) = \sum_{j_1=1}^{n} \sum_{j_2=j_1+1}^{n} (w_{j_1,f_2} \cdot w_{j_2,f_1}) \cdot x_{j_1} x_{j_2}.</formula><p>The neural part (matrix form), \(\mathrm{ffnn}(\mathbf{W}_{1,2,\dots,n}, \mathbf{X}) = a_n(\dots a_2(a_1(\mathbf{X} \cdot \mathbf{W}_1) \cdot \mathbf{W}_2) \dots) \cdot \mathbf{W}_n\), takes as input both the FFM and LR outputs, i.e.
\[\mathrm{dffm}(\mathbf{W}_{1,2,\dots,n}, \mathbf{w}_b, \mathbf{w}_c, \mathbf{x}) = \mathrm{ffnn}(\mathbf{W}_{1,2,\dots,n}, \mathit{MergeNormLayer}(\mathrm{lr}(\mathbf{w}_b, \mathbf{x}), \mathit{DiagMask}(\mathrm{ffm}(\mathbf{w}_c, \mathbf{x})))).\]</p><p>Here, MergeNormLayer represents the operator that combines the outputs of the FFM and LR parts and applies normalization. Further, DiagMask represents a diagonal mask of the FFM space, which roughly halves the number of combinations requiring downstream processing<ref type="foot" target="#foot_1">4</ref>.</p></div>
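<div xmlns="http://www.tei-c.org/ns/1.0"><p>The LR and FFM terms above can be sketched in a few lines of plain Python. This is a minimal illustration and not the FW implementation: the nested-list weight layout, the helper names and the toy numbers are assumptions made for the example, and keeping only pairs with j1 &lt; j2 is one plausible reading of what DiagMask achieves.</p>
<p><![CDATA[
```python
from itertools import combinations


def dot(u, v):
    """Inner product of two equally long latent vectors."""
    return sum(a * b for a, b in zip(u, v))


def lr(w, x, b=0.0):
    """Linear part: sum_j w_j * x_j + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def ffm_pairs(w, x, fields):
    """Pairwise FFM terms for j1 < j2 only (upper triangle of the pair space).

    w[j][f] is the latent vector of feature j towards field f and
    fields[j] is the field of feature j (hypothetical layout).
    """
    terms = []
    for j1, j2 in combinations(range(len(x)), 2):
        v1 = w[j1][fields[j2]]  # feature j1 seen from the field of j2
        v2 = w[j2][fields[j1]]  # feature j2 seen from the field of j1
        terms.append(dot(v1, v2) * x[j1] * x[j2])
    return terms


def ffm(w, x, fields):
    """FFM output: sum of all pairwise terms."""
    return sum(ffm_pairs(w, x, fields))


# Toy example: three features, one field each, latent dimension 2.
x = [1.0, 1.0, 2.0]
fields = [0, 1, 2]
w_ffm = [
    [[1, 0], [1, 0], [0, 1]],  # feature 0 towards fields 0, 1, 2
    [[2, 0], [0, 0], [1, 1]],  # feature 1
    [[0, 3], [1, 1], [0, 0]],  # feature 2
]
w_lr = [0.5, -1.0, 0.25]

pair_terms = ffm_pairs(w_ffm, x, fields)
ffm_out = ffm(w_ffm, x, fields)
lr_out = lr(w_lr, x, b=0.1)
```
]]></p>
<p>For n features the masked pair list has n(n-1)/2 entries instead of n², which is the reduction the diagonal mask is described as providing before the values are merged, normalized and passed to the neural part.</p></div>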
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Criteo, Avazu and KDD2012: a benchmark and stability analysis</head><p>Even though we evaluated FW extensively on internal data sets (and online, in A/B tests), where it showed consistent dominance, results on published data sets such as Criteo are also relevant for communicating the engines' behavior and overall performance. In this section we overview a benchmark we conducted to assess the general behavior of VW and FW. We also implemented DCNv2 [14, 15], a TensorFlow-based strong baseline <ref type="bibr" target="#b4">5</ref>. For the considered data sets (Criteo<ref type="foot" target="#foot_3">6</ref>, Avazu<ref type="foot" target="#foot_4">7</ref> and KDD2012<ref type="foot" target="#foot_5">8</ref>), continuous features were log-transformed and no additional data pruning (rare values etc.) was performed (as is done in our system) <ref type="bibr" target="#b8">9</ref>. The hyperparameters considered include power of t, learning rates for different types of blocks (FFM, LR) and the regularization amount (L2 norm, VW). For DCNv2 we considered different learning rates, cross layer numbers, dropout rates and beta parameters. Results of the benchmark are summarized in Figure <ref type="figure" target="#fig_1">3</ref>. For each data set, the algorithms considered are visualized as AUC scores computed in a rolling window of 30k instances <ref type="bibr" target="#b9">10</ref>. The trace in each plot represents the average performance (95% CI), and light-gray regions represent model evaluations that were out-of-distribution; this aspect is particularly relevant for understanding the stability of different approaches and their sensitivity to hyperparameter configurations. For example, we observed that adding deep layers to VW models in most cases resulted in worse performance. Carefully tuned VW hyperparameters yielded sufficient performance, but indicate a potentially cumbersome model search in practice (when considering new use cases/data). Similar behavior was observed for DCNv2. The dotted black lines represent the overall best single-window performance and the performance on a given data set's test set <ref type="bibr" target="#b10">11</ref>. Overall, the initial phases of learning revealed VW's capability to adapt with less data, whereas the DeepFFMs dominate once the engines have seen enough data. DCNv2 showed superior performance on Criteo, yet not on the other data sets (all features considered). The benchmark demonstrates that progressively more complex architectures tend to result in better modeling capabilities, and with them, better AUCs in this benchmark. In terms of runtime, on the same hardware, the Criteo data set could be processed on average in 32min by VW and in 31min by FW (linear model vs. DeepFFM). Deep VW variations took substantially longer, around 65min on average (batch size of 2k). This result indicates that FW enables more powerful models within the same training time bounds. The DCNv2 (CPU) baseline was 30%-50% slower compared to DeepFFM runs. These statistics were obtained from tens of thousands of runs that represented different algorithm configurations (both hyperparameters and field specifications). Being CPU-based, the described approaches enable seamless scaling to commodity hardware, resulting in lower training and inference costs in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">FW in practice: Service Architecture overview</head><p>This section aims to facilitate understanding of the subsequently discussed optimizations that were put in place to enable scaling of Deep FFMs. The implemented FW contains both training and inference logic. The training logic is used to incrementally train more than a hundred models online, every 𝑛 minutes (𝑛 depends on the model). Training jobs are separate deployments that automatically query for relevant chunks of data, download them, update the existing weights and send the updated weights to the serving layer. The serving layer reconstructs the final inference weights on the fly via a patching mechanism discussed in Section 6, and exposes the weights as part of the serving service that handles millions of requests with new data. Based on the effect of predictions, data is streamed back to the system as training data (a feedback loop). The training jobs are Python-based services that interact with the binary via process invocations. Serving binds the inference capabilities to the serving (Java) service directly via a foreign function interface (FFI) <ref type="bibr" target="#b11">12</ref>.</p><p>The architecture enables separation of concerns: training jobs are separate from inference jobs, albeit at the cost of needing to send the updated weight data between services; this is one of the key performance bottlenecks addressed in this work. An overview of the scope of this paper is shown in Figure <ref type="figure">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Model training improvements</head><p>We next discuss the main improvements implemented at the level of training jobs and offline research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Speeding up model warm-up phase</head><p>Model warm-up corresponds to a phase in model training where a model starts with past data and "catches up" with present data as fast as possible. We identified efficient data pre-fetching as a crucial optimization for speeding up this process. By implementing async learning cycles, multiple rounds of "future" data can be downloaded upfront, making sure the learning engine has a constant influx of data. In practice, data pre-fetching results in up to 4x faster pre-warming. Within the cloud environment where the jobs are deployed, we can control machine "taints", i.e. signatures that determine their hardware profile. Pre-warm jobs have dedicated taints, which in practice results in machines that are newer and stronger.</p></div>
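<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pre-fetching idea can be illustrated with a bounded buffer that a background thread fills while the learner consumes. This is a hedged sketch, not the production code: the use of a thread and a queue (rather than whatever async mechanism FW's training jobs use), the function names and the fake download generator are all assumptions made for the example.</p>
<p><![CDATA[
```python
import queue
import threading


def prefetching(chunks, depth=4):
    """Yield items of `chunks` while a background thread keeps up to
    `depth` future items fetched ahead of the consumer."""
    buf = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def producer():
        for chunk in chunks:
            buf.put(chunk)  # blocks while the buffer is full
        buf.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buf.get()  # blocks only if the producer has fallen behind
        if item is done:
            return
        yield item


def fake_download(n):
    """Hypothetical stand-in for downloading n chunks of training data."""
    for i in range(n):
        yield {"chunk_id": i, "rows": [i, i + 1]}


# The "learner" just records the order in which chunks arrive.
seen = [c["chunk_id"] for c in prefetching(fake_download(5), depth=2)]
```
]]></p>
<p>The buffer depth bounds memory use while letting downloads overlap with learning. Error handling in the producer is omitted here for brevity; a real job would need to propagate download failures to the consumer.</p></div>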
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Hogwild-based training</head><p>An optimization that significantly improved model pre-warm time is the previously reported Hogwild-based model training [16], which we also implemented in FW as part of this work. Here, weight overlaps/overrides are allowed as the trade-off for multi-threaded updates. By tuning Hogwild capacity to tainted machines, we observed multi-fold speedups in model warm-up. In practice, the catch-up times for bigger models went from multiple weeks to days, and in most cases to around a day of training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Sparse weight updates</head><p>The next optimization concerns how gradients are accounted for during model optimization itself. We observed that deep layers, albeit in the minority parameter-wise compared to the FFM part, take up a considerable amount of time during optimization. To remedy this shortcoming, we identified an optimization opportunity that combines the activation function used in most models, 𝑓 (𝑥) = max(𝑥, 0), with the specific implementation of FW.</p><p>By realizing that we can identify zero global gradient scenarios upfront, prior to updating any weights, we could skip whole branches of computation with no impact on learning.</p><p>Training speed, however, improved across the board: by 30% for most models, and by up to 3x for deeper ones; see Table <ref type="table" target="#tab_3">3</ref> for more details. We observed that at most two hidden layers were feasible for production, hence speedups beyond the observed 30% were not attainable in practice. This optimization was possible due to ReLU's nature; this activation maps negative inputs to zero, which makes it possible to identify compute branches that can be skipped during updates.</p></div>
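<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reasoning behind skipping branches can be made concrete for a single ReLU layer: when a unit's pre-activation is not positive, the gradient with respect to its incoming weights is exactly zero, so the corresponding row update can be skipped. The sketch below is an illustration under assumptions (plain SGD, list-of-lists weights, hypothetical function names); it does not reproduce FW's optimizer or its exact skipping criteria.</p>
<p><![CDATA[
```python
import copy


def forward(W, x):
    """Pre-activations z = W x and activations h = max(z, 0) for one layer."""
    z = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
    h = [max(zi, 0.0) for zi in z]
    return z, h


def dense_update(W, x, z, grad_h, lr=0.1):
    """Reference SGD step that touches every weight of the layer."""
    for i, row in enumerate(W):
        g = grad_h[i] if z[i] > 0.0 else 0.0  # ReLU derivative
        for j, xj in enumerate(x):
            row[j] -= lr * g * xj


def sparse_update(W, x, z, grad_h, lr=0.1):
    """SGD step that skips rows whose gradient is known to be zero.

    Returns the number of rows that were actually updated.
    """
    active = [i for i, zi in enumerate(z) if zi > 0.0 and grad_h[i] != 0.0]
    if not active:  # the whole layer has zero gradient: skip it entirely
        return 0
    for i in active:
        row = W[i]
        for j, xj in enumerate(x):
            row[j] -= lr * grad_h[i] * xj
    return len(active)


W0 = [[1.0, -1.0], [-2.0, 0.5], [0.5, 0.5]]
x = [1.0, 2.0]
grad_h = [0.3, -0.2, 0.4]  # upstream gradient w.r.t. the layer output

z, h = forward(W0, x)      # z = [-1.0, -1.0, 1.5]: only unit 2 is active
W_dense, W_sparse = copy.deepcopy(W0), copy.deepcopy(W0)
dense_update(W_dense, x, z, grad_h)
rows_touched = sparse_update(W_sparse, x, z, grad_h)
```
]]></p>
<p>Both updates end in the same weights, but the sparse variant touches one row out of three in this toy case. The saving grows with the share of inactive units and with depth, which is consistent with the larger speedups reported for deeper models.</p></div>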
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Model serving improvements</head><p>We continue with an overview of CPU-based model inference, starting with context caching. Each request can be separated into context and candidates. For all candidates in a request, the context is the same even though the recommended content's features differ; part of the feature space is therefore identical across each candidate batch. To exploit this property, a dedicated serving-level caching scheme was put in place. FW first does an additional pass over the context part only, in which it identifies and caches frequent parts of the context. On subsequent candidate passes it reuses this information on the fly instead of recalculating it for each context-candidate pair. The deployment impact of context caching is shown in Figure <ref type="figure" target="#fig_2">4</ref><note place="foot" n="13">https://github.com/outbrain/fwumious_wabbit/blob/main/src/radix_tree.rs</note>.</p><p>We next discuss the (SIMD) instruction-aware forward pass. Another optimization particular to inference is the proper exploitation of SIMD intrinsics. These hardware instruction-level optimizations needed to be implemented carefully, as the space of serving hardware is not homogeneous: on-the-fly instruction detection and subsequent selection of the appropriate binary needed to be put in place. SIMD intrinsics were successfully used to speed up the forward pass (inference) with no loss in RPM performance, and resulted in a consistent 20% speedup for all serving <ref type="bibr" target="#b13">14</ref>. A real-life example of deployed SIMD-based FW vs. the control (no SIMD) is shown in Figure <ref type="figure" target="#fig_3">5</ref>. Up to 25% faster inference (and with it lower resource utilization) was observed.</p></div>
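<div xmlns="http://www.tei-c.org/ns/1.0"><p>Why context caching pays off can be seen from how a pairwise score decomposes: interactions among context features do not depend on the candidate, so they can be computed once per request. The following sketch shows only this decomposition for binary features; it is deliberately simpler than FW's scheme (which caches frequent context parts across requests), and the data layout and function names are assumptions for illustration.</p>
<p><![CDATA[
```python
from itertools import combinations


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pair(w, fields, a, b):
    """FFM interaction of two active binary features a and b."""
    return dot(w[a][fields[b]], w[b][fields[a]])


def score_naive(w, fields, context, candidate):
    """Recompute every pairwise interaction for one context-candidate pair."""
    feats = list(context) + list(candidate)
    return sum(pair(w, fields, a, b) for a, b in combinations(feats, 2))


def score_request(w, fields, context, candidates):
    """Score all candidates of a request, computing the context part once."""
    ctx_part = sum(pair(w, fields, a, b) for a, b in combinations(context, 2))
    scores = []
    for cand in candidates:
        cross = sum(pair(w, fields, a, b) for a in context for b in cand)
        own = sum(pair(w, fields, a, b) for a, b in combinations(cand, 2))
        scores.append(ctx_part + cross + own)  # cached part is reused here
    return scores


# Toy request: features 0 and 1 form the context, 2..4 describe candidates.
fields = {0: 0, 1: 1, 2: 2, 3: 2, 4: 3}
w = {j: {f: [float(j + f), float(j * f - 1)] for f in range(4)}
     for j in range(5)}
context = [0, 1]
candidates = [[2, 4], [3, 4]]

cached = score_request(w, fields, context, candidates)
naive = [score_naive(w, fields, context, c) for c in candidates]
```
]]></p>
<p>With a context of c features and k candidates, the naive path evaluates the c(c-1)/2 context pairs k times, while the cached path evaluates them once. Real contexts are typically much larger than in this toy request, so the saved share of work is correspondingly larger.</p></div>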
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Storage and transfer optimization</head><p>As discussed in previous sections, training and serving jobs are separated. This separation of concerns, albeit easier to maintain, has a major drawback: weights must be sent across the network. Model weights need to be constantly updated, which incurs substantial bandwidth costs. For example, hundreds of live models that take up to 10 GB of memory (per update) are constantly transferred across the network, resulting in a substantial bandwidth overhead to ensure low-latency online serving.</p><p>Model patching. The first improvement we implemented is the concept of model patching. This process is inspired by the application of software patches in general, albeit tailored to the internal structure of FW's weights. Each trained model consists of training weights and the optimizer's weights. The latter are not required for actual inference, which immediately reduces the required space by half. Further, each subsequent update of the inference weights (which can be multiple GB) first computes a model diff: the byte-level difference between the old and new weights. This is possible due to the consistent memory-level structure of weight files. The diffs are compressed, sent to the serving layer, unpacked and applied to the previous weights file to obtain the new set of (inference) weights. This process takes tens of seconds, but further reduces the network footprint considerably (less than a GB of updates per model after patching Deep FFMs).</p><p>First, instead of storing absolute indices of the bytes that change, relative locations are stored, resulting in considerable storage savings. Next, the small integers denoting these differences are stored as a custom integer type: instead of whole ints, compressed versions are stored (small ints are impacted the most), leading to further improvements <ref type="bibr" target="#b14">15</ref>. As the patcher works at the level of bytes, we also successfully tested it for internal TensorFlow-based flows (reduced bandwidth for sending models).</p><p>Weight quantization. Inspired by recent weight quantization advancements in the field of large language models [17, 18], we implemented a variation of 16-bit weight quantization that, when combined with the byte-level patching mechanism, offered considerable bandwidth and model storage improvements. The quantization algorithm was designed to account for the following use-case specific properties. First, the quantization must produce consistently small weight patches, and with them a consistently smaller network load. Second, the quantization and dequantization procedures must be fast, as they need to happen within a designated time window after each training round (the procedure has tens of seconds at most at its disposal for the full weight space). Finally, the algorithm needs to be able to dynamically select viable weight ranges, as we observed considerable variation in weight update sizes based on e.g. the time of day (traffic amount). The final version of the algorithm can be summarized as follows.</p><p>For each online model update (e.g., a 5 min window), weights are first traversed to obtain the minimum and maximum weight values. These statistics are required to dynamically determine the range of relevant weight bins, as the number of possible values for a 16-bit representation is small (around 65k). Let \(W = \{w_1, w_2, \dots, w_n \mid w_i \in \mathbb{R}\}\) denote the set of all (\(n\)) weights and \(b_{\max}\) denote the number of possible weight buckets. Once the minimum and maximum are obtained, the bucket size is computed as \[\text{bucket}_s = \frac{\max(W).\text{round}(\alpha) - \min(W).\text{round}(\beta)}{b_{\max}}.\]</p><p>Note that the maximum and minimum are rounded to \(\alpha\) and \(\beta\) decimals, respectively. This consideration stems from empirical results indicating that full-precision bounds result in less stable patch sizes <ref type="bibr" target="#b15">16</ref>. When constraining the minimum and maximum to a certain precision, behavior stabilized whilst preserving performance and online behavior. In the second pass, weights are quantized: for each weight, its 16-bit representation is computed and stored. This results in computing \[((w_i - \min(W)) / \text{bucket}_s).\text{round}().\text{castTo16b}().\text{convertToBytes}(),\]</p><p>i.e. a set of bytes that represent a certain weight bucket. Bytes are stored in the FW weight format and re-used during inference. An important detail concerns the metadata required to perform this type of quantization; the original weights file is enriched with a header that contains the bucket size and the weight minimum. These two properties are sufficient for efficient weight reconstruction when/where relevant <ref type="bibr" target="#b16">17</ref>. Results on a representative CTR model are shown in Table <ref type="table" target="#tab_5">4</ref>. Metrics of interest are the time to produce a patch and the size of the final patch/weight update. Patching and quantization result in up to 30x smaller model updates. Note that weight patching and quantization on their own already at least halve the size of the weights that are used in serving and production. Further, by combining the two approaches, we observed a non-linear improvement in patch sizes: around 10x smaller updates are regularly produced. The quantized patches-based model showed small lifts in an online A/B test against a control with no quantization applied, considerably reducing the required network bandwidth with a small positive business impact (+0.15% RPM). The speedup in a real-life production system due to the compound effect of quantization and patching can be observed in Figure <ref type="figure" target="#fig_4">6</ref>. The rightmost part of the plot represents the total time spent patching and computing quantized weights.</p></div>
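<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quantization and patching steps described in this section can be sketched together as follows. This is an illustrative approximation rather than the FW code: the header layout (two little-endian doubles), the value of B_MAX, the clamping of out-of-range codes, the guard for constant weights and the plain (offset, byte) patch tuples (without the custom compressed integer type) are all assumptions made for the example.</p>
<p><![CDATA[
```python
import struct

B_MAX = 65535  # largest 16-bit code (assumed bucket count, "around 65k")


def quantize(weights, alpha=2, beta=2):
    """Two passes: find rounded bounds, then map each weight to a 16-bit code.

    Output layout (assumed): header with bucket size and minimum as two
    little-endian doubles, followed by one unsigned 16-bit code per weight.
    """
    lo = round(min(weights), beta)
    hi = round(max(weights), alpha)
    bucket = (hi - lo) / B_MAX
    if bucket <= 0.0:  # constant weights: any positive bucket size works
        bucket = 1.0
    # Rounded bounds may exclude extreme weights, hence the clamping.
    codes = [min(max(round((w - lo) / bucket), 0), B_MAX) for w in weights]
    header = struct.pack("<dd", bucket, lo)
    return header + struct.pack("<%dH" % len(codes), *codes)


def dequantize(blob):
    """Reconstruct approximate weights from the header and the codes."""
    bucket, lo = struct.unpack_from("<dd", blob, 0)
    n = (len(blob) - 16) // 2
    codes = struct.unpack_from("<%dH" % n, blob, 16)
    return [lo + c * bucket for c in codes]


def make_patch(old, new):
    """Byte-level diff stored as (offset relative to previous change, byte)."""
    assert len(old) == len(new)
    patch, prev = [], 0
    for i, (a, b) in enumerate(zip(old, new)):
        if a != b:
            patch.append((i - prev, b))
            prev = i
    return patch


def apply_patch(old, patch):
    """Apply a relative-offset patch to the previous weight bytes."""
    out, pos = bytearray(old), 0
    for gap, b in patch:
        pos += gap
        out[pos] = b
    return bytes(out)


old_w = [-0.5, 0.0, 0.25, 1.0]
new_w = [-0.5, 0.0, 0.30, 1.0]  # a single weight moved between updates
old_blob, new_blob = quantize(old_w), quantize(new_w)
patch = make_patch(old_blob, new_blob)
restored = apply_patch(old_blob, patch)
recovered = dequantize(restored)
```
]]></p>
<p>Because the rounded bounds are identical for both updates, the headers match and only the bytes of the changed code differ, so the patch stays tiny. This is one way to read the observation that coarser bounds stabilize patch sizes: with full-precision bounds, a small shift in the minimum or maximum would change the bucket size and with it most codes.</p></div>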
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions and open problems</head><p>In this paper, we presented a collection of implementation details for scaling CPU-based DeepFFMs to operate at a multi-data-center scale, capable of handling hundreds of millions of predictions per second. We delved into both the offline and online components of our system. In the offline phase, we covered the complete workflow, including model architecture, enhancements to system warm-up processes, and bandwidth optimization strategies. For the online phase, we described two novel modifications to the inference layer that have yielded significant speed improvements. Our main algorithms, concepts, and performance benchmarks were discussed in detail, and open-source implementations of key components were made freely available. The implementation is extensible to other FFM-based variants. As further work on the inference side, implementing quantization techniques could accelerate the forward pass by using integer-based operations [19]. Improved weight sharing and memory mapping could offer training improvements.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Architecture of implemented CPU-based DeepFFMs. Main blocks are the neural network (gray), logistic (yellow) and FFM (red) ones.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Visualization of overall performance of different algorithms (single-pass) across different benchmark data sets (top-down: Criteo, Avazu, KDDCup2012). Visualizations show traces of all trained models (per engine).</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Impact of context caching on inference time.</figDesc><graphic coords="4,312.98,68.99,213.69,113.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Relative impact of SIMD-enabled (blue, after drop) vs. SIMD-disabled (purple) FW in production (inference).</figDesc><graphic coords="4,312.98,209.09,213.68,99.82" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Speedup observed when jointly using quantization and model patching (as opposed to just patching).</figDesc><graphic coords="5,312.98,69.00,213.68,123.10" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Stability analysis and overall performance. Rows with max test set performance highlighted.</figDesc><table><row><cell cols="7">Avazu (window=30k)</cell></row>
<row><cell>algo</cell><cell>avg</cell><cell>median</cell><cell>max</cell><cell>std</cell><cell>min</cell><cell>test</cell></row>
<row><cell>VW-linear</cell><cell>0.6832</cell><cell>0.7016</cell><cell>0.8200</cell><cell>0.0668</cell><cell>0.4664</cell><cell>0.7596</cell></row>
<row><cell>VW-mlp</cell><cell>0.6755</cell><cell>0.6984</cell><cell>0.8200</cell><cell>0.0748</cell><cell>0.4664</cell><cell>0.7596</cell></row>
<row><cell>FW-DeepFFM</cell><cell>0.7648</cell><cell>0.7654</cell><cell>0.8507</cell><cell>0.0243</cell><cell>0.4764</cell><cell>0.7916</cell></row>
<row><cell>FW-FFM</cell><cell>0.7524</cell><cell>0.7524</cell><cell>0.8234</cell><cell>0.0227</cell><cell>0.4816</cell><cell>0.7693</cell></row>
<row><cell>DCNv2</cell><cell>0.7750</cell><cell>0.7745</cell><cell>0.8326</cell><cell>0.0202</cell><cell>0.5005</cell><cell>0.7763</cell></row>
<row><cell cols="7">Criteo (window=30k)</cell></row>
<row><cell>algo</cell><cell>avg</cell><cell>median</cell><cell>max</cell><cell>std</cell><cell>min</cell><cell>test</cell></row>
<row><cell>VW-linear</cell><cell>0.7340</cell><cell>0.7460</cell><cell>0.8219</cell><cell>0.0556</cell><cell>0.4768</cell><cell>0.7920</cell></row>
<row><cell>VW-mlp</cell><cell>0.7247</cell><cell>0.7425</cell><cell>0.8211</cell><cell>0.0670</cell><cell>0.4768</cell><cell>0.7920</cell></row>
<row><cell>FW-DeepFFM</cell><cell>0.7655</cell><cell>0.7689</cell><cell>0.8053</cell><cell>0.0179</cell><cell>0.4796</cell><cell>0.7803</cell></row>
<row><cell>FW-FFM</cell><cell>0.7578</cell><cell>0.7621</cell><cell>0.8020</cell><cell>0.0198</cell><cell>0.4682</cell><cell>0.7742</cell></row>
<row><cell>DCNv2</cell><cell>0.8042</cell><cell>0.8052</cell><cell>0.8370</cell><cell>0.0118</cell><cell>0.4958</cell><cell>0.8085</cell></row>
<row><cell cols="7">KDDCup2012 (window=30k)</cell></row>
<row><cell>algo</cell><cell>avg</cell><cell>median</cell><cell>max</cell><cell>std</cell><cell>min</cell><cell>test</cell></row>
<row><cell>VW-linear</cell><cell>0.6333</cell><cell>0.6419</cell><cell>0.8336</cell><cell>0.0807</cell><cell>0.3430</cell><cell>0.7688</cell></row>
<row><cell>VW-mlp</cell><cell>0.6309</cell><cell>0.6402</cell><cell>0.8336</cell><cell>0.0869</cell><cell>0.3759</cell><cell>0.7688</cell></row>
<row><cell>FW-DeepFFM</cell><cell>0.7323</cell><cell>0.7400</cell><cell>0.8781</cell><cell>0.0414</cell><cell>0.3687</cell><cell>0.7967</cell></row>
<row><cell>FW-FFM</cell><cell>0.7228</cell><cell>0.7318</cell><cell>0.8382</cell><cell>0.0391</cell><cell>0.3651</cell><cell>0.7641</cell></row>
<row><cell>DCNv2</cell><cell>0.7589</cell><cell>0.7610</cell><cell>0.8718</cell><cell>0.0301</cell><cell>0.4792</cell><cell>0.7734</cell></row>
<row><cell cols="7">data (a feedback loop). The training jobs are Python-based</cell></row>
<row><cell cols="7">services that interact with the binary via process invocations.</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Impact of Hogwild-based training.</figDesc><table><row><cell>Implementation</cell><cell>Warmup time (same period)</cell></row><row><cell>FW-deepFFM-control</cell><cell>8d</cell></row><row><cell>FW-deepFFM-hogwild</cell><cell>23h (48 threads)</cell></row><row><cell>Implementation</cell><cell>Online training (same period)</cell></row><row><cell>FW-deepFFM-control</cell><cell>20m</cell></row><row><cell>FW-deepFFM-hogwild</cell><cell>4m (4 threads)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Speedups observed due to sparse weight updates.</figDesc><table><row><cell>#Hidden layers</cell><cell>1</cell><cell>2</cell><cell>3</cell><cell>4</cell></row><row><cell>Speedup (sparse updates)</cell><cell>1.3x</cell><cell>1.8x</cell><cell>2.4x</cell><cell>3.5x</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head></head><label></label><figDesc>Results on a representative CTR model are shown in Table 4. Metrics of interest are time to produce patch and the final patch/weight update's size. Patching and quantization result in up to 30x smaller model updates.</figDesc><table /><note><ref type="bibr" target="#b14">15</ref> https://github.com/outbrain/fwumious_wabbit/blob/main/weight_patcher <ref type="bibr" target="#b15">16</ref> (quantization output tended to fluctuate more) <ref type="bibr" target="#b16">17</ref> https://github.com/outbrain/fwumious_wabbit/blob/main/src/quantization.rs</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Impact of model quantization on the global production CTR model.</figDesc><table><row><cell>Weight processing</cell><cell>Avg. time spent</cell><cell>Update file size</cell></row><row><cell>no processing (baseline)</cell><cell>/</cell><cell>100%</cell></row><row><cell>fw-quantization</cell><cell>2s</cell><cell>50%</cell></row><row><cell>fw-patcher</cell><cell>45s</cell><cell>30±5%</cell></row><row><cell>fw-patcher + fw-quantization</cell><cell>8s</cell><cell>3±2%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">https://github.com/outbrain/fwumious_wabbit/blob/main/src/block_neural.rs</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">See https://github.com/outbrain/fwumious_wabbit/blob/main/src/regressor.rs for more details.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">For ease of implementation, a unique hash was assigned to each value for this baseline.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">https://www.kaggle.com/c/criteo-display-ad-challenge</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">https://www.kaggle.com/c/avazu-ctr-prediction/data</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5">https://www.kaggle.com/c/kddcup2012-track2</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6">Such minimal pre-processing is within reach of a regular production system.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7">RIG and Log-loss scores are aligned with the AUC-based results, hence only the latter are reported for readability.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8">For KDD, we took the last 2m instances to better capture the apparent variability in the data; the other data sets are split as reported in their original publications.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_9">https://github.com/outbrain/fwumious_wabbit/blob/main/src/lib.rs</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_10">https://github.com/outbrain/fwumious_wabbit/blob/main/src/block_ffm.rs</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Is ChatGPT fair for recommendation? Evaluating fairness in large language model recommendation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th ACM Conference on Recommender Systems</title>
				<meeting>the 17th ACM Conference on Recommender Systems</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="993" to="999" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Deep learning for recommender systems</title>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Zhang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Recommender Systems Handbook</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="173" to="210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Recommender systems leveraging multimedia content</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Deldjoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schedl</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cremonesi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Pasi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="1" to="38" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Abadi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Barham</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Brevdo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Citro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">S</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Davis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Devin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Ghemawat</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harp</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Irving</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Isard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Jozefowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kudlur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Levenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mané</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Monga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Moore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Murray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Olah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Talwar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tucker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vanhoucke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Vasudevan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Viégas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Warden</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wattenberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wicke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zheng</surname></persName>
		</author>
		<ptr target="https://www.tensorflow.org/" />
		<title level="m">TensorFlow: Large-scale machine learning on heterogeneous systems</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">PyTorch: An imperative style, high-performance deep learning library</title>
		<author>
			<persName><forename type="first">A</forename><surname>Paszke</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Massa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lerer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bradbury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Killeen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gimelshein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Antiga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desmaison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Köpf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Devito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Raison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tejani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chilamkurthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Steiner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
		<ptr target="http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems 32</title>
				<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="8024" to="8035" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">AutoInt: Automatic feature interaction learning via self-attentive neural networks</title>
		<author>
			<persName><forename type="first">W</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM international conference on information and knowledge management</title>
				<meeting>the 28th ACM international conference on information and knowledge management</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1161" to="1170" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">xDeepFM: Combining explicit and implicit feature interactions for recommender systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Lian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Xie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery &amp; data mining</title>
				<meeting>the 24th ACM SIGKDD international conference on knowledge discovery &amp; data mining</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1754" to="1763" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Wide &amp; deep learning for recommender systems</title>
		<author>
			<persName><forename type="first">H.-T</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Koc</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Harmsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Shaked</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Chandra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Aradhye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Anderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ispir</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 1st workshop on deep learning for recommender systems</title>
				<meeting>the 1st workshop on deep learning for recommender systems</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="7" to="10" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1703.04247</idno>
		<title level="m">DeepFM: a factorization-machine based neural network for CTR prediction</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Bietti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Langford</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.04064v3</idno>
		<ptr target="https://www.microsoft.com/en-us/research/publication/a-contextual-bandit-bake-off-2/" />
		<title level="m">A contextual bandit bake-off</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>stat.ML</note>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Field-aware factorization machines in a real-world online advertising system</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Juan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lefortier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Chapelle</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th International Conference on World Wide Web Companion</title>
				<meeting>the 26th International Conference on World Wide Web Companion</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="680" to="688" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Field-aware factorization machines for CTR prediction</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Juan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhuang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-S</forename><surname>Chin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C.-J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th ACM conference on recommender systems</title>
				<meeting>the 10th ACM conference on recommender systems</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="43" to="50" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">An updated set of basic linear algebra subprograms (BLAS)</title>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">S</forename><surname>Blackford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Petitet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Pozo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Remington</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Whaley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Demmel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dongarra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Duff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hammarling</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Henry</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Mathematical Software</title>
		<imprint>
			<biblScope unit="volume">28</biblScope>
			<biblScope unit="page" from="135" to="151" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">DCN V2: Improved deep &amp; cross network and practical lessons for web-scale learning to rank systems</title>
		<author>
			<persName><forename type="first">R</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Shivanna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the web conference 2021</title>
				<meeting>the web conference 2021</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1785" to="1797" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><surname>Shen</surname></persName>
		</author>
		<ptr target="https://github.com/shenweichen/deepctr" />
		<title level="m">DeepCTR: Easy-to-use, modular and extendible package of deep-learning based CTR models</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Hogwild!: A lock-free approach to parallelizing stochastic gradient descent</title>
		<author>
			<persName><forename type="first">B</forename><surname>Recht</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ré</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Wright</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Niu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">B</forename><surname>Rokh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Azarpeyvand</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khanteymoori</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.07877</idno>
		<title level="m">A comprehensive survey on model quantization for deep neural networks</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Towards efficient post-training quantization of pretrained language models</title>
		<author>
			<persName><forename type="first">H</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Shang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>King</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Lyu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="1405" to="1418" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Quantization and training of neural networks for efficient integer-arithmetic-only inference</title>
		<author>
			<persName><forename type="first">B</forename><surname>Jacob</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kligys</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Howard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Adam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kalenichenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE conference on computer vision and pattern recognition</title>
				<meeting>the IEEE conference on computer vision and pattern recognition</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2704" to="2713" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
