Leveraging Instrumental Variables in Online Advertising Auctions: Robust Click-Through-Rate Prediction

Ryohei Emori1,2,∗, Shinya Suzumura3, Nobuyuki Shimizu3 and Takahiro Hoshino1,2

1 Keio University, 2-15-45, Mita, Minato-ku, Tokyo, Japan
2 Riken AIP center, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, Japan
3 LY Corporation, Kioi Tower 1-3 Kioicho, Chiyoda-ku, Tokyo, Japan

AdKDD'24, 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 25–29, 2024, Barcelona, Spain
∗ Corresponding author.
Email: ryohey3569@keio.jp (R. Emori); ssuzumur@lycorp.co.jp (S. Suzumura); nobushim@lycorp.co.jp (N. Shimizu); hoshino@econ.keio.ac.jp (T. Hoshino)
ORCID: 0009-0003-1247-8327 (R. Emori)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Abstract

Predicting the click-through rate (CTR) in online ad auctions is essential for calculating bid amounts and forming rankings. However, predicting CTR from historical data faces several difficulties, one of which is the cold-start problem. Our research uses the instrumental variables (IVs) framework to address the cold-start problem and selection bias, validating robust CTR prediction in online advertising auctions. Although identifying IVs is notably challenging in most applications, their potential use is not limited to CTR prediction; they can also address practical issues and research questions in advertising auctions in general. We put forth bid amounts as IVs, discuss their validity, and test the robustness of IV-based predictions on both simulated and real data. Moreover, we enhance our methodology by integrating explicit interactions between bid amounts and other features, demonstrating that accounting for heterogeneity in IVs significantly improves prediction accuracy on real data. Our proposal of IVs and the refined CTR prediction approach enriches research on causal inference, robustness, and invariant prediction.

Keywords

Instrumental Variables, Omitted Variable Bias, Robustness, Cold-start Problem, Click-Through-Rate, Online Advertising Auction

1. Introduction

Online advertising, an essential backbone of the digital economy, relies heavily on accurate prediction models to allocate ads effectively and enhance the user experience. Crucially, the accuracy of click-through rate (CTR) prediction plays a pivotal role in determining the success of online advertising auctions in terms of welfare. At the same time, potential biases hover over these predictions and may skew the results [1, 2].

In addition to the problem of bias, which lurks in online ad auctions and is often the subject of research, the cold-start problem arises when we must make predictions for new advertisements or infrequent users, leading to decreased predictive accuracy. Against this backdrop, causal methods of predicting user behavior that capture invariant user behavior have become a subject of high research interest [3, 4, 5]. Among them, prior research [3] has highlighted that one of those causal methods, the instrumental variables (IVs) method, has the potential to contribute to solving the cold-start problem. [6] provided a methodology for IVs using neural networks, but specific IVs always need to be identified for each research domain. [7] uses the user's search query as an instrumental variable; their use of IVs is limited to search advertising and may not satisfy one of the conditions for IVs, the exclusion restriction.

In this paper, we identify bid amounts as IVs in online ad auction settings and demonstrate that click prediction using the IVs method is robust both in overall prediction and in cold-start problems.

Although IVs are generally considered difficult to identify, they have the potential to: 1) maximize the use of data, including impressions of ads with low historical win rates; 2) not require random impressions of ads; 3) avoid assumptions that often lead to erroneous predictions due to the unrealistic absence of unobserved confounding factors between treatment and outcome relationships [8]; and 4) potentially infer the causal effect of impressions on conversions as well as clicks.

Furthermore, we demonstrate that the explicit use of first-stage heterogeneity in the IVs method can be strongly recommended in online ad auctions [9, 10]. First-stage heterogeneity in the IVs method has been relatively overlooked compared to heterogeneity in the second stage, namely, user response. However, we find that increasing the association between IVs and impression probability yields robust predictions both overall and in the cold-start problem.

The paper makes three main contributions:

1. We identify and propose valid IVs tailored to online advertising auctions. The IVs suit broad advertising auction contexts, including display and search advertising. Furthermore, the IVs method is expected to have further applications, such as causal inference of medium- and long-term effects of ad impressions on conversions, not limited to causal effects on user click behavior in online ad auctions.
2. There have been few empirical examples in which the IVs method has been shown capable of making invariant behavioral predictions. We identify valid IVs for further application in online ad auctions, a setting in which the research field has broadened, and demonstrate the robustness of the IVs method's prediction accuracy for the overall forecast and the cold-start scenario in our experiments.
3. Notably, our research advances the concept of utilizing first-stage heterogeneity in the IVs method in the context of prediction. By considering heterogeneity in the strength of IVs with respect to impression probability, our method shows markedly more robust prediction performance, both overall and in the cold-start scenario.

2. Identification of Instrumental Variables in Ad Auctions

2.1. Ad Auctions and Biases

[Figure 1 (diagram): a loop linking Score Prediction, Ad Auction, Ad Impression / Ad Non-Impression, and User Response, annotated with Inductive Bias, Selection Bias, Exposure Bias, Popularity Bias, and Data Imbalance. Users' $(y_{click,j_i}, X_{j_i}, D_{j_i})$ and $(y_{conversion,j_i}, X_{j_i}, D_{j_i})$ are logged onto the platform's database. Advertisers manually set bid amounts conditional on $X_{j_i}$, or bids are set as target CPA$(X_{j_i})$ × pCVR$(y_{conversion,j_i} = 1 \mid X_{j_i}, D_{j_i} = 1)$. Auction Score$_{j_i}$ = Adjusted Bid$(X_{j_i}, D_{j_i} = 1)$ × pCTR$(y_{click,j_i} = 1 \mid X_{j_i}, D_{j_i} = 1)$, optionally + Adjusted Term$(X_{j_i})$.]

Figure 1: Inductive, Selection, Exposure, and Popularity Bias in Users' Click Behavior and Ad Auction System

Before we explain why bid amounts are IVs, we describe the ad auction setting, because it is essential to examine the actual flow of data generation to ascertain the IVs.

The notation used to describe the auction mechanism is as follows: the total number of auctions is $N$, the number of advertisers participating in auction $i \in \{1, \cdots, N\}$ is $m_i$, and an advertiser's advertisement is $j_i \in \{1, \cdots, m_i\}$. Let $\mathit{Bid}_{j_i}$ be the bid amount that the advertiser spends on ad $j_i$, $\mathit{pCTR}_{j_i}$ be the predictive click-through rate, and $j_i^*$ be the ad that wins the impression to the user in auction $i$. Also, $y_{j_i}$ is the outcome that is 1 if ad $j_i$ is clicked and 0 if not, and $X_{j_i}$ is the vector of variables used to target ads and users for ad $j_i$. To simplify complex effects such as position bias, we assume a setting where only one ad wins an impression. Therefore, let $D_{j_i}$ be a binary dummy that is 1 when $j_i = j_i^*$ and 0 otherwise. Because only the winning ad is shown, $y_{j_i}$ can equal 1 only for $j_i = j_i^*$.

Here, $\mathit{pCTR}_{j_i}$ is as follows:

\[ \mathit{pCTR}_{j_i} = p(y_{j_i} = 1 \mid D_{j_i} = 1, X_{j_i}), \]

where $\mathit{pCTR}_{j_i}$ is the probability that ad $j_i$ is clicked given that it wins the impression, conditional on the targeting and other variables.

In ad auctions, there can be various methods for determining auction scores. Here, for instance, the auction score is calculated as follows:

\[ \mathit{Auction\ Score}_{j_i} = \mathit{Bid}_{j_i} \times \mathit{pCTR}_{j_i}. \]

This determination scheme, which takes into account the bid amount and predictive CTR in the auction score, has been studied under the name "weighted GSP" [11, 12]. When the bid amount is a manual bid by the advertiser, it is generated from the distribution of bid amounts conditional on the target variables of the ad set by the advertiser. Alternatively, when the bid amount is an automated bid by the platform, it is generated from, for example, the predictive conversion rate (pCVR) and target CPA. In this case, $\mathit{pCVR}_{j_i}$ is a function of $X_{j_i}$. That is, bid amounts are generated from some distribution conditioned on the target variables of the ad set by the advertiser or on other variables used by the platform. Thus,

\[ \mathit{Bid}_{j_i} \sim F(X_{j_i}), \]

where $F(\cdot)$ is the generating distribution of bid amounts.

As summarized by [2], bias in recommendation systems is a looping process. Figure 1 depicts the loop of several interdependent biases in the ad auction setting. In particular, the auction score will be biased if the platform's prediction of the pCTR is a biased estimator. The same is true for pCVR and the adjustment term. The assignment of impressions by the biased auction score is as follows:

\[ j_i^* = \arg\max_{j_i \in \{1, \cdots, m_i\}} \mathit{Auction\ Score}^{\mathrm{biased}}_{j_i}. \]

2.2. Causal View of Online Ad Auctions

[Figure 2 (causal diagram): nodes $Z_{j_i}$ (Bid, continuous), $D_{j_i}$ (Impression, binary), $y_{j_i}$ (Click, binary), $X_{j_i}$ (Features), $\mathit{pCTR}_{j_i}$, and error terms $\epsilon_{j_i}$ and $\upsilon_{j_i}$; edges labeled Relevance, Exclusion Restriction, and Conditional Independence.]

Figure 2: Users' Click Behavior and Bid Amounts as Instrumental Variables in Ad Auctions

The treatment $D_{j_i}$, impressions in ad auctions, can easily be correlated with the error term for the unobserved heterogeneity of users' click behavior. This can be explicitly expressed in the pCTR formulation as follows:

\[ p(y_{j_i} = 1) := \theta^*(X_{j_i}, \eta_{j_i}, \epsilon_{j_i} \mid D_{j_i} = 1), \]

where $\epsilon_{j_i}$ represents the error term in the user's click response, and $\eta_{j_i}$ is the unobserved heterogeneity of click behavior that correlates with some or all of $X_{j_i}$ (the user and ad features) but cannot be observed, known as the omitted variable. $\theta^*(\cdot)$ is a function that returns the predictive probability that $y_{j_i} = 1$.

Treatments are determined in the auction system together with predicted values such as pCTR and pCVR, which are conditioned on the user and ad features involved in ad auctions, and the advertiser's bid amount. At this point, pCTR and pCVR are not conditioned on the omitted variables $\eta_{j_i}$, which generates a bias in the estimates of the predictive outcome. Since the bid amount is determined from predictions with this bias and an auction is then formed, there is a strong suspicion that the impressions $D_{j_i}$ are endogenous variables, that is, variables correlated with the error term, and that the omitted variable bias is amplified through the auction. We consider the assumption that no omitted variables exist to be a type of inductive bias, a convenient assumption for the pCTR model.

Unconfoundedness, i.e., a situation where no omitted variables exist, is a somewhat severe assumption for real-world data. Therefore, IVs methods that do not require the assumption of unconfoundedness can be compelling and valuable.

2.3. Validating Bid Amounts as IVs

There are three conditions that valid IVs satisfy. The first is the relevance of the IVs to a treatment variable. The second is an exclusion restriction, where the IVs do not directly affect the outcome but rather affect the outcome through the treatment variable. The third is the independence of the IVs with respect to the treatment and the outcome. Denoting the IVs vector for ad $j_i$ as $Z_{j_i}$ and combining these conditions, we can write them as follows:

\[ \mathit{Relevance}: \quad D_{j_i} \not\perp Z_{j_i}, \]
\[ \mathit{Exclusion\ Restriction}: \quad \{\epsilon_{j_i}, D_{j_i}\} \perp Z_{j_i}, \]
\[ \mathit{Conditional\ Independence}: \quad \epsilon_{j_i} \mid X_{j_i} \perp Z_{j_i}. \]

We argue that bid amounts are valid IVs in ad auctions. The reason bid amounts function as IVs is summarized in Figure 2 under our proposed IVs formulation.

With regard to the relevance between bid amounts and impressions, the relevance is explicitly acknowledged by the fact that the main item in the auction score is the bid amount. Concerning the exclusion restriction, the bid amount only influences impressions through the auction score. Therefore, the bid amount does not influence the user's click behavior. Conditional on the variables used by advertisers and platforms to set bid amounts, bid amounts are valid instruments.

2.4. Reasons Other Variables are Not Valid IVs

Here, we explain why other variables, such as bid times used for targeting, do not meet the conditions of an instrumental variable in ad auctions.

Relevance: Take targeting variables as an example. From the perspective of relevance, advertisers determine bid amounts based on the users they target, which should relate to the probability of assignment. Bid amounts influence the auction score directly, ensuring stronger relevance than targeting variables, while targeting variables have only an "indirect" relevance to the auction score.

Conditional Independence: The more crucial condition, however, is that targeting variables do not satisfy independence from the unobserved factors affecting the user's probability of clicking. For instance, consider bid times as one of the targeting variables. The time when a user requests an advertisement, that is, the user's visitation process, and the probability of clicking the ad can be related. Users visiting at 10 AM may have a higher or lower probability of clicking an ad, and even when conditioning on other targeting variables, the presence of unobserved factors makes it impossible to guarantee the independence of bid times from the click probability. On the other hand, the probability that a user will click is considered independent of the bid amount, conditional on the targeting variables, since the user cannot know how much was paid for the specific advertisement at the time of the click.

Exclusion Restriction: From the perspective of the exclusion restriction, targeting variables affect the probability of a user's click, and there is no guarantee that their influence on the click probability is exerted solely through the assignment of impressions.

3. Click Prediction with First-stage IVs Heterogeneity

In the methodology section, we propose several variants of the IVs method to examine the following questions:

• Q.1 Do prediction methods using simple neural networks with IVs perform well in the online ad auction setting?
• Q.2 Is IVs heterogeneity strongly present in online ad auction settings, and is explicitly addressing it effective in prediction?
• Q.3 Heterogeneity in treatment effects is widely known, but how much does it improve prediction relative to accounting for heterogeneity in IVs?

To introduce models that respond to those questions, the methodology section is organized as follows. For Q.1, we first introduce the basic structure of the nonparametric IVs method and highlight its heterogeneous relevance to the probability of winning impressions in ad auctions. Next, for Q.2, we present a method based on an attention network that explicitly considers interactions between IVs and the other features. Finally, for Q.3, we explicitly incorporate heterogeneity in click probabilities by employing an interaction structure similar to the one used for the heterogeneity of instrumental variables. Figure 3 summarizes our proposed final IVs method. For simplicity in subscripting the training data, $l$ corresponds to the record number in this section.

[Figure 3 (network diagram): a first stage in which the IV $Z$ and the features pass through an attention network (leveraging interactions) and a neural network with a sigmoid output to give pIMP, and a second stage in which pIMP and the features pass through an attention network (leveraging interactions) and a neural network with a sigmoid output to give $y_{click}$; both stages are trained by multi-task learning.]

Figure 3: IV-IMP Approach Leveraging First- and Second-stage Heterogeneity with Multi-task Learning Structure

3.1. First-stage IVs Heterogeneity in Ad Auctions

In principle, we can estimate a user's click response $y_l$ using IVs in a two-stage approach. Following the nonparametric IVs notation of [13], the incorporation of heterogeneity in the first stage can be written as follows:

\[ p(y_l = 1) = \phi^*(X_l, p(Z_l, X_l), \epsilon_l), \]
\[ p(Z_l, X_l) = p(D_l = 1 \mid X_l, Z_l), \]

where $p(Z_l, X_l)$ is an instrument summarized by the interaction of multiple IVs. We assume that $D_l$ depends on $X_l$ only through $p(Z_l, X_l)$ and call this the first stage. $\phi^*$ is a function that returns the predictive probability of the event $y_l = 1$, which we call the second stage. In ad auctions, $p(Z_l, X_l)$ is the predicted impression probability, henceforth $\mathit{pIMP}$, which fits a multi-task learning framework and can be trained in one step together with $\mathit{pCTR}$. Using neural networks, a layer structure can be used that follows this simplified form of the IVs method, which we henceforth refer to as the IV-BS approach.

Although there can be several approaches to incorporating interactions between features and IVs, we use an attention network, because it suffices for validating the idea of bid amount heterogeneity.

3.2. Leveraging First-Stage IVs by Interactions

Given a dataset, let the input feature matrix be represented as $K$ after passing through an input layer where all units are fully connected, including units from $\mathit{pIMP}$ and features. Let $B$ denote the batch size and $L$ the number of units in the input layer, so that $K$ has dimensions $B \times L$. The instrumental variable, represented as matrix $Z$, has dimensions $B \times 1$. To align with the shape of $K$, matrix $Q^{\mathrm{iv}}$ is formed by performing a tiling operation on $Z$. Specifically, each row of $Z$ is replicated according to the number of columns in $K$. Furthermore, the weight matrix for the IVs interaction is denoted as $W^{\mathrm{iv}}$ and has dimensions $L \times L$. Using these matrices, the attention score $\alpha^{\mathrm{iv}}$ is calculated as:

\[ \alpha^{\mathrm{iv}} = \mathrm{Softmax}(W^{\mathrm{iv}} (Q^{\mathrm{iv}} \odot K) + b^{\mathrm{iv}}). \]

Here, we use the swish function as an activation function in the weight matrix $W^{\mathrm{iv}}$ so as to represent the non-linear strength in the heterogeneity of bid amounts. We feed the element-wise products as interactions into the fully connected layer with the softmax function as the activation function to generate the attention score $\alpha^{\mathrm{iv}}$. Then, we obtain the representation $g^{\mathrm{iv}}$ by the element-wise product of the input layer $K$ and the generated attention scores $\alpha^{\mathrm{iv}}$:

\[ g^{\mathrm{iv}} = \alpha^{\mathrm{iv}} \odot K. \]

We combine the representation $g^{\mathrm{iv}}$ obtained by the attention layer and the input features in a fully connected neural network to form the hidden layer.

3.3. Second-stage Heterogeneity

In the second stage, namely on the $\mathit{pCTR}$ side, it is evident that heterogeneity exists in the effect of impressions when conditioning on user and advertisement features. Just as we took the element-wise product of bid amounts and feature units in the input layer in the first stage, we symmetrically do the same in the second stage. The input layer consists of fully connected units from $\mathit{pIMP}$ and features. The structure of the entire network including $\mathit{pIMP}$ and $\mathit{pCTR}$ is drawn in Figure 3. The attention score and representation can be written as follows:

\[ \alpha^{\mathrm{imp}} = \mathrm{Softmax}(W^{\mathrm{imp}} (Q^{\mathrm{imp}} \odot K) + b^{\mathrm{imp}}), \]
\[ g^{\mathrm{imp}} = \alpha^{\mathrm{imp}} \odot K, \]

where $Q^{\mathrm{imp}}$ is formed by performing a tiling operation on $\mathit{pIMP}$ to align with the shape of $K$. Specifically, each row of $\mathit{pIMP}$ is replicated according to the number of columns in $K$. $W^{\mathrm{imp}}$ is an $L \times L$ weight matrix for the $\mathit{pIMP}$ interaction.

3.4. Loss Function for Multi-task Learning

In the multi-task learning framework for pIMP and pCTR, we adjust the loss function for pCTR by applying sample weights through an indicator function, $1_{\{D_l = 1\}}$:

\[ \mathit{Loss}_{pCTR} = \mathit{Loss}_{pCTR} \times 1_{\{D_l = 1\}}. \]

This ensures that $\mathit{Loss}_{pCTR}$ is computed only for data points with impressions, i.e., when $D_l = 1$, so that instances without impressions do not affect the pCTR loss. This approach allows us to concentrate on the model's performance in predicting CTR.

4. Experiments

The experimental section is divided into two parts: simulation, and evaluation in scenarios approximating the cold-start problem with real datasets. The code for replication is available at the following link: https://github.com/ryohei-emori/NPIV-pCTR. Please note that the repository excludes sections related to private data.

The notation is consistent with that used in Section 3.

4.1. Simulated Datasets

Algorithm 1: Simulating auction data and validating baselines

    1. Initializing parameters:
       Set parameters $(\alpha, \beta, \gamma)$
       $k := 0$
       while $k < 5{,}000$ do
           Generate $X_k$ and $\eta_k$
           $D_k \sim \mathrm{Bernoulli}(p_{D_k})$, where $p_{D_k} = \mathrm{Logistic}(X_k' \alpha + \eta_k)$
           if $D_k = 1$ then
               $y_k \sim \mathrm{Bernoulli}(p_{y_k})$, where $p_{y_k} = \mathrm{Logistic}(X_k' \beta + \eta_k)$
               $k := k + 1$
           end if
       end while
       Train pCTR: $p(y_k = 1 \mid D_k = 1) := \theta(X_k)$
    2. Generating historical auction data:
       for each auction $i$ in $5{,}000$ do
           $m_i = 20$
           Generate $X_{j_i}$ and $\eta_{j_i}$
           $\mathit{Bid}_{j_i} \sim \mathrm{Beta}(\mu, 2)$ by [14], where $\mu := \mathrm{Logistic}(X_{j_i}' \gamma)$
           $\mathit{pCTR}_{j_i} = \theta(X_{j_i})$
           $j_i^* := \arg\max_{j_i \in \{1, \cdots, m_i\}} \mathit{Auction\ Score}_{j_i}$, where $\mathit{Auction\ Score}_{j_i} := \mathit{Bid}_{j_i} \times \mathit{pCTR}_{j_i}$
           $y_{j_i} \sim \mathrm{Bernoulli}(p_{j_i})$ and $D_{j_i} = 1$ if $j_i = j_i^*$, where $p_{j_i} = \mathrm{Logistic}(X_{j_i}' \beta + \eta_{j_i})$
           $y_{j_i} = 0$ and $D_{j_i} = 0$ otherwise
       end for
    3. Learning $\mathit{pCTR}$ with historical data:
       $\{(y_{j_i}, X_{j_i}, \mathit{Bid}_{j_i}, D_{j_i}),\ j_i = 1, \cdots, m_i,\ i = 1, \cdots, 5{,}000\}$
    4. Validating $\mathit{pCTR}$ with independently displayed data:
       $\{(y_l, X_l, D_l = 1),\ l \in \{1, \cdots, 50{,}000\}\}$,
       where $y_l \sim \mathrm{Bernoulli}(p_l)$, $p_l = \mathrm{Logistic}(X_l' \beta + \eta_l)$, with generated $X_l$ and $\eta_l$.

The procedures for simulating the auction data are summarized in Algorithm 1, aligning with the procedure and notation in section 3.1. The experiment is replicated 20 times. The subscripts $k$ and $l$ correspond to the record numbers in steps 1 and 4, respectively. $\theta(X_k)$ is learned by logistic regression. We use the Beta distribution for generating bid amounts, which satisfies the non-negativity constraint. Specifically, we use the reparametrized Beta distribution of [14] to model the mean of bid amounts. For simplicity, the number of advertisers $m_i$ participating in auction $i$ is fixed, but in reality, it may vary depending on the attractiveness of users, represented by $X_{j_i}$. The link function $\mathrm{Logistic}(\cdot)$ is defined as $(1 + \exp(-\cdot))^{-1}$. The feature vectors $X_k$, $X_{j_i}$, and $X_l$ are each $25 \times 1$ vectors. Each element $X_{s,k}$ is drawn from a specific distribution: $\mathrm{Uniform}[-5, 5]$ for $s \in \{1, \cdots, 10\}$, $\mathrm{Bernoulli}(0.5)$ for $s \in \{11, \cdots, 20\}$, and $\mathrm{Uniform}[-2, 2]$ for $s \in \{21, \cdots, 25\}$. $X_{j_i}$ and $X_l$ are generated in the same way. The omitted variables $\eta_k$, $\eta_{j_i}$, and $\eta_l$ are generated from a $\mathrm{Uniform}[-5, 5]$ distribution. The parameters $\alpha$, $\beta$, and $\gamma$ are coefficient vectors with $25 \times 1$ elements each, independently generated from a normal distribution with a mean of 0.1 and a variance of 1.

We assume that rare ads and users have more prominent unobserved confounding factors, and thus evaluate the predictive CTR by the magnitude of the omitted variable values. The test data is therefore separated by the distance of $\eta_l$ from the mean. Out of a total of 50,000 records, we move the outside-quantile cutoffs of the distribution of $\eta_l$ in steps of 10% on each side.

4.2. Real Datasets

The real dataset consists of user responses to advertisements displayed on websites such as Yahoo! JAPAN, operated by LY Corporation, and auction history records including bidding. The datasets are divided into a training dataset, in which ad impressions and clicks are observed through ad auctions, and a test dataset, in which ad impressions are randomly made to visiting users.

4.2.1. Training data

The training data covers a sample of 50,000 records randomly drawn from the population for a past seven-day period. The training data were generated by the ad auction system, which produces data that do not satisfy conditional independence between the treatment $D_{j_i}$ and the unobserved confounders $\epsilon_{j_i}$.

4.2.2. Test data

The prediction baselines are evaluated on test data from the day after the seven days of training data. The test dataset consists of all independently displayed records conditional on ads' targeting variables.

To evaluate the model's performance in cold-start scenarios, the test data was divided based on previous ad impressions. Specifically, the data was split into 20 subsets at every 5% quantile, with each subset containing data points below the respective quantile. To ensure sufficient sample size, the test data included 2,000,000 records. Predicting clicks with more past impressions is generally easier, even with a simple baseline.

4.3. Evaluation Score

We used log loss, known as a standard evaluation metric for pCTR, and the area under the curve (AUC). AUC is a suitable metric for evaluating rankings, as it assesses the ability to predict the correct position in auction rankings. For the simulation data, we employ both the actual scores and the relative scores to compare improvements. For our real dataset, we present relative evaluation scores due to confidentiality. The relative scores are defined as follows:

\[ \text{Relative LogLoss} = \frac{\text{Naive LogLoss} - \text{Compared LogLoss}}{\text{Naive LogLoss}} \times 100, \]
\[ \text{Relative AUC} = \left( \frac{\text{Compared AUC} - 0.5}{\text{Naive AUC} - 0.5} - 1 \right) \times 100. \]

4.4. Ablation studies

To evaluate our proposed methods with instrumental variables, we used a naive benchmark and comparative baselines.

1. Naive: The Naive model has three hidden layers between the input layer of features and the sigmoid function, building a pCTR model. Each of these hidden layers consists of 256 units. The first layer uses the swish activation function, while the second and third layers use the ReLU activation function.
2. IV-BS: The baseline is described in section 3.1. Its pCTR model has the same network structure as Naive, including $\mathit{pIMP}$ in the input layer.
3. IV-FS: The baseline is described in section 3.2. On the $\mathit{pCTR}$ side, it has the same network structure as IV-BS.
4. IV-SSFS: The baseline on the $\mathit{pCTR}$ side is described in section 3.3, while its network has the same structure as IV-FS on the $\mathit{pIMP}$ side.
5. UBIPS: It consists of $\mathit{pIMP}$ times $\mathit{pCTR}$ for the unbiased inverse propensity weighting estimator [15]. Its network structure is consistent with IV-BS for $\mathit{pIMP}$ and $\mathit{pCTR}$, except that $\mathit{pIMP}$ is excluded from the input of $\mathit{pCTR}$. It also uses a multi-task framework.

IV-FS and IV-SSFS are not tested on our simulated dataset for two reasons. First, IV-BS is sufficient to test whether bid amounts are efficient and valid IVs in ad auctions. Second, those approaches are not suited to the simplicity of our simulated dataset, in which the heterogeneity of IVs and the user's click probability involve only simple structures such as linear interactions.

In these experiments, the loss function is unified across the comparative baselines. The $\mathit{pCTR}$ and $\mathit{pIMP}$ models both use binary cross entropy as their loss function. We trained the comparison models until convergence, where no further improvement in the $\mathit{pCTR}$ loss was observed. For all comparative approaches, the optimization method was Adamax, and the learning rate was fixed at 0.001.

4.5. Comparing the Baselines

[Figure 4 (plots): (a) LogLoss & Relative LogLoss, (b) AUC & Relative AUC. x-axis: Outside Quantiles of $\eta_l$ in Users' Click Response (10 to 100); y-axes: LogLoss and Relative LogLoss Improvement (%), AUC and Relative AUC Improvement (%). Series: Naive, IV-BS, UBIPS.]

Figure 4: Simulation: Performance scores at each outside quantile of $\eta_l$. Box plots show actual scores. Line plots show relative scores, with the bold line as the mean and shaded area showing replication variation.

4.5.1. In Simulated datasets

Figure 4 shows that IV-BS improves AUC and LogLoss performance even with omitted variables. IV-BS remains stable and robust, especially on the left side where the test data's $\eta_l$ value is high. Notably, omitted variable bias cannot be ignored even in the weighted GSP impression assignment algorithm, and in this regard, IV-BS demonstrates superior performance.

[Figure 5 (plots): (a) Relative LogLoss, (b) Relative AUC. x-axis: Outside Quantiles of Number of Previous Ad Impression (0 to 100); y-axes: Relative LogLoss Improvements (%), Relative AUC Improvements (%). Series: Naive, IV-BS, IV-FS, IV-SSFS, UBIPS.]

Figure 5: Real data: Performance scores at each quantile of previous ad impressions.

4.5.2. In Real dataset

An evaluation of our proposed methods on the real dataset is shown in Figure 5. It is expected that Naive performs relatively well, since the training data includes many ads with numerous impressions. However, our proposed methods, IV-BS, IV-FS, and IV-SSFS, show significant improvement in relative AUC, particularly for ads with few previous impressions. The improvement of UBIPS over Naive, unlike in the simulation experiment, is likely attributable to the confounder being associated with variables observed in the real data.

Improvement for ads with few impressions matches that for ads with many, likely due to the infrequent inclusion of rare ads in training data, causing popularity bias. Notably, the increasing improvement of IVs methods for the 0–20 quantile of previous impressions demonstrates their robustness in predicting rare ads.

5. Conclusion

This paper argues that the bid amount is a valid instrumental variable under the assumption of conditional independence, and tests its validity by applying it to predictive CTR. Our experiment on a real dataset showed that explicitly accounting for heterogeneity in the strength of IVs allows for efficient and robust predictions. For greater extensibility, incorporating complex interactions between IVs and other features with more developed approaches, such as graph neural networks, is recommended. Additionally, addressing other looping biases and validating prediction methods in repeated auctions would be valuable.

References

[1] V. Marotta, Y. Wu, K. Zhang, A. Acquisti, The welfare impact of targeted advertising technologies, Information Systems Research 33 (2022) 131–151. doi:10.1287/isre.2021.1024.
[2] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He, Bias and debias in recommender system: A survey and future directions, ACM Transactions on Information Systems 41 (2023) 1–39.
[3] P. Bühlmann, Invariance, causality and robustness, Statistical Science 35 (2020) 404–426.
[4] Y. He, Z. Wang, P. Cui, H. Zou, Y. Zhang, Q. Cui, Y. Jiang, Causpref: Causal preference learning for out-of-distribution recommendation, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 410–421.
[5] A. Feder, G. Horowitz, Y. Wald, R. Reichart, N. Rosenfeld, In the eye of the beholder: Robust prediction with causal user modeling, Advances in Neural Information Processing Systems 35 (2022) 14419–14433.
[6] J. Hartford, G. Lewis, K. Leyton-Brown, M. Taddy, Deep IV: A flexible approach for counterfactual prediction, in: International Conference on Machine Learning, PMLR, 2017, pp. 1414–1423.
[7] Z. Si, X. Han, X. Zhang, J. Xu, Y. Yin, Y. Song, J.-R. Wen, A model-agnostic causal learning framework for recommendation using search data, in: Proceedings of the ACM Web Conference 2022, WWW '22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 224–233. URL: https://doi.org/10.1145/3485447.3511951. doi:10.1145/3485447.3511951.
[8] G. W. Imbens, Instrumental variables: An econometrician's perspective, Statistical Science 29 (2014) 323–358. URL: http://www.jstor.org/stable/43288511.
[9] A. Belloni, D. Chen, V. Chernozhukov, C. Hansen, Sparse models and methods for optimal instruments with an application to eminent domain, Econometrica 80 (2012) 2369–2429.
[10] A. Abadie, J. Gu, S. Shen, Instrumental variable estimation with first-stage heterogeneity, Journal of Econometrics (2023) 105425.
[11] D. R. Thompson, K. Leyton-Brown, Revenue optimization in the generalized second-price auction, in: Proceedings of the Fourteenth ACM Conference on Electronic Commerce, 2013, pp. 837–852.
[12] Y. Sun, Y. Zhou, X. Deng, Optimal reserve prices in weighted GSP auctions, Electronic Commerce Research and Applications 13 (2014) 178–187. URL: https://www.sciencedirect.com/science/article/pii/S1567422314000106. doi:10.1016/j.elerap.2014.02.003.
[13] M. Frolich, Nonparametric IV estimation of local average treatment effects with covariates, Journal of Econometrics 139 (2007) 35–75.
[14] S. Ferrari, F. Cribari-Neto, Beta regression for modelling rates and proportions, Journal of Applied Statistics 31 (2004) 799–815.
[15] Y. Saito, S. Yaginuma, Y. Nishino, H. Sakata, K. Nakata, Unbiased recommender learning from missing-not-at-random implicit feedback, in: Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 501–509. URL: https://doi.org/10.1145/3336191.3371783. doi:10.1145/3336191.3371783.
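As an illustration of the attention step in Section 3.2, the following is a minimal NumPy sketch of the tiling, element-wise interaction, softmax, and re-weighting. Several points are assumptions made only for this sketch and are not taken from the paper: the function names `softmax` and `iv_attention` are hypothetical, the product is written in row-vector form $(Q^{\mathrm{iv}} \odot K) W^{\mathrm{iv}}$, the softmax is taken across the $L$ input units, the swish activation is omitted, and the weights are random rather than trained.

```python
import numpy as np


def softmax(a, axis=-1):
    """Numerically stable softmax along the given axis."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def iv_attention(K, Z, W, b):
    """Hypothetical helper: attention scores and representation from IV-feature interactions.

    K: (B, L) input-layer units, Z: (B, 1) instrument (bid amount),
    W: (L, L) interaction weights, b: (L,) bias.
    """
    B, L = K.shape
    Q = np.tile(Z, (1, L))               # replicate each row of Z across the L columns of K
    interactions = Q * K                 # element-wise product, shape (B, L)
    # Assumption: row-vector convention and softmax over the L units of each record.
    alpha = softmax(interactions @ W + b, axis=1)
    g = alpha * K                        # representation g = alpha ⊙ K
    return alpha, g


# Toy usage with random, untrained values.
rng = np.random.default_rng(0)
B, L = 4, 6
K = rng.normal(size=(B, L))
Z = rng.uniform(size=(B, 1))
W = rng.normal(size=(L, L))
b = np.zeros(L)
alpha, g = iv_attention(K, Z, W, b)
```

The same function can stand in for the second-stage interaction of Section 3.3 by passing the predicted impression probability (a `(B, 1)` array) in place of `Z`.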
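The masking in Section 3.4 can be sketched as follows. This is only a sketch under stated assumptions: the paper does not say how the pIMP and pCTR losses are combined or normalized, so the unweighted sum and the division by the number of impressions are choices made here, and the names `bce` and `multitask_loss` are hypothetical.

```python
import numpy as np


def bce(target, prob, eps=1e-7):
    """Per-record binary cross entropy; probabilities are clipped for stability."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))


def multitask_loss(y, d, p_ctr, p_imp):
    """Hypothetical combined loss for the pIMP and pCTR heads.

    y: clicks, d: impression indicator D_l, p_ctr / p_imp: predicted probabilities.
    """
    loss_imp = bce(d, p_imp).mean()      # first stage uses every record
    masked = bce(y, p_ctr) * d           # indicator 1{D_l = 1} zeroes out non-impressions
    # Assumption: average over impressed records only, then add the two task losses.
    loss_ctr = masked.sum() / max(d.sum(), 1.0)
    return loss_imp + loss_ctr


y = np.array([1.0, 0.0, 0.0, 0.0])
d = np.array([1.0, 1.0, 0.0, 0.0])
p_imp = np.array([0.8, 0.6, 0.3, 0.1])
loss_a = multitask_loss(y, d, np.array([0.7, 0.2, 0.5, 0.5]), p_imp)
# Changing pCTR only where D_l = 0 leaves the loss unchanged.
loss_b = multitask_loss(y, d, np.array([0.7, 0.2, 0.9, 0.01]), p_imp)
```

Because the pCTR term is multiplied by $D_l$, records without impressions contribute nothing to it, which is the behavior the indicator function is meant to enforce.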
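Step 2 of Algorithm 1 (generating historical auction data) can be sketched in NumPy as below. The sketch rests on assumptions that go beyond the text: $\mathrm{Beta}(\mu, 2)$ is read as the mean-precision parametrization of [14], i.e. shape parameters $(2\mu, 2(1-\mu))$; the fitted logistic regression $\theta$ from step 1 is replaced by a caller-supplied `pctr_fn`; and the names `draw_features` and `simulate_auctions` are hypothetical.

```python
import numpy as np


def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))


def draw_features(rng, n):
    """25 features per record: 10 Uniform[-5,5], 10 Bernoulli(0.5), 5 Uniform[-2,2]."""
    wide = rng.uniform(-5.0, 5.0, size=(n, 10))
    binary = rng.binomial(1, 0.5, size=(n, 10)).astype(float)
    narrow = rng.uniform(-2.0, 2.0, size=(n, 5))
    return np.hstack([wide, binary, narrow])


def simulate_auctions(rng, n_auctions, m, beta, gamma, pctr_fn, precision=2.0):
    """Hypothetical helper: one winner per auction under Bid x pCTR scoring."""
    n = n_auctions * m
    X = draw_features(rng, n)
    eta = rng.uniform(-5.0, 5.0, size=n)           # omitted variable
    # Assumption: Beta(mu, 2) means mean mu and precision 2 (shape = mu*phi, (1-mu)*phi).
    mu = np.clip(logistic(X @ gamma), 1e-6, 1.0 - 1e-6)
    bid = rng.beta(mu * precision, (1.0 - mu) * precision)
    score = (bid * pctr_fn(X)).reshape(n_auctions, m)
    winner = score.argmax(axis=1)                  # j_i^* for each auction
    D = np.zeros((n_auctions, m), dtype=int)
    D[np.arange(n_auctions), winner] = 1
    D = D.ravel()
    # Clicks depend on eta, which the platform's pCTR does not observe.
    p_click = logistic(X @ beta + eta)
    y = rng.binomial(1, p_click) * D               # y = 0 whenever D = 0
    return X, bid, D, y


rng = np.random.default_rng(1)
beta = rng.normal(0.1, 1.0, size=25)
gamma = rng.normal(0.1, 1.0, size=25)
# Assumption: a stand-in platform model that ignores eta, in place of the fitted theta.
X, bid, D, y = simulate_auctions(rng, 50, 20, beta, gamma,
                                 lambda X: logistic(X @ beta))
```

The stand-in `pctr_fn` omits $\eta$ by construction, so the winning ads are selected by a score that is biased in the sense discussed in Section 2.2.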
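The relative scores of Section 4.3 reduce to two short functions. The function names and the numeric inputs below are hypothetical and serve only to show the arithmetic; they are not results from the paper.

```python
import math


def relative_logloss(naive, compared):
    """Percentage reduction in log loss relative to the Naive baseline."""
    return (naive - compared) / naive * 100.0


def relative_auc(naive, compared):
    """Percentage gain in AUC above chance (0.5) relative to the Naive baseline."""
    return ((compared - 0.5) / (naive - 0.5) - 1.0) * 100.0


# Hypothetical scores: a log loss of 0.40 against a Naive log loss of 0.50
# is a relative improvement of about 20%; an AUC of 0.75 against 0.70 is about 25%.
rel_ll = relative_logloss(0.50, 0.40)
rel_auc = relative_auc(0.70, 0.75)
```

Both scores are zero when the compared model matches Naive and positive when it improves on it; subtracting 0.5 in the AUC ratio measures gains relative to a random ranking.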