Leveraging Instrumental Variables in Online Advertising Auctions: Robust Click-Through-Rate Prediction
Ryohei Emori1,2,∗, Shinya Suzumura3, Nobuyuki Shimizu3 and Takahiro Hoshino1,2
1 Keio University, 2-15-45, Mita, Minato-ku, Tokyo, Japan
2 Riken AIP center, 1-4-1 Nihonbashi, Chuo-ku, Tokyo, Japan
3 LY Corporation, Kioi Tower 1-3 Kioicho, Chiyoda-ku, Tokyo, Japan


Abstract
Predicting the click-through rate (CTR) in online ad auctions is essential for calculating bid amounts and forming rankings. However, predicting CTR from historical data faces several difficulties, one of which is the cold-start problem. Our research uses the instrumental variables (IVs) framework to address the cold-start problem and selection bias, and validates robust CTR prediction in online advertising auctions. Although identifying IVs is generally challenging, their potential use is not limited to CTR prediction; they can also be used to address practical issues and research questions in advertising auctions in general. We put forth bid amounts as IVs, discuss their validity as IVs, and test the robustness of IV-based predictions on both simulated and real data. Moreover, we enhance our methodology by integrating explicit interactions between bid amounts and other features, demonstrating that accounting for heterogeneity in IVs significantly improves prediction accuracy on real data. Our proposed IVs and the refined CTR prediction approach enrich the research fields of causal inference robustness and invariant prediction.

Keywords
Instrumental Variables, Omitted Variable Bias, Robustness, Cold-start Problem, Click-Through-Rate, Online Advertising Auction


1. Introduction

Online advertising, an essential backbone of the digital economy, relies heavily on accurate prediction models to allocate ads effectively and enhance the user experience. Crucially, the accuracy of click-through rate (CTR) prediction plays a pivotal role in determining the success, in terms of welfare, of online advertising auctions, while remaining exposed to potential biases that may skew results [1, 2].

In addition to the problem of bias that lurks in some online ad auctions and is often the subject of research, the cold-start problem arises when we must make predictions for new advertisements or infrequent users, leading to decreased predictive accuracy. Against the backdrop of problems arising from those various factors, causal methods of predicting user behavior that capture invariant user behavior have become a subject of high research interest [3, 4, 5]. Among them, prior research [3] has highlighted that one of those causal methods, the instrumental variables (IVs) method, has the potential to contribute to solving the cold-start problem. [6] provided a methodology for IVs using neural networks, but specific IVs always need to be identified in a specific research domain. [7] uses the user's search query as an instrumental variable; their use of IVs is limited to search advertising and may not satisfy one of the conditions for IVs, the exclusion restriction.

In this paper, we identify bid amounts as IVs in online ad auction settings and demonstrate that click prediction using the IVs method is robust both in overall prediction and in the cold-start problem.

Although IVs are generally considered difficult to identify, they have the potential to: 1) maximize the use of data, including impressions of ads with low historical win rates; 2) not require random impressions of ads; 3) avoid assumptions that often lead to erroneous predictions due to the unrealistic absence of unobserved confounding factors between treatment and outcome relationships [8]; and 4) potentially infer the causal effect of impressions on conversion as well as clicks.

Furthermore, we demonstrate that the explicit use of first-stage heterogeneity in the IVs method can be strongly recommended in online ad auctions [9, 10]. First-stage heterogeneity in the IVs method has been relatively overlooked compared to heterogeneity in the second stage, namely, user response. However, we find that increasing the association between IVs and impression probability yields robust predictions for both the overall prediction and the cold-start problem.

The contributions of the paper are threefold:

   1. We identify and propose valid IVs tailored to online advertising auctions. The IVs suit broad advertising auction contexts, including display and search advertising. Furthermore, the IVs method is expected to have further applications, such as causal inference of medium- and long-term effects of ad impressions on conversions, not limited to causal effects on user click behavior in online ad auctions.
   2. There have been few empirical examples in which the IVs method has been demonstrated to be capable of making invariant behavioral predictions. We identify valid IVs for further application in the setting of online ad auctions, a setting in which the research field has been broadened, and demonstrate the robustness of the IVs method's prediction accuracy for the overall forecast and the cold-start scenario in our experiments.
   3. Notably, our research advances the concept of utilizing first-stage heterogeneity in the IVs method in the context of prediction. By considering heterogeneity in the strength of IVs concerning impression probability, our method shows significantly more robust prediction performance in both overall prediction and the cold-start scenario.

AdKDD'24 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 25–29, 2024, Barcelona, Spain
∗ Corresponding author.
ryohey3569@keio.jp (R. Emori); ssuzumur@lycorp.co.jp (S. Suzumura); nobushim@lycorp.co.jp (N. Shimizu); hoshino@econ.keio.ac.jp (T. Hoshino)
ORCID: 0009-0003-1247-8327 (R. Emori)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073
2. Identification of Instrumental Variables in Ad Auctions

2.1. Ad Auctions and Biases

[Figure 1 (diagram): a loop linking User Response, Data Imbalance, Score Prediction, Ad Auction, and Ad Impression / Ad Non-Impression, annotated with Popularity Bias, Exposure Bias, Selection Bias, and Inductive Bias. Users' click and conversion outcomes, together with $X_{j_i}$ and $D_{j_i}$, are logged onto the platform's database. Advertisers manually set bid amounts conditional on $X_{j_i}$, or the bid is target CPA($X_{j_i}$) × pCVR. The auction score shown is Adjusted Bid × pCTR, plus an Adjusted Term($X_{j_i}$) marked "Optionally".]

Figure 1: Inductive, Selection, Exposure, and Popularity Bias in Users' Click Behavior and Ad Auction System

Before explaining why bid amounts serve as IVs, we describe the setting in ad auctions, because it is essential to examine the actual flow of data generation to ascertain the IVs.

The notations used to describe the auction mechanism are as follows: the total number of auctions is $N$, the number of auctioneers participating in auction $i \in \{1, \dots, N\}$ is $m_i$, and the auctioneer's advertisement is $j_i \in \{1, \dots, m_i\}$. Let $Bid_{j_i}$ be the bid amount that the auctioneer spends on the ad $j_i$, $pCTR_{j_i}$ be the predictive click-through-rate, and $j_i^*$ be the ad that wins an impression to the user in the auction $i$. Also, $y_{j_i}$ is the outcome that is 1 if ad $j_i$ is clicked and 0 if not, and $X_{j_i}$ is a vector of variables used to target ads and users in ad $j_i$. To simplify complex effects such as position bias, we assume a setting where there is only one ad that wins an impression. Therefore, let $D_{j_i}$ be a binary dummy that is 1 when $j_i = j_i^*$ and 0 otherwise. Accordingly, $y_{j_i}$ equals 1 only if the winning ad $j_i^*$ is clicked, and 0 otherwise.

Here, $pCTR_{j_i}$ is defined as follows:

$$pCTR_{j_i} = p(y_{j_i} = 1 \mid D_{j_i} = 1, X_{j_i}),$$

where $pCTR_{j_i}$ is the probability that ad $j_i$ will be clicked given that it wins the impression, the targeting variables, and other variables.

In ad auctions, there can be various methods for determining auction scores. Here, for instance, the auction score is calculated as follows:

$$\text{Auction Score}_{j_i} = Bid_{j_i} \times pCTR_{j_i}.$$

This determination scheme, which takes into account bid amount and predictive CTR in the auction score, has been studied under the name "weighted GSP" [11, 12]. When the bid amount is a manual bid by the auctioneer, it is generated from the distribution of bid amounts conditional on the target variable of the ad set by the auctioneer. Alternatively, when the bid amount is an automated bid by the platform, the bid amount is generated by, for example, predictive conversion rate (pCVR) and target CPA. In this case, $pCVR_{j_i}$ is a function of $X_{j_i}$. That is, bid amounts are generated from some distribution conditioned on the target variables of the ad set by the auctioneer or other variables used by the platform. Thus,

$$Bid_{j_i} \sim F(X_{j_i}),$$

where $F(\cdot)$ is the distribution that generates bid amounts.

As summarized by [2], bias in the recommendation system is a looping process. Figure 1 depicts the loop of several interdependent biases, focusing on the ad auction setting. In particular, the auction score will be biased if the platform's prediction of the pCTR is a biased estimator. The same is true for pCVR and the adjusted term. The assignment of impressions by the biased auction score is as follows:

$$j_i^* = \arg\max_{j_i \in \{1, \dots, m_i\}} \text{Auction Score}_{j_i}^{\text{biased}}.$$

2.2. Causal View of Online Ad Auctions

[Figure 2 (causal diagram): nodes are the click $y_{j_i}$ (binary), the impression $D_{j_i}$ (binary), the bid $Z_{j_i}$ (continuous), the features $X_{j_i}$, $pCTR_{j_i}$, and the error terms $\varepsilon_{j_i}$ and $\upsilon_{j_i}$; the annotations are Relevance, Exclusion Restriction, and Conditional Independence.]

Figure 2: Users' Click Behavior and Bid Amounts as Instrumental Variables in Ad Auctions

Treatment $D_{j_i}$, impressions in ad auctions, can be easily correlated with the error term for the unobserved heterogeneity of users' click behavior. This can be explicitly expressed in the pCTR formulation as follows:

$$p(y_{j_i} = 1) := \theta^*(X_{j_i}, \eta_{j_i}, \epsilon_{j_i} \mid D_{j_i} = 1),$$

where $\epsilon_{j_i}$ represents the error term in the user's click response, and $\eta_{j_i}$ is the unobserved heterogeneity of click behavior that correlates with some or all of $X_{j_i}$, which consists of user and ad features, but cannot be observed; it is known as the omitted variable. $\theta^*(\cdot)$ is a function that returns the predictive probability of the event $y_{j_i} = 1$.

Treatments are determined in the auction system together with predicted values such as pCTR and pCVR, which are conditioned on the user and ad features involved in ad auctions, and the advertiser's bid amount. At this point, pCTR and pCVR are not conditioned on the omitted variables $\eta_{j_i}$, which generates a bias in the estimates of the predicted outcome. Since the bid amount is determined from the predictions with this bias and an auction is formed, there is a strong suspicion that the impressions $D_{j_i}$ are endogenous variables, that is, variables correlated with the error term, with the omitted variable bias amplified through the auction. We consider the assumption that no omitted variables exist as a type of inductive bias, a convenient assumption for the pCTR model.

Unconfoundedness, i.e., a situation where no omitted variables exist, is a somewhat severe assumption for real-world data. Therefore, IVs methods that do not require the assumption of unconfoundedness can be compelling and valuable.

2.3. Validating Bid Amounts as IVs

There are three conditions that valid IVs satisfy. The first is the relevance of the IVs to a treatment variable. The second
is an exclusion restriction, where the IVs does not directly             • Q.1 Do prediction methods using simple neural net-
affect the outcome but rather affects the outcome through                  works with IVs perform in the online ad auction
the treatment variable. The third is the independence of the               setting? and
IVs with respect to the treatment and the outcome. Notating              • Q.2 Is IVs heterogeneity strongly present in online
IVs vector in ad 𝑗𝑖 as 𝑍𝑗𝑖 and combining these conditions, we              ad auction settings and is explicitly addressing it
can write them as follows:                                                 effective in prediction?,
       𝑅𝑒𝑙𝑒𝑣𝑎𝑛𝑐𝑒 ∶                                𝐷𝑗𝑖 ⟂𝑍̸ 𝑗𝑖 ,           • Q.3 Heterogeneity in treatment effects is widely
                                                                           known, but by how much improvement relative to
       𝐸𝑥𝑐𝑙𝑢𝑠𝑖𝑜𝑛 𝑅𝑒𝑠𝑡𝑟𝑖𝑐𝑡𝑖𝑜𝑛 ∶            {𝜖𝑗𝑖 , 𝐷𝑗𝑖 } ⟂ 𝑍𝑗𝑖 ,             accounting for heterogeneity in IVs?
       𝐶𝑜𝑛𝑑𝑖𝑡𝑖𝑜𝑛𝑎𝑙 𝐼 𝑛𝑑𝑒𝑝𝑒𝑛𝑑𝑒𝑛𝑐𝑒 ∶          𝜖𝑗𝑖 | 𝑋𝑗𝑖 ⟂ 𝑍𝑗𝑖 ,
                                                                   To introduce models that respond to those questions, the
We argue that bid amounts is valid as IVs in ad auctions.          methodology section is organized as follows. For Q.1, We
The reason bid amounts function as IVs is summarized in            first introduce the basic structure of the nonparametric IVs
Figure 2 under our proposed IVs formulation.                       method and highlight its heterogeneous relevance to the
   With regard to the relevance between bid amounts and            probability of winning impressions in ad auctions. Next,
impressions, the relevance is explicitly acknowledged by the       Q.2, we present a method based on an attention network
fact that the main item in the auction score is the bid amount.    that explicitly considers interactions between IVs and their
Concerning the exclusion restriction, the bid amount only          other features. Finally, Q.3, we explicitly incorporate hetero-
influences impressions through the auction score. There-           geneity in click probabilities by employing an interaction
fore, the bid amounts does not influence the user’s click          structure similar to the heterogeneity of instrumental vari-
behavior. Conditional on the variables used by advertisers         ables. Figure 3 summarizes our proposed final IVs method.
and platforms to set bid amounts, bid amounts are valid               For simplicity in subscripting the training data, 𝑙 corre-
instruments.                                                       sponds to the record number in this section.

2.4. Reasons Other Variables are Not Valid                              y!"#!$                        𝒑𝑰𝑴𝑷        Sigmoid                        𝒁


     IVs                                                                             Attention
                                                                                     Network
                                                                                                     𝑥&, '()*                  Attention
                                                                                                                               Network
                                                                                                                                              𝑥&, '()*


                                                                                                       …


                                                                                                                                                …
Here, we introduce why other variables, such as bid times
                                                                       Sigmoid            i.e.,                                     i.e.,
                                                                                      Leveraging                                Leveraging
                                                                                         pIMP
                                                                                                     𝑥', '()*                        IV
                                                                                                                                              𝑥', '()*
                                                                                     Interactions                              Interactions

used for targeting, do not meet the conditions of an instru-                                          𝑥&, $%
                                                                                                                   NN
                                                                                                                                              𝑥&, $%

mental variable in ad auctions.
                                                                                                       …


                                                                                                                                                …
                                                                         NN
   Relevance : Take targeting variables as an example.                                                𝑥!, $%                                   𝑥!, $%


From the perspective of relevance, advertisers determine                         Second Stage                               First Stage

bid amounts based on targeting users, which should relate to the probability of assignment. Bid amounts influence the auction score directly, ensuring stronger relevance than targeting variables, which have only an "indirect" relevance to the auction score.

   Conditional Independence: The more crucial condition, however, is that targeting variables do not satisfy independence from the unobserved factors affecting the user's probability of clicking. For instance, consider bid times as one of the targeting variables. The time when a user requests an advertisement, that is, the user's visitation process, and the probability of clicking the ad can be related. Users visiting at 10 AM may have a higher or lower probability of clicking an ad, and even when conditioning on other targeting variables, the presence of unobserved factors makes it impossible to guarantee the independence of bid times from the click probability. On the other hand, the probability that a user will click is considered independent of the bid amount, conditioned on the targeting variables, since the user cannot know how much was paid for the specific advertisement at the time of the click.

   Exclusion Restriction: From the perspective of the exclusion restriction, targeting variables affect the probability of a user's click, and there is no guarantee that their influence on the click probability is exerted solely through the assignment of impressions.

3. Click Prediction with First-stage IVs Heterogeneity

In the methodology section, we propose several variants of the IVs method to examine the following questions:

Figure 3: IV-IMP Approach Leveraging First- and Second-stage Heterogeneity with Multi-task Learning Structure

3.1. First-stage IVs Heterogeneity in Ad Auctions

In principle, we can estimate a user's click response $y_l$ using IVs in a two-stage approach. Following the nonparametric IVs notation of [13], the incorporation of heterogeneity in the first stage can be written as follows:

$$p(y_l = 1) = \phi^*(X_l, p(Z_l, X_l), \epsilon_l),$$
$$p(Z_l, X_l) = p(D_l = 1 \mid X_l, Z_l),$$

where $p(Z_l, X_l)$ is an instrument summarized by the interaction of multiple IVs, and we assume that $D_l$ depends only on $X_l$ through $p(Z_l, X_l)$ and call it the first stage. $\phi^*$ is a function that returns a predictive probability of the event $y_l = 1$, which is called the second stage. In ad auctions, $p(Z_l, X_l)$ is the predicted impression probability, henceforth pIMP, which fits a multi-task learning framework and can be trained in one step together with pCTR. Using neural networks, a layer structure can be used that follows the simplified manner of IVs, which we henceforth refer to as the IV-BS approach.

   Although there can be several approaches to incorporating interactions between features and IVs, we use an attention network, because it suffices for validating the idea of bid amount heterogeneity.
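As a rough illustration of the two-stage structure in Section 3.1, the following NumPy sketch computes pIMP from the features and the bid amount, then feeds pIMP (but not the bid amount itself) into the pCTR stage. It is not the authors' implementation: single logistic layers stand in for the hidden layers, the parameters are random rather than trained, and the names `iv_bs_forward`, `w_imp`, `b_imp`, `w_ctr`, and `b_ctr` are hypothetical.

```python
import numpy as np


def sigmoid(t):
    """Logistic link, (1 + exp(-t))^-1."""
    return 1.0 / (1.0 + np.exp(-t))


def iv_bs_forward(X, Z, params):
    """Two-stage forward pass: pIMP from (X, Z), then pCTR from (X, pIMP)."""
    # First stage: predicted impression probability p(D = 1 | X, Z).
    first_in = np.concatenate([X, Z], axis=1)
    p_imp = sigmoid(first_in @ params["w_imp"] + params["b_imp"])
    # Second stage: the click model sees the features and pIMP,
    # so the instrument Z enters only through pIMP.
    second_in = np.concatenate([X, p_imp], axis=1)
    p_ctr = sigmoid(second_in @ params["w_ctr"] + params["b_ctr"])
    return p_imp, p_ctr


# Toy inputs (assumed shapes): 4 records, 3 features, 1 bid amount each.
rng = np.random.default_rng(0)
batch, n_feat = 4, 3
X = rng.normal(size=(batch, n_feat))
Z = rng.uniform(size=(batch, 1))
params = {
    "w_imp": rng.normal(size=(n_feat + 1, 1)), "b_imp": 0.0,
    "w_ctr": rng.normal(size=(n_feat + 1, 1)), "b_ctr": 0.0,
}
p_imp, p_ctr = iv_bs_forward(X, Z, params)
print(p_imp.shape, p_ctr.shape)
```

In the multi-task setting described above, both stages would be trained jointly; this sketch covers only the forward pass.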
3.2. Leveraging First-Stage IVs by Interactions

Given a dataset, let the input feature matrix be represented as $K$ after passing through an input layer where all units are fully connected, including units from pIMP and features. Let $B$ denote the batch size and $L$ represent the number of units in the input layer, so that $K$ has dimensions $B \times L$. The instrumental variable, represented as matrix $Z$, has dimensions $B \times 1$. To align with the shape of $K$, matrix $Q^{iv}$ is formed by performing a tiling operation on $Z$. Specifically, each row of $Z$ is replicated according to the number of columns in $K$. Furthermore, the weight matrix for IVs interaction is denoted as $W^{iv}$ and has dimensions $L \times L$. Using these matrices, the attention score $\alpha^{iv}$ is calculated as:

$$\alpha^{iv} = \mathrm{Softmax}(W^{iv}(Q^{iv} \odot K) + b^{iv}).$$

Here, we use the swish function as an activation function in the weight matrix $W^{iv}$ so as to represent the non-linear strength in the heterogeneity of bid amounts. We feed element-wise products as interactions into the fully connected layer with the softmax function as the activation function to generate the attention score $\alpha^{iv}$. Then, we obtain the representation $g$ by the element-wise product of the input layer $K$ and the generated attention scores $\alpha^{iv}$:

$$g^{iv} = \alpha^{iv} \odot K$$

We combine the representation $g$ obtained by the attention layer and the features input in a fully connected neural network to form the hidden layer.

3.3. Second-stage Heterogeneity

In the second stage, namely on the pCTR side, it is evident that heterogeneity exists in the effect of impressions when conditioning on user and advertisement features. Just as we took the element-wise product of bid amounts and feature units in the input layer in the first stage, we symmetrically use the same construction in the second stage. The input layer consists of fully connected units from pIMP and features. The structure of the entire network, including pIMP and pCTR, is drawn in Figure 3. The attention score and representation $g$ can be written as follows:

$$\alpha^{imp} = \mathrm{Softmax}(W^{imp}(Q^{imp} \odot K) + b^{imp}),$$
$$g^{imp} = \alpha^{imp} \odot K,$$

where $Q^{imp}$ is formed by performing a tiling operation on pIMP to align with the shape of $K$. Specifically, each row of pIMP is replicated according to the number of columns in $K$. $W^{imp}$ is a weight matrix of $L \times L$ for pIMP interaction.

3.4. Loss Function for Multi-task Learning

In the multi-task learning framework for pIMP and pCTR, we adjust the loss function for pCTR by applying sample weights through an indicator function, $\mathbb{1}_{\{D_l = 1\}}$:

$$\mathrm{Loss}_{pCTR} = \mathrm{Loss}_{pCTR} \times \mathbb{1}_{\{D_l = 1\}}$$

This function ensures that $\mathrm{Loss}_{pCTR}$ is only computed for data points with impressions, that is, when $D_l = 1$, so that instances without impressions do not affect the pCTR loss calculation. This approach allows us to concentrate on the performance of the model in predicting CTR.

4. Experiments

The experimental section is divided into two parts: simulation, and evaluation in scenarios approximating the cold-start problem with real datasets. The code for replication is available at the following link: https://github.com/ryohei-emori/NPIV-pCTR. Please note that the repository excludes sections related to private data.

   The notation is consistent with that used in Section 3.

4.1. Simulated Datasets

Algorithm 1 Simulating auction data and validating baselines
   1. Initializing parameters:
      Set parameters $(\alpha, \beta, \gamma)$
      $k := 0$
      while $k < 5{,}000$ do
         Generate $X_k$ and $\eta_k$
         $D_k \sim \mathrm{Bernoulli}(p_{D_k})$, where $p_{D_k} = \mathrm{Logistic}(X_k'\alpha + \eta_k)$
         if $D_k = 1$ then
            $y_k \sim \mathrm{Bernoulli}(p_{y_k})$, where $p_{y_k} = \mathrm{Logistic}(X_k'\beta + \eta_k)$
            $k := k + 1$
         end if
      end while
      Train pCTR: $p(y_k = 1 \mid D_k = 1) := \theta(X_k)$
   2. Generating historical auction data:
      for each auction $i$ in $5{,}000$ do
         $m_i = 20$
         Generate $X_{j_i}$ and $\eta_{j_i}$
         $Bid_{j_i} \sim \mathrm{Beta}(\mu, 2)$ by [14], where $\mu := \mathrm{Logistic}(X_{j_i}'\gamma)$
         $pCTR_{j_i} = \theta(X_{j_i})$
         $j_i^* := \arg\max_{j_i \in \{1, \dots, m_i\}} \text{Auction Score}_{j_i}$, where $\text{Auction Score}_{j_i} := Bid_{j_i} \times pCTR_{j_i}$
         $y_{j_i} \sim \mathrm{Bernoulli}(p_{j_i})$ & $D_{j_i} = 1$ if $j_i = j_i^*$, where $p_{j_i} = \mathrm{Logistic}(X_{j_i}'\beta + \eta_{j_i})$
         $y_{j_i} = 0$ & $D_{j_i} = 0$, otherwise
      end for
   3. Learning pCTR with historical data:
      $\{(y_{j_i}, X_{j_i}, Bid_{j_i}, D_{j_i}),\ j_i = 1, \dots, m_i,\ i = 1, \dots, 5{,}000\}$
   4. Validating pCTR with independently displayed data:
      $\{(y_l, X_l, D_l = 1),\ l \in \{1, \dots, 50{,}000\}\}$, where $y_l \sim \mathrm{Bernoulli}(p_l)$, $p_l = \mathrm{Logistic}(X_l'\beta + \eta_l)$, with generated $X_l$ and $\eta_l$.

   The procedures for simulating the auction data are summarized in Algorithm 1, aligning with the procedure and notation in Section 3.1. The experiment is replicated 20 times. The subscripts $k$ and $l$ correspond to the number of records in steps 1 and 4, respectively. $\theta(X_k)$ is learned by logistic regression. We use the Beta distribution for generating bid amounts, which satisfies non-negativity constraints. Specifically, we use the reparametrized Beta distribution by [14] to model the mean of bid amounts. For simplicity, the number of bidders $m_i$ participating in auction $i$ is fixed, but in reality, it may vary depending on the attractiveness of users, represented by $X_{j_i}$. The link function $\mathrm{Logistic}(\cdot)$ is defined as $(1 + \exp(-\cdot))^{-1}$. The feature vectors $X_k$, $X_{j_i}$, and $X_l$ are each $25 \times 1$ vectors. Each $X_{s,k}$ is drawn from a specific distribution: Uniform[−5, 5] for $s \in \{1, \dots, 10\}$,
Bernoulli(0.5) for $s \in \{11, \dots, 20\}$, and Uniform[−2, 2] for $s \in \{21, \dots, 25\}$. These vectors are generated similarly. The vectors $\eta_k$, $\eta_{j_i}$, and $\eta_l$ are generated from a Uniform[−5, 5] distribution. The parameters $\alpha$, $\beta$, and $\gamma$ are coefficient vectors with $25 \times 1$ elements each, independently generated from a normal distribution with a mean of 0.1 and variance of 1.

   We assume that rare ads and users have more prominent unobserved confounding factors, and thus evaluate the predicted CTR after partitioning the test data by the magnitude of the omitted variable. Accordingly, the test data is separated by the distance of $\eta_l$ from the mean. Out of a total of 50,000 records, we shift the outside quantiles of the distribution of $\eta_l$ by 10% on each side.

4.2. Real Datasets

The actual dataset consists of user responses to advertisements displayed on websites such as Yahoo! JAPAN, operated by LY Corporation, and auction history records including bidding. The datasets are divided into a training dataset, in which ad impressions and clicks are observed through ad auctions, and a test dataset, in which ad impressions are randomly made to visiting users.

4.2.1. Training data

The training data covers a sample of 50,000 records randomly drawn from the population for a past seven-day period. The training data were generated by the ad auction system, which produced data not satisfying the condition of conditional independence between the treatment $D_{j_i}$ and unobserved confounders $\epsilon_{j_i}$.

4.2.2. Test data

The prediction baselines are evaluated on test data from the day after the seven days of training data. The test dataset consists of all independently displayed records conditional on ads' targeting variables.

   To evaluate the model's performance in cold-start scenarios, the test data was divided based on previous ad impressions. Specifically, the data was split into 20 subsets at every 5% quantile, with each subset containing data points below the respective quantile. To ensure sufficient sample size, the test data included 2,000,000 records. Predicting clicks with more past impressions is generally easier, even with a simple baseline.

4.3. Evaluation Score

We used log loss, a standard evaluation metric for pCTR, and the area under the curve (AUC) scores. AUC is an appropriate metric for rankings, as it assesses the ability to predict the correct position in auction rankings. For the simulation data, we employ the actual scores and relative scores to compare improvements. For our real dataset, we present relative evaluation scores due to confidentiality. The relative scores are defined as follows:

$$\text{Relative LogLoss} = \frac{\text{Naive LogLoss} - \text{Compared LogLoss}}{\text{Naive LogLoss}} \times 100,$$
$$\text{Relative AUC} = \left(\frac{\text{Compared AUC} - 0.5}{\text{Naive AUC} - 0.5} - 1\right) \times 100.$$

4.4. Ablation studies

To evaluate our proposed methods with instrumental variables, we took a naive benchmark and comparative baselines.

   1. Naive: The Naive has three hidden layers between the input layer of features and their passage to the sigmoid function, building a pCTR model. Each of these hidden layers consists of 256 units. The first layer uses the swish activation function, while the second and third layers use the ReLU activation function.
   2. IV-BS: The baseline is described in Section 3.1. Its pCTR model has the same network structure as Naive, including pIMP in the input layer.
   3. IV-FS: The baseline is described in Section 3.2. On the pCTR side, it has the same network structure as IV-BS.
   4. IV-SSFS: The baseline on the pCTR side is described in Section 3.3, while its network has the same structure as IV-FS on the pIMP side.
   5. UBIPS: It consists of pIMP times pCTR, following the unbiased inverse propensity weighting estimator [15]. Its network structure is consistent with IV-BS for pIMP and pCTR, excluding pIMP in the input of pCTR. It also uses a multitasking framework.

The IV-FS and IV-SSFS are not tested in our simulated dataset for two reasons. First, IV-BS is sufficient to test whether bid amounts are efficient and valid IVs in ad auctions. Second, those approaches are not suited to the simple structure, such as linear interactions, of the heterogeneity of IVs and the user's click probability in our simulated dataset.

   In these experiments, the loss function is unified across the comparative baselines. The pCTR and pIMP models both use binary cross entropy as their loss function. We trained the comparison models until convergence, where no further improvement in the pCTR loss function was observed. For all comparative approaches, the optimization method was Adamax, and the learning rate was fixed at 0.001.

4.5. Comparing the Baselines

[Figure 4: two panels, (a) LogLoss & Relative LogLoss and (b) AUC & Relative AUC, each plotted against the outside quantiles of $\eta_l$ in users' click response; legend: Naive, IV-BS, UBIPS.]

Figure 4: Simulation: Performance scores at each outside quantile of $\eta_l$. Box plots show actual scores. Line plots show relative scores, with the bold line as the mean and shaded area showing replication variation.

4.5.1. In Simulated datasets

Figure 4 shows that IV-BS improves AUC and LogLoss performance even with omitted variables. IV-BS remains stable and robust, especially on the left side where the test data's $\eta_l$ value is high. Notably, omitted variable bias cannot be ignored even in the Weighted GSP impression assignment algorithm, and in this regard, IV-BS demonstrates superior performance.
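The relative scores reported in Figure 4 follow the definitions in Section 4.3. A minimal sketch of those two formulas is given below; the function names and the example numbers are illustrative assumptions, not values from the experiments.

```python
def relative_logloss(naive_logloss: float, compared_logloss: float) -> float:
    """Percentage reduction in log loss relative to the Naive baseline."""
    return (naive_logloss - compared_logloss) / naive_logloss * 100.0


def relative_auc(naive_auc: float, compared_auc: float) -> float:
    """Percentage gain in AUC above chance (0.5) relative to the Naive baseline."""
    return ((compared_auc - 0.5) / (naive_auc - 0.5) - 1.0) * 100.0


# Made-up scores, for illustration only.
print(round(relative_logloss(0.50, 0.40), 6))  # roughly 20: log loss fell by a fifth
print(round(relative_auc(0.70, 0.80), 6))      # roughly 50: margin over 0.5 grew by half
```

Under both definitions the Naive baseline scores 0 against itself and positive values indicate an improvement over Naive. The AUC version is undefined when the Naive AUC equals 0.5.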
[Figure 5: two panels, (a) Relative LogLoss and (b) Relative AUC, showing Relative LogLoss Improvements (%) and Relative AUC Improvements (%) against the outside quantiles of the number of previous ad impressions; legend: Naive, IV-BS, IV-FS, IV-SSFS, UBIPS.]

Figure 5: Real data: Performance scores at each quantile of previous ad impressions.

     causal user modeling, Advances in Neural Information Processing Systems 35 (2022) 14419–14433.
 [6] J. Hartford, G. Lewis, K. Leyton-Brown, M. Taddy, Deep iv: A flexible approach for counterfactual prediction, in: International Conference on Machine Learning, PMLR, 2017, pp. 1414–1423.
 [7] Z. Si, X. Han, X. Zhang, J. Xu, Y. Yin, Y. Song, J.-R. Wen, A model-agnostic causal learning framework for recommendation using search data, in: Proceedings of the ACM Web Conference 2022, WWW '22, Association for Computing Machinery, New York, NY, USA, 2022, p. 224–233. URL: https://doi.org/10.1145/3485447.3511951. doi:10.1145/3485447.3511951.
 [8] G. W. Imbens, Instrumental variables: An econometrician's perspective, Statistical Science 29 (2014) 323–358. URL: http://www.jstor.org/stable/43288511.
 [9] A. Belloni, D. Chen, V. Chernozhukov, C. Hansen, Sparse models and methods for optimal instruments with an application to eminent domain, Econometrica 80 (2012) 2369–2429.
[10] A. Abadie, J. Gu, S. Shen, Instrumental variable estimation with first-stage heterogeneity, Journal of econometrics (2023) 105425–.
[11] D. R. Thompson, K. Leyton-Brown, Revenue optimization in the generalized second-price auction, in: Proceedings of the fourteenth ACM conference on Electronic commerce, 2013, pp. 837–852.
[12] Y. Sun, Y. Zhou, X. Deng, Optimal reserve prices in weighted gsp auctions, Electronic Commerce Research and Applications 13 (2014)

4.5.2. In Real dataset

An evaluation of our proposed methods on the real dataset is shown in Figure 5. It is expected that Naive performs relatively well since the training data includes many ads with numerous impressions. However, our proposed methods, IV-BS, IV-FS, and IV-SSFS, show significant improvement in relative AUC, particularly for ads with few previous impressions. The improvement of UBIPS over Naive, unlike in the simulation experiment, is likely attributable to the confounder being associated with the variable observed in the actual data.

   Improvement for ads with few impressions matches that for ads with many, likely due to the infrequent inclusion of rare ads in training data, causing popularity bias. Notably, the increasing improvement of IVs methods for the 0 − 20 quantile of previous impressions demonstrates their
   robustness in predicting rare ads.
                                                                                                                                                                                                                       178–187. URL: https://www.sciencedirect.com/
                                                                                                                                                                                                                       science/article/pii/S1567422314000106. doi:https:
     5. Conclusion                                                                                                                                                                                                     //doi.org/10.1016/j.elerap.2014.02.003 .
                                                                                                                                                                                                                  [13] M. Frolich, Nonparametric iv estimation of local av-
   This paper argues that bid amount is a valid instrumen-                                                                                                                                                             erage treatment effects with covariates, Journal of
   tal variable under the assumption of conditional indepen-                                                                                                                                                           econometrics 139 (2007) 35–75.
   dence, and tested its validity by applying it to predictive                                                                                                                                                    [14] S. Ferrari, F. Cribari-Neto, Beta regression for mod-
   CTR. Our experiment on a real dataset showed that explicitly                                                                                                                                                        elling rates and proportions, Journal of applied statis-
   accounting for heterogeneity in the strength of IVs allows                                                                                                                                                          tics 31 (2004) 799–815.
   for efficient and robust predictions. For greater extensi-                                                                                                                                                     [15] Y. Saito, S. Yaginuma, Y. Nishino, H. Sakata, K. Nakata,
   bility, incorporating complex interactions between IVs and                                                                                                                                                          Unbiased recommender learning from missing-not-
   other features with more developed approachs such asgraph                                                                                                                                                           at-random implicit feedback, in: Proceedings of the
   neural networks is recommended. Additionally, addressing                                                                                                                                                            13th International Conference on Web Search and
   other looping bias and validating prediction methods in                                                                                                                                                             Data Mining, WSDM ’20, Association for Comput-
   repeated auctions would be valuable.                                                                                                                                                                                ing Machinery, New York, NY, USA, 2020, p. 501–509.
                                                                                                                                                                                                                       URL: https://doi.org/10.1145/3336191.3371783. doi:10.
                                                                                                                                                                                                                       1145/3336191.3371783 .
     References
                                [1] V. Marotta, Y. Wu, K. Zhang, A. Acquisti, The welfare
                                    impact of targeted advertising technologies,
                                    Information Systems Research 33 (2022) 131–151.
                                    doi:10.1287/isre.2021.1024.
                               [2] J. Chen, H. Dong, X. Wang, F. Feng, M. Wang, X. He,
                                   Bias and debias in recommender system: A survey and
                                   future directions, ACM Transactions on Information
                                   Systems 41 (2023) 1–39.
                               [3] P. Bühlmann, Invariance, causality and robustness,
                                   Statistical science 35 (2020) 404–426.
                               [4] Y. He, Z. Wang, P. Cui, H. Zou, Y. Zhang, Q. Cui,
                                   Y. Jiang, Causpref: Causal preference learning for
                                   out-of-distribution recommendation, in: Proceedings
                                   of the ACM Web Conference 2022, 2022, pp. 410–421.
                                [5] A. Feder, G. Horowitz, Y. Wald, R. Reichart, N. Rosenfeld,
                                    In the eye of the beholder: Robust prediction with