<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Perils of the MoB. The Challenges in the Use of Big Data in Causal Investigations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Leombruni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sonia Della Monica</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Turin, Department of Economics and Statistics</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Month of birth (MoB) is often used as an instrumental variable (IV) in studying the causal patterns linking work careers and health. A recent literature, however, is questioning its validity: since parents may manipulate the timing of births there is a potential correlation e.g. with mothers' characteristics which may be also relevant for the outcome. In this paper we consider a related but different matter. The use of the MoB as an IV rests in its empirical association with several later outcomes, such as e.g. educational attainment and age at marriage. This implies that there may be multiple pathways going from the IV to the outcome, threatening the validity of a straightforward application of an IV estimator. We discuss the issue considering the relationship between work careers and health, exemplifying it with a Monte Carlo simulation and an application to Italian data.</p>
      </abstract>
      <kwd-group>
        <kwd>Month of Birth</kwd>
        <kwd>Instrumental Variables</kwd>
        <kwd>Work and Health relations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Besides its impact on businesses and the economy at large, the advent of Big Data is increasingly
shaping the production of official statistics and academic research. Already a decade ago,
in leading economic journals such as the American Economic Review and the Quarterly Journal of
Economics, the share of articles based on traditional statistical surveys was shrinking in favor of
those based on Big Data from administrative sources [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The high level of detail and the huge sample
size they provide are proving particularly well suited to the implementation of causal
modelling, for instance in regression discontinuity designs, where the identification
strategy rests on the availability of a mass of individuals close to a threshold such as a legal age requirement
set by a policy. Another interesting setting is that of instrumental variables estimators, which
require a first stage in which a potentially endogenous explanatory variable is modelled as a function of other
characteristics exogenous to the model. Here the leverage provided by Big Data is twofold. On
the “wide” side, the availability of high-dimensional data allows the implementation of ML predictive
models in the first stage to control for bias more effectively, as in double or de-biased machine
learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. On the other side, instrumental variables are typically based on some “weird” and
unexpected causal connection between the endogenous variable and a characteristic apparently
unrelated to the phenomenon of interest. Here, the leverage of sample size and accuracy of
information is essential to turn these correlations into strong instruments for identifying the causal
relationship of interest.
      </p>
      <p>2nd Workshop "New frontiers in Big Data and Artificial Intelligence" (BDAI 2025), May 29-30, 2025, Aosta, Italy. ∗ Corresponding author.</p>
      <p>roberto.leombruni@unito.it (R. Leombruni); sonia.dellamonica@unito.it (S. Della Monica)</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        An interesting example of the latter case is the spread of causal research exploiting detailed
information on individuals’ month of birth (MoB). Since the seminal paper by Angrist and Krueger
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] on the returns to education, the exact moment of the year when individuals are born has
increasingly been used as an instrumental variable to investigate other causal patterns as well, thanks
to its empirical association with several later outcomes besides educational attainment, such as age
at marriage and maternity [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ], and work-career transitions [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. A recent literature, however,
questions its use, challenging the validity (exogeneity) of the instrument. For
example, parents may actually manipulate the timing of birth for several reasons, creating a potential
correlation, e.g., with individual characteristics of the mother or the socio-economic position of the
family, which may in turn be relevant for the outcome of interest [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ].
      </p>
      <p>In this paper we focus on a related but different matter, connected with the statistical power
granted by large administrative datasets. The IV identification strategy, as in Angrist and
Krueger’s paper, rests on the identification of some unexpected causal pathway going from the
instrument (season of birth) to an endogenous variable (education) to the outcome of interest
(wages). A potential paradox here is that the larger and more detailed the data are, the more
likely it is that other causal pathways will be revealed too. In other words, the more an instrumental
variable is successfully associated with different characteristics, the more likely it is that there are
other pathways going from the selected instrument to the outcome of interest, threatening the
validity of the IV even assuming no parental manipulation of the timing of birth. As an example, the
season of birth may be related to contemporary factors such as climate conditions or air
pollution, which may be important when health is a relevant dimension of the study. In this
contribution we discuss the issue by sketching various causal settings using Directed Acyclic Graphs
(Section 2), illustrating the proper use of IV with a Monte Carlo simulation (Section 3), and
discussing the relevance of the issue for the relationship between work careers and health
in the Italian context (Section 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. A graphical representation of the causal pathways</title>
      <p>In Figure 1 we use two Directed Acyclic Graphs (DAGs) to represent Angrist and Krueger's example
(left panel) and the issue of parental manipulation (right panel). The interest is in the returns to
education, where the identification issue is due to a confounding, unobserved factor (ability), which
influences both education and wage. An OLS regression of wage on education will
overestimate the returns to education, since education, which is positively correlated with ability,
will tend to capture also the positive effect of ability on the wage. In this case, however, since the
MoB has no other direct or indirect connections with ability and wage, it can be used as an
instrumental variable, and the returns to education may be identified by regressing wage on
E(education|MoB).</p>
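      <p>As a hedged illustration of this last step, the sketch below simulates the left panel of Figure 1 and compares a naive OLS slope with an IV estimate obtained by regressing wage on E(education|MoB). All numbers (coefficients, noise levels, sample size), the binary instrument late and the function names are assumptions of ours made for illustration; they are not the specification used by Angrist and Krueger.</p>
      <preformat><![CDATA[
```python
import numpy as np

def simulate(n, rng):
    # Every coefficient below is an illustrative assumption.
    ability = rng.normal(0.0, 1.0, n)          # unobserved confounder
    mob = rng.integers(1, 13, n)               # month of birth, 1..12
    late = (mob > 6).astype(float)             # assumed binary instrument: born July-December
    education = 12.0 + 1.0 * late + 1.5 * ability + rng.normal(0.0, 1.0, n)
    wage = 10.0 + 0.10 * education + 0.50 * ability + rng.normal(0.0, 1.0, n)
    return late, education, wage

def ols_slope(x, y):
    # Slope of a simple regression of y on x.
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def iv_slope(z, x, y):
    # First stage: fitted values E(x | z); second stage: regress y on them.
    first = ols_slope(z, x)
    x_hat = x.mean() + first * (z - z.mean())
    return ols_slope(x_hat, y)

rng = np.random.default_rng(0)
late, education, wage = simulate(200_000, rng)
b_ols = ols_slope(education, wage)
b_iv = iv_slope(late, education, wage)
print("assumed true return: 0.10")
print("OLS:", round(b_ols, 3), "IV:", round(b_iv, 3))
```
]]></preformat>
      <p>In this simulated setting the OLS slope absorbs part of the effect of ability, while the IV slope should stay close to the assumed return, because the instrument reaches wage only through education.</p>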
      <sec id="sec-2-1">
        <title>Exogenous MoB</title>
      </sec>
      <sec id="sec-2-2">
        <title>Parenthood manipulation</title>
        <p>The issue with parental manipulation (right panel) is due to another confounding factor, the
family background, which affects both the MoB and wage. This issue, however, does
not necessarily hamper the identification of the return to education. When family background is
observable, it is sufficient to add it as a control in both stages of the IV estimator. When it is unobservable,
one can exploit the fact that the manipulation of births is not perfect, comparing those born in December
of year t with those born in January of year t+1, i.e., individuals born in the same season but in
different age cohorts.</p>
        <p>In Figure 2 we represent the issue we are focusing on, i.e., the investigation of the relationship
between career features and health. The identification issue here is reverse causality: working
conditions entail many risk and protective factors for health, and at the same time health exerts an
influence on individuals’ work careers. A classic example is the “healthy worker effect”, where
employed individuals are healthier than the general population, pointing to an apparent beneficial
impact of work on health, but this is due to selection into employment, since healthier people are
more likely to enter and remain in the workforce.</p>
      </sec>
      <sec id="sec-2-3">
        <title>MoB and residual confounding</title>
      </sec>
      <sec id="sec-2-4">
        <title>MoB multiple pathways</title>
        <p>In the left panel we represent a situation in which a researcher is interested in the impact of a
work-career feature (e.g. work exposure) on a health endpoint (e.g. stroke), while controlling for general
health at baseline (Health T0). The latter, however, is measured with error, and a proxy for general
health is used. This strategy is prone to the residual confounding produced by the impact of the
residual (unobserved) variability in health at baseline on both the work career and the health
outcome. Here, the use of an instrumental variable such as the MoB may serve as a way of blocking
the causal path from Health T0 to the work career, and hence of identifying the effect of interest by
regressing the health endpoint on E(career feature | MoB).</p>
        <p>
          The right panel highlights the fact that the MoB also influences the education
level, which in turn has a causal connection with general health at baseline. Note that in this
DAG education level has no direct effect on the health endpoint, nor a direct effect on the career
feature. Regarding the first point, the literature largely agrees on the presence of a causal
connection between the level of education and general health. Here we simply do not assume a
further, direct effect of education on the specific endpoint of interest, once health at
baseline is controlled for. Regarding the second point, education has indeed many potential connections with
several career features, but not all career events are necessarily causally linked to the level of
education (see e.g. [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] on job-hopping). What we are highlighting here is that, even in a highly
simplified situation in which there are no direct links between education and work career, there are
two causal pathways going from MoB to the health endpoint, threatening the validity of regressing
the outcome on E(career feature | MoB). In other words, even though education is technically not a
confounder, its connection with the exogenous instrumental variable “restores” the residual
confounding due to the imperfect measure of general health.
        </p>
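        <p>The difference between the two panels can be checked mechanically. The sketch below is a minimal illustration, assuming an edge list that is our own reading of the DAGs described in the text (the node names and the helper directed_paths are hypothetical). It lists the directed paths from MoB to the health endpoint that bypass the career feature, i.e. the paths that would violate the exclusion restriction.</p>
        <preformat><![CDATA[
```python
def directed_paths(edges, start, end, avoid=()):
    """Return all directed paths from start to end that skip the nodes in avoid."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    paths = []

    def walk(node, trail):
        if node == end:
            paths.append(trail)
            return
        for nxt in graph.get(node, []):
            if nxt not in avoid and nxt not in trail:
                walk(nxt, trail + [nxt])

    walk(start, [start])
    return paths

# Assumed edge lists, transcribed from the verbal description of Figure 2.
LEFT = [
    ("MoB", "career"),
    ("health_T0", "career"),
    ("health_T0", "endpoint"),
    ("health_T0", "proxy"),
    ("career", "endpoint"),
]
RIGHT = LEFT + [("MoB", "education"), ("education", "health_T0")]

left_bypass = directed_paths(LEFT, "MoB", "endpoint", avoid=("career",))
right_bypass = directed_paths(RIGHT, "MoB", "endpoint", avoid=("career",))
print("left panel:", left_bypass)
print("right panel:", right_bypass)
```
]]></preformat>
        <p>Under this reading, the left panel has no path from MoB to the endpoint other than through the career feature, while the right panel has one additional path running through education and health at baseline.</p>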
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. A Monte-Carlo illustration</title>
      <p>In this section we illustrate the impact of multiple pathways in the DAG on the application of an IV
strategy. We consider a data generating process (DGP) closely mimicking the DAG in Figure 2, with the
following steps:</p>
      <list list-type="bullet">
        <list-item><p>Month of birth is sampled from a discrete U[1, 12].</p></list-item>
        <list-item><p>Education is sampled from a discrete U[8, 24], with a shift of up to 2 years for those born later in the year.</p></list-item>
        <list-item><p>General health is sampled from a U(0, 1), with a shift of up to 0.16 for individuals with higher education.</p></list-item>
        <list-item><p>The health proxy adds to general health a classical measurement error sampled from a U(-0.3, 0.3).</p></list-item>
        <list-item><p>The career feature, interpreted as the age at retirement, is sampled from a discrete U[55, 68], with a shift of up to 2 years for healthier individuals and up to 2 years for those born earlier in the year.</p></list-item>
        <list-item><p>The DGP of the health outcome is Y = 100 + 100*health_T0 − age_at_retirement + ε, with ε ~ N(0, 20).</p></list-item>
      </list>
      <p>
In Table 1 we present the results of three different models. The first three columns report the
Monte Carlo coefficient estimates using an OLS model with various specifications, averaged over 1,000
iterations with a sample size of 10,000 simulated individuals. Column (a) reports the unbiased average
estimates obtained with full data, i.e. regressing the outcome on total work exposure and a correct
measure of baseline health. Using the health proxy (Column b), the residual confounding
due to the share of general health variability not captured by the proxy is apparent (-0.56 against a true
value of -1). Adding education to the specification, as a further control for general health at baseline,
does not notably correct the bias (Column c).</p>
      <p>Using the MoB as an instrument for total work exposure in order to control for endogeneity does
not correct the bias either; in our setting, it leads to an overestimation of the effect size (Column d).
Bias correction is achieved only by adding education as an instrument in the first stage of the IV estimation
(Column e).</p>
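      <p>The five specifications can be sketched in a few lines of code. The example below is not the simulation behind Table 1: the signs and linear forms of the “up to” shifts, the reading of 20 as a standard deviation, the use of MoB as a single linear instrument, the inclusion of education as a control in both stages for column (e), and the reduced number of iterations (200) are all assumptions of ours. Its estimates should therefore not be expected to match the published values.</p>
      <preformat><![CDATA[
```python
import numpy as np

def simulate(n, rng):
    # Signs and functional forms of the "up to" shifts are assumptions made here.
    mob = rng.integers(1, 13, n)                         # discrete U[1, 12]
    late = (mob - 1) / 11.0                              # 0 = January, 1 = December
    edu = rng.integers(8, 25, n) + 2.0 * late            # assumed: up to +2 years if born later
    edu_rank = (edu - 8.0) / 18.0                        # rescaled to [0, 1]
    health = rng.uniform(0.0, 1.0, n) + 0.16 * edu_rank  # assumed: up to +0.16
    proxy = health + rng.uniform(-0.3, 0.3, n)           # classical measurement error
    retire = (rng.integers(55, 69, n)                    # discrete U[55, 68]
              + 2.0 * health / 1.16                      # assumed: up to +2 years if healthier
              + 2.0 * (1.0 - late))                      # assumed: up to +2 years if born earlier
    eps = rng.normal(0.0, 20.0, n)                       # assumed: 20 is a standard deviation
    y = 100.0 + 100.0 * health - retire + eps
    return mob.astype(float), edu, health, proxy, retire, y

def ols(y, cols):
    # OLS with an intercept; returns the full coefficient vector.
    X = np.column_stack([np.ones(len(y))] + cols)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(y, endog, instruments, controls):
    # Two-stage least squares; the controls enter both stages.
    Z = np.column_stack([np.ones(len(y))] + instruments + controls)
    endog_hat = Z @ np.linalg.lstsq(Z, endog, rcond=None)[0]
    return ols(y, [endog_hat] + controls)

def one_run(n, rng):
    mob, edu, health, proxy, retire, y = simulate(n, rng)
    # Index 1 is the coefficient on age at retirement (true value -1).
    return {
        "a": ols(y, [retire, health])[1],
        "b": ols(y, [retire, proxy])[1],
        "c": ols(y, [retire, proxy, edu])[1],
        "d": tsls(y, retire, [mob], [proxy])[1],
        "e": tsls(y, retire, [mob], [proxy, edu])[1],
    }

rng = np.random.default_rng(2025)
runs = [one_run(10_000, rng) for _ in range(200)]
means = {k: float(np.mean([r[k] for r in runs])) for k in runs[0]}
for k in sorted(means):
    print(k, round(means[k], 3))
```
]]></preformat>
      <p>Under these assumed signs the qualitative pattern described above should reappear: the proxy-based OLS estimates are attenuated, the IV estimate without education overshoots the true effect, and the IV estimate that accounts for education is centred near -1.</p>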
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>In the last decade there has been a surge in academic research and policy interest in the social
determinants of health and health inequalities, both internationally and in Italy. Due to the
bidirectional nexus between work and health biographies, however, the correct assessment of causal
relationships is still an open question. IV estimators, one of the gold standards for causal inference,
are increasingly used in the specialized literature on this matter as well, thanks also to the availability
of large datasets with very granular information on crucial characteristics of individuals and their
work careers.</p>
      <p>In this contribution we considered an instrument, the month of birth, which due to its
connection with many life course decisions may give rise to several causal pathways from the instrument
to the outcome. This situation, even in a very simplified setting where education does not exert any
direct influence on the explanatory variable of interest, may lead to highly biased estimates when the IV
strategy is applied without properly controlling for education.</p>
      <p>
        The relevance of the matter depends on the actual correlation between the instrument and the
variables of interest. In the case of Italy, a link between MoB and various work-career features,
among which labour market entry and age at retirement, has already been documented in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. As
regards the connection between MoB and education, in Figure 3 we present an original estimate
using administrative data of the Ministry of Welfare. The figure plots the partial correlation between
MoB and the average education level (average ISCED level, 4 categories), as estimated with a linear
regression model for individuals born in the years from 1940 to 1995, stratified by gender. Birth
cohort differences are controlled for with a set of dummies (reference category is 1940), so that the
plots represent the average effect of eleven MoBs on education level (reference category is January).
We find a statistically significant increase in educational attainment for all months up to July,
consistent with the seminal idea by Angrist and Krueger; the effect is particularly strong for males, and weaker and less
statistically significant for females.
      </p>
      <p>[Figure 3. Panels: Females, Males]</p>
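      <p>A regression of this kind can be sketched as follows. The data are synthetic and the month effects in assumed are hypothetical values chosen only to check that the estimator recovers them; the function month_effects and all variable names are ours, and nothing here reproduces the administrative data or the estimates of Figure 3.</p>
      <preformat><![CDATA[
```python
import numpy as np

def month_effects(mob, cohort, edu):
    # Dummies for February..December (January is the reference month).
    month_d = (mob[:, None] == np.arange(2, 13)).astype(float)
    # Dummies for cohorts 1941..1995 (1940 is the reference cohort).
    cohort_d = (cohort[:, None] == np.arange(1941, 1996)).astype(float)
    X = np.column_stack([np.ones(len(edu)), month_d, cohort_d])
    beta = np.linalg.lstsq(X, edu, rcond=None)[0]
    return beta[1:12]                      # the eleven month coefficients

rng = np.random.default_rng(3)
n = 60_000
mob = rng.integers(1, 13, n)               # month of birth, 1..12
cohort = rng.integers(1940, 1996, n)       # birth year, 1940..1995
assumed = np.linspace(0.0, 0.11, 12)       # hypothetical month effects, January = 0
edu = 2.0 + 0.01 * (cohort - 1940) + assumed[mob - 1] + rng.normal(0.0, 0.8, n)

estimated = month_effects(mob, cohort, edu)
print(np.round(estimated, 3))
```
]]></preformat>
      <p>To mirror the stratification by gender, the same function could be called separately on the female and male subsamples.</p>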
      <p>The third “ingredient” of the DAG of Figure 2, the measurement error on health at baseline, is
also relevant for Italian studies, since a thorough representation of individuals’ health is not present
in currently available socio-economic data on work and health biographies. As a consequence, to
properly investigate causal links between work exposures and health outcomes using MoB as an IV,
the level of education, even in an ideal situation in which it does not play any role for the work
dimension of interest, should be included in the first and second stage of an IV estimation. Its
exclusion could induce confounding through the causal pathway from MoB to education to health.</p>
      <p>
        The general point is that the availability of high-frequency and highly granular
socio-economic and spatial data allows the detection of several statistically significant correlations
between apparently unconnected phenomena. On the one hand this allows a broader
application of causal modelling using an instrumental variable strategy; on the other, it exposes the analysis to the risk
of multiple causal chains going from the instrument to the outcome. This point has
recently been raised also by Mellon [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], who, in the case of climate conditions, proposed a strategy
to test the sensitivity of the estimates to possible violations of the exclusion restriction. In the case
exemplified here, we tested a direct solution to the issue using DAGs and a Monte Carlo simulation.
The somewhat counter-intuitive result is that, in our simplified illustration, the level of
education, which is neither an explanatory variable of the outcome, nor a mediator, nor a confounder, has
to be controlled for in the first stage of the estimator to achieve bias correction. As exemplified with
the DAG, this is necessary in order to block a secondary pathway going from the month of birth to
the health outcome, in line with Brito &amp; Pearl's conditions for identification using DAGs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <sec id="sec-4-1">
        <title>Declaration on Generative AI</title>
        <p>The authors have not employed any Generative AI tools.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Chetty</surname>
          </string-name>
          ,
          <article-title>Time trends in the use of administrative data for empirical research</article-title>
          .
          <source>34th Annual NBER Summer Institute</source>
          . Cambridge, Mass., July 9-27,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Leombruni</surname>
          </string-name>
          ,
          <article-title>Interviewing administrative records. A conceptual map for the use of big data for economic research</article-title>
          .
          <source>Italian Journal of Applied Statistics</source>
          ,
          <volume>36</volume>
          (
          <issue>3</issue>
          ),
          <year>2024</year>
          ,
          <fpage>295</fpage>
          -
          <lpage>326</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Chernozhukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chetverikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Demirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Duflo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Newey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Robins</surname>
          </string-name>
          ,
          <article-title>Double/debiased machine learning for treatment and structural parameters</article-title>
          ,
          <source>The Econometrics Journal</source>
          ,
          <volume>21</volume>
          (
          <issue>1</issue>
          ),
          <year>2018</year>
          ,
          <fpage>C1</fpage>
          -
          <lpage>C68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Angrist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <article-title>Does Compulsory School Attendance Affect Schooling and Earnings?</article-title>
          <source>The Quarterly Journal of Economics</source>
          ,
          <volume>106</volume>
          (
          <issue>4</issue>
          ),
          <year>1991</year>
          ,
          <fpage>979</fpage>
          -
          <lpage>1014</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>V.</given-names>
            <surname>Skirbekk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Kohler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Prskawetz</surname>
          </string-name>
          ,
          <article-title>Birth month, school graduation, and the timing of births and marriages</article-title>
          .
          <source>Demography</source>
          ,
          <volume>41</volume>
          ,
          <year>2004</year>
          ,
          <fpage>547</fpage>
          -
          <lpage>568</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Cavalli</surname>
          </string-name>
          ,
          <article-title>Does the Month of Birth Influence the Timing of Life Course Decisions? Evidence from a Natural Experiment in Italy</article-title>
          .
          <source>Open Journal of Social Sciences</source>
          ,
          <volume>2</volume>
          ,
          <year>2014</year>
          ,
          <fpage>101</fpage>
          -
          <lpage>118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Kirdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dayioğlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>İ.</given-names>
            <surname>Koç</surname>
          </string-name>
          ,
          <article-title>The effects of compulsory-schooling laws on teenage marriage and births in Turkey</article-title>
          .
          <source>Journal of Human Capital</source>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <year>2018</year>
          ,
          <fpage>640</fpage>
          -
          <lpage>668</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ardito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Leombruni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Blane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>d'Errico</surname>
          </string-name>
          ,
          <article-title>To Work or Not to Work? The Effect of Higher Pension Age on Cardiovascular Health</article-title>
          .
          <source>Industrial Relations</source>
          ,
          <volume>59</volume>
          ,
          <year>2020</year>
          ,
          <fpage>399</fpage>
          -
          <lpage>434</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Buckles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Hungerman</surname>
          </string-name>
          ,
          <article-title>Season of Birth and Later Outcomes: Old Questions, New Answers</article-title>
          .
          <source>The Review of Economics and Statistics</source>
          ,
          <volume>95</volume>
          (
          <issue>3</issue>
          ),
          <year>2013</year>
          ,
          <fpage>711</fpage>
          -
          <lpage>724</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Torun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tumen</surname>
          </string-name>
          ,
          <article-title>The empirical content of season-of-birth effects: An investigation with Turkish data</article-title>
          .
          <source>Demographic Research</source>
          <volume>37</volume>
          ,
          <year>2017</year>
          ,
          <fpage>1825</fpage>
          -
          <lpage>1860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Is the Quarter of Birth Endogenous? New Evidence from Taiwan, the US, and Indonesia</article-title>
          .
          <source>Oxford Bulletin of Economics and Statistics</source>
          ,
          <volume>79</volume>
          (
          <issue>6</issue>
          ),
          <year>2017</year>
          ,
          <fpage>1087</fpage>
          -
          <lpage>1124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>K.</given-names>
            <surname>Steenackers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Guerry</surname>
          </string-name>
          ,
          <article-title>Determinants of job-hopping: an empirical study in Belgium</article-title>
          .
          <source>International Journal of Manpower</source>
          ,
          <volume>37</volume>
          (
          <issue>3</issue>
          ),
          <year>2016</year>
          ,
          <fpage>494</fpage>
          -
          <lpage>510</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mellon</surname>
          </string-name>
          ,
          <article-title>Rain, rain, go away: 194 potential exclusion‐restriction violations for studies using weather as an instrumental variable</article-title>
          .
          <source>American Journal of Political Science</source>
          ,
          <year>2024</year>
          ,
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C.</given-names>
            <surname>Brito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pearl</surname>
          </string-name>
          .
          <article-title>Generalized Instrumental Variables</article-title>
          . arXiv, 12 December
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>