<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>What's Mine is Yours, What's Yours is Mine</article-title>
        <alt-title>Simplifying Significance Testing With Big Data</alt-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karan Matnani</string-name>
          <email>kmatnani@amazon.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valerie Liptak</string-name>
          <email>liptav@amazon.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>George Forman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Last Mile Tech</institution>
          ,
          <addr-line>Amazon</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>At Amazon Last Mile, we deliver over 3.5 billion [1] packages every year, making us one of the largest delivery companies in the world. At this scale, even small changes can have a big business impact. In general, business impact is assessed using controlled experimentation. A standard approach to evaluating whether controlled experiments resulted in a significant change has been to use a t-test. However, despite our scale, the law of large numbers fails to produce a normal distribution, and the t-test fails up to 99% of the time. In addition to exhibiting non-normal distributions, our application has restrictions on the granularity of control and treatment group splits and also suffers from geospatial correlation, which causes treatment effects to be applied across both control and treatment groups at finer granularities (e.g., delivery of packages to multiple homes by the same delivery agent in one stop; a building falling on different routes on two different days depending on other stops on that day). This introduces a tradeoff between separability of effects through coarse granularity and detection of smaller treatment effects with fine granularity. In this paper we solve the t-test dilemma using a resampling test at scale, and further leverage this test to create a scalable, repeatable methodology for choosing the randomization split granularity under these constraints. We produce a sensitivity-optimized randomization strategy using a data-driven approach that has been applied successfully within multiple real experiments at Amazon Last Mile Tech and is generalizable to any experiment.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>At the scale of billions of packages delivered annually, small
changes can have a substantial effect on customer experience.
Principled experimentation drives business in the right direction.
At Amazon Last Mile, changes affecting customers are rolled out
based on the results of a controlled experiment. Examples of
experiments include changing the experience of the mobile application
used by delivery agents, updating routing and navigation, or
using new algorithms to pick the places where agents drop off
packages.</p>
      <p>To quantify the effect of an experiment, the subjects of the
experiment are split into Control (C) and Treatment (T) groups,
and the goal is to make these splits as fair as possible to have
unbiased, robust experimentation. The quality of the split is
measured on three factors. Bias: whether there is a difference in the
target variable between the groups before applying the treatment.
Power: the ability to detect small effects of the applied treatment.
Mixing: the number of control instances that experience
treatment effects, and vice versa. Ideally, we want to maximize power
while minimizing bias and mixing.</p>
      <p>Making mistakes in experimentation is costly because of
rollbacks, developer and scientist time spent in deep dives, and
lost opportunity. Raising the bar with resampling tests adds
value through informed decision making and rework prevention.
With this context, we present the motivation, then the
experiment design, followed by the experiment results, and finally a
real-world application with ideas for more applications. Our
contribution includes presenting a case for the permutation test and
showing how it can be made scalable in a real-world scenario.</p>
    </sec>
    <sec id="sec-2">
      <title>MOTIVATION: T-TEST SIGNIFICANCE TESTS FAIL AT SCALE</title>
    </sec>
    <sec id="sec-5">
      <title>What makes a good Randomization Strategy</title>
      <p>In the design of this methodology, we decided to measure the
quality of the split using three types of metrics.</p>
      <p>
        (1) Unbiasedness: There will always be some differences
between Control and Treatment, but we don’t want to
declare them statistically significant unless they result from
the application of our treatment.
(2) Power: the ability to detect small effects of the applied
treatment.
(3) Mixing: the amount of spill of control into treatment and
vice versa.
      </p>
      <p>Ideally, there would be negligible pre-treatment differences in the
distributions of a business metric like delivery time across control
and treatment groups. Minute lifts or drops post-treatment would
be detectable. There would be no spilling of control into treatment
or vice versa.</p>
    </sec>
    <sec id="sec-6">
      <title>Why not use a t-test?</title>
      <p>The t-test is a parametric statistical test for determining whether
two samples were drawn from different underlying distributions
[4]. It is a standard default approach for most experimentation.
As a parametric test, it has several underlying assumptions that
are frequently violated in practice.</p>
      <p>• Assumption of normal distribution: The t-test uses
summary statistics of the two samples to fit a known
distribution to each sample. These two distributions are then
compared for overlap to determine the probability that
the two samples could have been drawn from the same
underlying distribution. Standard t-test approaches fit a
Gaussian (normal) distribution. However, the actual data
may not be normally distributed. For example, Figure 1
shows that the target variable in our experiment exhibits a
heavy-tailed distribution, and we observe that the
matching Gaussian does not imitate the true distribution well.
Many practitioners have asserted that, due to the central
limit theorem, a non-normal distribution will
approximate a Gaussian as the number of samples increases [7].
However, this is only true if you perform certain clever
transformations of the data. Sampling more from a
heavy-tailed distribution will not produce data that is normally
distributed.</p>
      <p>• Assumption of independence: The t-test assumes
independence of data points. This assumption is frequently
violated in real-world scenarios. In our experiment,
delivery time per package is not independent for mixed stops:
both control and treatment addresses may be served
in the same stop, so the treatment effect
is attributed to both buckets, clearly violating the
independence assumption. Additionally, delivery time in prior
stops could affect the delivery time of subsequent stops.</p>
      <p>• Assumption of specific knowledge: Which t-test will
you pick? If the end user is not a scientist, they would find
it difficult to pick the right type of t-test. Expecting users to
have specific knowledge reduces adoption, so this method
doesn’t scale. For example, in the Python Scikit-learn
implementation [5], the t-test requires making decisions on
whether tests are:
(1) One-sample or two-sample.
(2) One-sided or two-sided.
(3) Paired or unpaired (for two-sample tests).
(4) Homoscedastic (equal variance assumption) or
heteroscedastic (for two-sample tests).
(5) Fixed significance level (boolean-valued) or returning
p-values.
      </p>
      <p>To test how the t-test works with our Amazon delivery data,
we ran an A/A experiment that tested 1,000 different potential
C/T splits to see if there was an a priori significant difference
between the two groups (no treatment was applied, so the two
groups should be the same). At the end of our simulation we
found that the t-test almost always declared significance when it
should not have. The column “% Significant Runs based on t-test
(p &lt; 0.01)” in Table 1 shows how often it declared significance
with p &lt; 0.01. If the test had worked, it would be around 1%, as
seen in Table 2. Such false positives could mislead us into thinking
that there was a significant impact when there was no change.
This shows that a t-test is not reliable when its assumptions are
violated.</p>
      <p>2.3.1 Permutation Test. Statistical significance tests are
intended to determine whether two sample datasets were drawn
from different underlying distributions. The t-test is a parametric
test that does this by making assumptions about the sample
distribution to estimate the probability that the two samples
were drawn from the same underlying distribution. There
is also a nonparametric approach, which calculates
these distributions directly from the data and estimates their
overlap directly. This is an established methodology known as a
permutation test, with related formulations including
bootstrapping and Monte Carlo tests [2].</p>
      <p>To understand how permutation tests work, we first assume
that the null hypothesis is true [3]. In that case the C and T
assignment of a given measurement is interchangeable because the
treatment had no effect. Therefore we can directly calculate the
distribution of differences between C and T under
the null hypothesis by randomly sampling C and T groups from
the dataset and calculating the difference between the summary
statistics of that assignment. We can then compare the true C/T
difference to the differences under the null hypothesis to
determine whether the true C/T difference was significant. Pseudocode
for the approach follows.</p>
      <preformat>Given data D with original assignment C_0 and T_0,
summary statistic S, p-value threshold p, and desired number of
resampled permutations R
for i in 1..R do
    Select C_i and T_i from D;
    Add S(C_i) − S(T_i) to null distribution N
end
Calculate the percentile rank r of S(C_0) − S(T_0) in N. If
r/100 &lt; p/2 or r/100 &gt; 1 − p/2 (for a two-sided test) then
the treatment had a significant effect.</preformat>
      <p>Algorithm 1: Algorithm for Permutation Test</p>
      <p>2.3.2 Scalability. While this test is simple in theory,
historically it was not frequently used due to the expense of performing
the permutations. However, thanks to advances in computational
capabilities, we have implemented the permutation test in a
distributed library using Dask [6] and have successfully used it on
multi-GB datasets involving millions of measurements.</p>
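      <p>The permutation test of Algorithm 1 can be sketched in a few lines of single-machine Python. This is an illustrative sketch and not the distributed Dask implementation: the function name permutation_test, its parameters, and the synthetic data are assumptions, and it permutes individual measurements rather than address-level assignments.</p>
      <preformat>
```python
import random
from statistics import fmean


def permutation_test(control, treatment, statistic=fmean, n_permutations=1000,
                     alpha=0.05, seed=0):
    """Two-sided permutation test on statistic(control) - statistic(treatment).

    Returns (observed difference, percentile rank in the null distribution, significant?).
    """
    rng = random.Random(seed)
    observed = statistic(control) - statistic(treatment)
    pooled = list(control) + list(treatment)
    n_control = len(control)

    null_distribution = []
    for _ in range(n_permutations):
        # Under the null hypothesis the C/T labels are interchangeable, so reshuffle them.
        rng.shuffle(pooled)
        resampled_c, resampled_t = pooled[:n_control], pooled[n_control:]
        null_distribution.append(statistic(resampled_c) - statistic(resampled_t))

    # Percentile rank (0 to 100) of the observed difference within the null distribution.
    rank = 100.0 * sum(observed > d for d in null_distribution) / n_permutations
    significant = alpha / 2 > rank / 100 or rank / 100 > 1 - alpha / 2
    return observed, rank, significant


# Synthetic example: treatment deliveries are 3 seconds slower than control.
control = [10.0 + (i % 5) for i in range(200)]
treatment = [13.0 + (i % 5) for i in range(200)]
print(permutation_test(control, treatment))
```
      </preformat>
      <p>Any summary statistic can be passed in place of the mean, which is what makes the test independent of the shape of the underlying distribution.</p>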
      <p>One step in selecting a randomization
strategy is the choice of key to split on. We index the data by
granularity key to promote fast joins using Dask’s DataFrames. A
Dask DataFrame is a large parallel DataFrame composed of many
smaller Pandas DataFrames, split along the index. Batching the
data into multiple mini-batches, the number of which depends on the
number of available processor cores, ensures complete utilization of
processing power and accelerates computation. Data that has been
processed is dropped, leaving behind a summary generated
by a summarization function such as the mean. Arbitrary
summarization functions are enabled through an abstract class. The code is
heavily parallelized, yet it can run in a few minutes on a laptop on
a dataset with millions of rows.</p>
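      <p>The idea of dropping processed data and keeping only a summary behind an abstract summarization interface might look like the sketch below. The class and function names (Summarizer, MeanSummarizer, batches, summarize) are hypothetical and not taken from the library described above, and the sketch runs on a single process in plain Python rather than on Dask.</p>
      <preformat>
```python
from abc import ABC, abstractmethod


class Summarizer(ABC):
    """Hypothetical interface: consume mini-batches, keep only a small running summary."""

    @abstractmethod
    def update(self, batch):
        ...

    @abstractmethod
    def result(self):
        ...


class MeanSummarizer(Summarizer):
    """Running mean; each batch can be discarded once it has been folded in."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch):
        self.total += sum(batch)
        self.count += len(batch)

    def result(self):
        return self.total / self.count


def batches(values, batch_size):
    """Yield consecutive mini-batches of at most batch_size values."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def summarize(values, summarizer, batch_size):
    for batch in batches(values, batch_size):
        summarizer.update(batch)  # only the summary survives the batch
    return summarizer.result()


delivery_times = [12.0, 15.0, 9.0, 30.0, 14.0, 10.0, 8.0]
print(summarize(delivery_times, MeanSummarizer(), batch_size=3))
```
      </preformat>
      <p>Another statistic, such as a rate, would be added by subclassing the same interface without changing the batching loop.</p>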
      <p>2.3.3 Implementation. We randomly split all data points into
C/T buckets, allocating 50% to each bucket. We used each group’s
mean delivery time as our summary statistic. We permuted the
treatment assignments of our addresses 1,000 times. We assume
that with 1,000 permutations we can effectively approximate
the true distribution that we would obtain from testing every
possible permutation of the data (which would take a long time,
since there are on the order of tens of millions of events in our data).</p>
      <p>As we can see in Table 2, the permutation test performs as
expected (by definition), unlike the parametric t-test, whose
assumptions are violated by our data.</p>
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENT DESIGN</title>
    </sec>
    <sec id="sec-8">
      <title>Picking a randomization strategy</title>
      <p>Controlled experiments, in which the experiment set is split
into a control set and a treatment set and the
effect of treatment on the target variables is then compared, are the gold standard for
experimentation. A critical decision for randomization between
C and T is the choice of key that is used for splitting. In general,
the advice has been to make that key as granular as possible for
proper randomization (e.g., every delivery, every page load for
ads, every session for a customer visiting a website), keeping in
mind customer experience and system limitations. In the next
section, we describe a methodology for evaluating different keys
so that the randomization key can be selected to
obtain a fair split.</p>
      <p>Previously, in order to find a random split, we would choose
a level of granularity that seemed intuitively correct, and then
test for pre-existing differences. If there was a
pre-existing difference, we would re-roll the dice, generate
a new split, and iterate until a fair split was found.
For example, if we were planning an experiment in week 27, we
would use week 26 data and repeatedly select a random C/T set
until one was found with no statistically significant
difference between C and T. Imagine we iterated 6 times
before finding this split. If we used week 25 data, would we have
found that this split was fair? If we needed to draw 6 times
before finding a good split, it is improbable that this splitting
methodology produces splits whose fairness is stable.
What does that mean for week 27, our experimental period? We
need a better methodology for choosing a granularity that is
likely to produce fair splits from the start.</p>
      <p>We apply the permutation test methodology described above
to the different potential splitting keys available to us
for an experiment that changes the delivery locations vended for
package delivery. We judge the efficacy of the result based on
the criteria described in Section 2.1 (“What makes a good Randomization Strategy”).</p>
    </sec>
    <sec id="sec-9">
      <title>Choice of key</title>
      <p>We had a choice of randomizing by the following keys, in
decreasing order of granularity:
• Building: The full normalized string address, e.g., 425 106th
Ave NE, Anytown, State 12345. It is a grouping context,
typically referring to units in a building or complex.
Figure 2 shows delivery events colored by Building Id, with a
random C/T split. Every pixel represents a hypothetical
delivery event. When colored by control and treatment,
the figure shows how a split would look. Assume the red
group is control, and the blue group is treatment.
• Streets: The name of the street, e.g., 106th Ave NE. Figure
3 shows delivery events colored by Street, with a random
C/T split. Every pixel represents a hypothetical delivery
event. When colored by control and treatment, the figure
shows how a split would look. Assume the red group is
control, and the blue group is treatment.
• Postal Codes: The zip code, e.g., 98004. Imagine a city split
by zip codes with half of them in control and the other
half in treatment.
• Delivery Stations: The site where packages are received,
sorted and prepared for delivery. This is the origin for all
deliveries planned for a particular route taken for package
delivery, e.g., DS-xxx.</p>
      <p>Our key metric for this experiment is Delivery Time. Delivery
time is defined as the time taken to deliver packages at a particular
stop/address. It includes the time to park at the stop, find packages
in the vehicle, walk to and from the delivery point, and hand
over the packages. Pointing delivery agents to the right location to drop
off the package is optimized by Delivery Point models, on which
this strategy was first applied.</p>
      <p>When addresses are split between C/T for each of these keys,
we need a metric to measure the desirability of the split. Among
these strategies, there are two general risks that make
experiments ineffective.</p>
      <p>• Mixing: In geospatial experiments, effects are frequently
geospatially correlated, so finer granularity results in a
larger portion of data with mixed control/treatment effects.
When the unit of randomization is fine-grained, say at the
address level, neighboring houses could be in
different splits, one in C and the other in T. In this scenario,
when a delivery agent stops the van to drop off a package
for one of these houses, they also drop off a package at the
neighboring house in the same stop, thus leaking the
benefit of the vehicle stop time for the treatment house into
the control house, and vice versa. Intuitively, this should
be a frequently occurring problem when the unit of
randomization is very fine-grained. In Figures 2 and 3, you
can look at the overlap between the red and blue pixels
and notice that in the building-based split the overlap
is much more frequent, whereas for streets the overlap
occurs mostly at street intersections, which are far fewer in number
than buildings.
• Biased Selection: To avoid the mixing problem, we could
consider a coarse granularity like postal codes. Mixing
would then happen only in the rare cases where a
stop is on the edge between multiple postal codes.
However, in selecting postal codes, we introduce biases in our
business metrics like Delivery Time. For example, in
Seattle the neighborhood of Queen Anne is more spread out
than the densely populated neighborhood of South Lake
Union, which means that on a metric like delivery time
we would expect widely varying results even before we
apply our treatment, thus polluting the experiment.</p>
      <p>First, we want to choose a randomization strategy that
does not have a high probability of significant differences in A/A
testing on historical data. In addition, we want to choose a
randomization strategy that minimizes the impact of mixing on our
detection of effects that are significant to our application. In
our case, we want to choose a split that is able to detect important
changes in spite of the mixing and bias effects. To estimate this, we
apply the minimum important change to real
historical delivery data and then apply the permutation test to
see whether the change is detected as statistically significant.</p>
      <p>Data. To make this concrete, we used actual delivery stops
across all of the US. Every row in this joined dataset corresponds
to a package delivery event to an address. We aggregated all
deliveries made to an address at the same entry time by the same
delivery agent, into one delivery time data point. Building ids
were used to determine house number and street name of the
addresses. We then picked one strategy at a time, and repeated
the process for each of them, producing cumulative metrics for
every day of data.</p>
      <p>Control/Treatment Split. Streaming this data, we randomly
allocated every address to the control or treatment group with
a 50/50 split based on the strategy being evaluated. For example, if we were
splitting by postal codes, we allocated, say, every address within
98101 to C and every address within 98108 to T. The allocation of an address
to a group varied per simulation. To account for the variance in
this bucketing method, we ran 1,000 simulations of allocating an
address to a C/T group. Therefore, the same address may be in
the control group in one simulation and in treatment in another.
This is an integral part of using a permutation test, where a large
sample of possible permutations of allocation is considered.</p>
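      <p>The key-level allocation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function assign_groups, the event tuples, and the use of the simulation index as a random seed are hypothetical choices for this example, not the production implementation.</p>
      <preformat>
```python
import random


def assign_groups(keys, seed):
    """Randomly assign each distinct key to 'C' or 'T', half of the keys to each group."""
    unique_keys = sorted(set(keys))
    rng = random.Random(seed)
    rng.shuffle(unique_keys)
    half = len(unique_keys) // 2
    return {key: ("C" if half > position else "T")
            for position, key in enumerate(unique_keys)}


# Hypothetical delivery events: (address, postal_code).
events = [("a1", "98101"), ("a2", "98101"), ("a3", "98108"),
          ("a4", "98004"), ("a5", "98052"), ("a6", "98108")]

# One allocation per simulation; the same address can land in C in one run and T in another.
for simulation in range(3):
    groups = assign_groups([postal_code for _, postal_code in events], seed=simulation)
    print(simulation, [(address, groups[postal_code]) for address, postal_code in events])
```
      </preformat>
      <p>Because the group is looked up through the key, every address that shares a key (here, a postal code) always lands in the same group within a simulation; switching strategy only changes which column is used as the key.</p>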
      <p>Delivery Time Calculation. To calculate delivery time for a
delivery event, we took the total vehicle stop time in seconds
and divided it by the number of packages delivered at that stop. If
a stop contained addresses that were mixed between control and
treatment, we made a note of it, and split the delivery time in
the appropriate ratio (if out of 5 packages 3 were for a control
address then we split the time as 60:40 between C/T).</p>
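      <p>As a worked example of this attribution rule, the sketch below splits one stop's time between C and T in proportion to package counts. The function name split_stop_time, the returned field names, and the 100-second stop time are assumptions chosen for illustration.</p>
      <preformat>
```python
def split_stop_time(stop_seconds, package_groups):
    """Attribute one stop's time to C and T in proportion to package counts.

    package_groups holds one 'C' or 'T' label per package delivered at the stop.
    """
    n_packages = len(package_groups)
    n_control = package_groups.count("C")
    n_treatment = n_packages - n_control
    return {
        "per_package_seconds": stop_seconds / n_packages,
        "control_seconds": stop_seconds * n_control / n_packages,
        "treatment_seconds": stop_seconds * n_treatment / n_packages,
        "is_mixed": n_control > 0 and n_treatment > 0,
    }


# The example from the text: 3 of 5 packages go to control addresses, so time splits 60:40.
print(split_stop_time(100.0, ["C", "C", "C", "T", "T"]))
```
      </preformat>
      <p>The is_mixed flag corresponds to the note taken for stops that contain both control and treatment addresses, which is what the mixing percentage is computed from.</p>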
      <p>Distribution of Target Variable. We made no assumptions
about the distribution of the target random variable in either
split.</p>
    </sec>
    <sec id="sec-10">
      <title>EXPERIMENT RESULTS</title>
      <p>After running 1,000 simulations over tens of millions of
shipments and calculating the mean delivery time over each of the
permuted C/T splits, we analyzed the results based on the criteria for
a good randomization strategy listed in Section 2.1.</p>
    </sec>
    <sec id="sec-11">
      <title>Power &amp; Mixing</title>
      <p>Walking through Table 3, the first column is the randomization
strategy in increasing order of granularity. As there are a lot more
houses than postal codes, the number of distinct groups increases
(finer granularity) as we go downwards. Since there was no
applied difference between Control and Treatment, any difference
in their means is entirely due to chance, as seen by the tiny
difference noted in the Difference (C-T) column. But there can be some
variation due to random assignments to different groups, and the
95% confidence interval in the next column shows how
wide this difference can be. Thus we can only declare statistically
significant those differences that are larger than this interval.
Differences smaller than this interval may still be significant
to the business, but we may not be able to detect
them if we pick the wrong strategy. This is a distribution
resampling method [2], which does not make any assumptions
about the shape of the distributions.</p>
      <p>From the granularity of split, we note that because the last 2 rows
have many more distinct groups, they are sensitive enough to
detect smaller differences, performing better on the sensitivity
metric. If we randomize by Street, we could detect an
improvement or deterioration of greater than 0.13 seconds.</p>
      <p>If there were too much mixing (e.g., 50% of stops), we wouldn’t
see any difference in the delivery times for C vs. T, limiting our
statistical power to detect a real difference. As expected, there
was no mixing at the Delivery Station level, while it increased
with increased granularity. The 7.03% mixing at Building id makes
sense because delivery agents often deliver to adjacent buildings
in one stop, and 7.03% of stops overlap between C/T in
our dataset.</p>
      <p>Notice that although the Building id has the finest granularity,
it does not yield the most sensitive confidence interval in Table 3.
This is because 7.03% of the stops experienced mixing of control
and treatment groups.</p>
    </sec>
    <sec id="sec-13">
      <title>Statistical Significance, and the effects of Mixing</title>
      <p>
        We wanted to confirm that we would be able to detect actual
differences regardless of mixing. Suppose the treatment reduces
delivery time but, because of mixing, that reduction also applies
to control addresses in the same stop. To test this pre-experiment,
we pretended that treatment resulted in the smallest important
improvement in delivery time. We did this by artificially reducing
the delivery time in the treatment set after the split. In order to
maximize the effect of mixing, we also extended that
improvement to any control address within the same stop (if it was a
mixed C/T stop). Then we ran the permutation test as normal to
see if we could detect this improvement. We hoped to find that,
even under large amounts of simulated mixing:
(1) We can still detect a sizable difference between the T &amp; C
averages.
(2) We find a significant difference under random divisions
of T &amp; C (exceeds the 95% confidence interval).
      </p>
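      <p>The simulated improvement with worst-case spillover can be sketched as below. The event layout (dicts with stop_id, group and delivery_time), the function name apply_simulated_effect, and the 2-second effect are hypothetical; the adjusted events would then be passed to the permutation test as usual.</p>
      <preformat>
```python
def apply_simulated_effect(events, effect_seconds):
    """Return a copy of events with the simulated improvement applied.

    Each event is a dict with 'stop_id', 'group' ('C' or 'T') and 'delivery_time'.
    """
    treated_stops = {e["stop_id"] for e in events if e["group"] == "T"}
    adjusted = []
    for e in events:
        # Every event at a stop that contains a treatment address improves, including
        # control events at mixed stops (the worst case for mixing).
        improved = e["stop_id"] in treated_stops
        new_time = e["delivery_time"] - effect_seconds if improved else e["delivery_time"]
        adjusted.append({**e, "delivery_time": new_time})
    return adjusted


events = [
    {"stop_id": 1, "group": "C", "delivery_time": 30.0},  # mixed stop
    {"stop_id": 1, "group": "T", "delivery_time": 30.0},
    {"stop_id": 2, "group": "C", "delivery_time": 30.0},  # control-only stop
    {"stop_id": 3, "group": "T", "delivery_time": 30.0},  # treatment-only stop
]
print([e["delivery_time"] for e in apply_simulated_effect(events, effect_seconds=2.0)])
```
      </preformat>
      <p>Only the control event at the control-only stop keeps its original time, so the more stops are mixed, the smaller the C/T gap that remains to be detected.</p>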
      <p>We found that we could detect this difference with Street and
Building id splits, but couldn’t detect it with Postal Code and Delivery
Station splits. This ruled out Postal Code and Delivery Station, because even if
an experiment worked well, choosing these units wouldn’t give
us confidence that it did. Between Street and Building id,
we noted that Street had the highest sensitivity and the least mixing, and
could clearly detect a change with a small confidence interval.
As a result, we chose Street for this dataset from the US.</p>
      <p>The first version of this analysis was done to validate whether we could
come up with a good methodology that is statistically correct,
repeatable, and applicable to available data. At Last Mile,
having found a useful application with Delivery Time estimates,
we are now applying this to other business metrics that indicate
customer satisfaction.</p>
      <p>We have used this approach to make decisions about
experiments in the US and a number of other countries, and
with the help of this tool we can rapidly measure the effectiveness of those
experiments. The next sub-section highlights one example of the
application of this method to a recent successful experiment. We
can now extend this method to any experiment,
find the smallest important difference that can be declared significant, and
confidently apply it. This approach has been used in various
geospatial experiments at Amazon and is extensible to almost
any other metric and use case due to its nonparametric nature.</p>
    </sec>
    <sec id="sec-14">
      <title>Application to Production Experiments</title>
      <p>While A/A tests on historical data are helpful, the only way to
truly tell if a treatment helped the customer is to launch it and
do a controlled test to estimate the treatment effect.</p>
      <p>Table 5 shows the results of an experiment launch for a large
country. Control and Treatment have an even 50/50 split. The
first column is the business metric we care about. Here, we only
mention the type to preserve business confidentiality. The next
column is the difference in Control and Treatment means for each
of these metrics. The rank is the rank of the difference in the
known range. The “Is Significant?” column shows whether or not
the effects observed are statistically significant. As we can see, the
nonparametric permutation test has been successfully applied
to various metrics with different qualities, including continuous
metrics, binary metrics, and rates/percentages, without changes
in formulation.</p>
    </sec>
    <sec id="sec-15">
      <title>Generalizability of the Idea</title>
      <p>The permutation test and granularity selection approaches we
propose here are general and readily expand to other domains
and data types. As a nonparametric test it is straightforward to
apply and simplifies the experimental pipeline so non-scientists
can do experiments easily and count on the robustness of the
statistical results.</p>
      <p>5.3.1 Outliers with Outsize Impact. In general, this method
can be used for distributions that are heavy-tailed or contain
significant outliers, or (especially) where independence
assumptions are violated frequently enough for t-tests to fail (which
can be determined by A/A tests on random splits prior to the
experiment). As an example, a common issue in retail is that a
popular item can disproportionately drive outcome metrics like
number of sales or clicks. The same concept applies to many
other applications. Because the permutation test will randomly
assign this popular item to C or T over multiple permutations,
it will account for the fact that a large portion of the expected
difference between C and T is due to only one item, resulting in
a more accurate assessment of whether the treatment difference
is significant or simply the result of one popular item leading the
metrics astray.</p>
      <p>5.3.2 Non-Normal Distributions. Because the permutation
test is nonparametric, it makes no assumptions about the
underlying qualities of the distribution. While we have illustrated a
heavy-tailed distribution here, we have also successfully applied
this test to other types of distributions, including binary/bimodal
and standard normal distributions. This means we do not have
to worry about characterizing our underlying distribution before
applying a significance testing methodology, which
simplifies the experimental process considerably.</p>
    </sec>
    <sec id="sec-16">
      <title>6 CONCLUSION</title>
      <p>Careful and correct experimentation is key to making the right
decisions for systems and organizations. While our computational
tools have become more powerful, our significance analysis has
generally stagnated with the t-test. As we have shown,
appropriately applying the t-test is a nontrivial problem (especially
if sophisticated scientific expertise is not available), and the
t-test can be shockingly wrong when its assumptions are violated.
Our results show that it is time for us to re-evaluate the use of
nonparametric methods like the permutation test that were
previously computationally intractable for most big data use cases.
This is particularly important as we work to simplify the
process of experimentation and open experimental tools to a larger
audience.</p>
      <table-wrap>
        <label>Table 5</label>
        <table>
          <thead>
            <tr><th>Difference T-C</th><th>Rank</th><th>Is Significant?</th></tr>
          </thead>
          <tbody>
            <tr><td>0.023</td><td>98.2</td><td>YES</td></tr>
            <tr><td>(0.01032)</td><td>0.0</td><td>YES</td></tr>
            <tr><td>0.000005</td><td>74.3</td><td>NO</td></tr>
            <tr><td>2.31%</td><td>100</td><td>YES</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In addition to showing how this nonparametric test can be
used on a variety of different metrics and use cases in a big
data setting, we also show how it can be used to inform other
experimental choices such as the choice of split granularity. We
provide a decision framework for these types of experimental
choices and explore the use of this framework in practice.</p>
    </sec>
    <sec id="sec-17">
      <title>ACKNOWLEDGEMENTS</title>
      <p>We would like to thank our managers: Amber Roy Chowdhury,
for doing a thorough review of our paper and suggesting edits, and
Sanjay Kumar and Umar Farooq, for their support and guidance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Don</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Amazon is the fourth-largest US delivery service and growing fast</article-title>
          . https://www.digitalcommerce360.com/2020/05/26/amazon-is-the-fourth%E2%80%91largest-us-delivery-service-and-growing-fast/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Efron</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Tibshirani</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <source>An Introduction to the Bootstrap</source>
          . New York, NY: Chapman &amp; Hall/CRC. Monographs on Statistics &amp; Applied Probability
          (
          <year>1994</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Ronald Aylmer</given-names>
            <surname>Fisher</surname>
          </string-name>
          .
          <year>1936</year>
          .
          <article-title>Design of experiments</article-title>
          .
          <source>Br Med J</source>
          <volume>1</volume>
          ,
          <issue>3923</issue>
          (
          <year>1936</year>
          ),
          <fpage>554</fpage>
          -
          <lpage>554</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yuan</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A comparative review of methods for comparing means using partially paired data</article-title>
          .
          <source>Statistical Methods in Medical Research</source>
          <volume>26</volume>
          (
          <year>2017</year>
          ),
          <fpage>1323</fpage>
          -
          <lpage>1340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          , Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          , Ron Weiss,
          <string-name>
            <given-names>Vincent</given-names>
            <surname>Dubourg</surname>
          </string-name>
          , et al.
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          , Oct
          (
          <year>2011</year>
          ),
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Rocklin</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Dask: Parallel Computation with Blocked algorithms and Task Scheduling</article-title>
          .
          In
          <source>Proceedings of the 14th Python in Science Conference</source>
          , Kathryn Huff and James Bergstra (Eds.)
          .
          <fpage>130</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Eugene</given-names>
            <surname>Seneta</surname>
          </string-name>
          et al.
          <year>2013</year>
          .
          <article-title>A tricentenary history of the law of large numbers</article-title>
          .
          <source>Bernoulli</source>
          <volume>19</volume>
          ,
          <issue>4</issue>
          (
          <year>2013</year>
          ),
          <fpage>1088</fpage>
          -
          <lpage>1121</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>