                          What’s Mine is Yours, What’s Yours is Mine
                                              Simplifying Significance Testing With Big Data

                 Karan Matnani                                             Valerie Liptak                              George Forman
             Last Mile Tech, Amazon                                   Last Mile Tech, Amazon                        Last Mile Tech, Amazon
             kmatnani@amazon.com                                        liptav@amazon.com                           ghforman@amazon.com

ABSTRACT
At Amazon Last Mile, we deliver over 3.5 billion [1] packages every year, making us one of the largest delivery companies in the world. At this scale, even small changes can have a big business impact. In general, business impact is assessed using controlled experimentation. A standard approach to evaluating whether a controlled experiment resulted in a significant change has been to use a t-test. However, despite our scale, the law of large numbers fails to produce a normal distribution, and the t-test fails up to 99% of the time. In addition to exhibiting non-normal distributions, our application has restrictions on the granularity of control and treatment group splits, and it also suffers from geospatial correlation, which causes treatment effects to be applied across both control and treatment groups at finer granularities (e.g., delivery of packages to multiple homes by the same delivery agent in one stop; a building falling on different routes on two different days depending on the other stops on those days). This introduces a tradeoff between separability of effects through coarse granularity and detection of smaller treatment effects with fine granularity. In this paper, we solve the t-test dilemma using a resampling test at scale, and we further leverage this test to create a scalable, repeatable methodology for choosing the randomization split granularity under these constraints. We produce a sensitivity-optimized randomization strategy using a data-driven approach that has been applied successfully within multiple real experiments at Amazon Last Mile Tech and is generalizable to any experiment.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   INTRODUCTION
At the scale of billions of packages delivered annually, small changes can have a substantial effect on customer experience. Principled experimentation drives the business in the right direction. At Amazon Last Mile, changes affecting customers are rolled out based on the results of a controlled experiment. Examples of experiments include changing the experience of the mobile application used by delivery agents, updating routing and navigation, or using new algorithms to pick the places where packages are dropped off.
   To quantify the effect of an experiment, the subjects of the experiment are split into Control (C) and Treatment (T) groups, and the goal is to make these splits as fair as possible to enable unbiased, robust experimentation. The quality of the split is measured on three factors. Bias: whether there is a difference in the target variable between the groups before applying the treatment. Power: the ability to detect small effects of the applied treatment. Mixing: the amount of control instances that experience treatment effects, and vice versa. Ideally, we want to maximize power while minimizing bias and mixing.
   Making mistakes in experimentation is costly because of rollbacks, developer and scientist time spent in deep dives, and lost opportunity cost. Raising the bar with resampling tests adds value through informed decision making and rework prevention. With this context, we present the motivation, then the experiment design, followed by the experiment results, and finally a real-world application with ideas for further applications. Our contribution includes presenting a case for the permutation test and showing how it can be made scalable in a real-world scenario.

2   MOTIVATION: T-TEST SIGNIFICANCE TESTS FAIL AT SCALE
2.1   What Makes a Good Randomization Strategy
In the design of this methodology, we decided to measure the quality of the split using three types of metrics.
   (1) Unbiasedness: There will always be some differences between Control and Treatment, but we don't want to declare them statistically significant unless they stem from the application of our treatment.
   (2) Power: the ability to detect small effects of the applied treatment.
   (3) Mixing: the amount of spill of control into treatment and vice versa.
Ideally, there would be negligible pre-treatment differences in the distributions of a business metric like delivery time across the control and treatment groups. Minute lifts or drops post treatment would be detectable. There would be no spilling of control into treatment or vice versa.

2.2   Why Not Use a t-Test?
The t-test is a parametric statistical test for determining whether two samples were drawn from different underlying distributions [4]. It is the standard default approach for most experimentation. As a parametric test, it has several underlying assumptions that are frequently violated in practice.

Table 1: t-test based significance on real delivery data. Actual figures replaced with order of magnitude.

   Randomization        Order of Magnitude      % Significant
   Unit (C/T)           of # Distinct Groups    Runs (p<0.01)
   Delivery Station     Hundreds                99.90%
   Postal Code          Thousands               97.80%
   Street               Millions                53.00%
   Buildings            Tens of Millions        79.50%

   • Assumption of normal distribution: The t-test uses summary statistics of the two samples to fit a known distribution to each sample. These two distributions are then compared for overlap to determine the probability that the two samples could have been drawn from the same underlying distribution. Standard t-test approaches fit a Gaussian (normal) distribution. However, the actual data may not be normally distributed. For example, Figure 1 shows that the target variable in our experiment exhibits a heavy-tailed distribution, and we observe that the matching Gaussian does not imitate the true distribution well. Many practitioners have asserted that due to the central limit theorem the non-normal distribution will approximate a Gaussian as the number of samples increases [7]. However, this is only true if you first perform certain clever transformations of the data.
     Simply sampling more from a heavy-tailed distribution will not produce normally distributed data.
   • Assumption of independence: The t-test assumes independence of data points. This assumption is frequently violated in real-world scenarios. In our experiment, delivery time per package is not independent for mixed stops: both control and treatment addresses may be delivered to in the same stop, so the attribution of treatment gets mixed into both buckets, clearly violating the independence assumption. Additionally, delivery time in prior stops could affect the delivery time of subsequent stops.
   • Assumption of specific knowledge: Which t-test will you pick? If the end user is not a scientist, they will find it difficult to pick the right type of t-test. Expecting users to have specific knowledge reduces adoption, so this method doesn't scale. For example, in the Python Scikit-learn implementation [5], the t-test requires deciding whether tests are:
      (1) One-sample or two-sample.
      (2) One-sided or two-sided.
      (3) Paired or unpaired (for two-sample tests).
      (4) Homoscedastic (equal-variance assumption) or heteroscedastic (for two-sample tests).
      (5) Fixed significance level (boolean-valued) or returning p-values.
To test how the t-test behaves on our Amazon delivery data, we ran an A/A experiment that tested 1,000 different potential C/T splits to see if there was an a-priori significant difference between the two groups (no treatment was applied, so the two groups should be the same). At the end of our simulation we found that the t-test almost always declared significance when it should not have. The column "% Significant Runs (p<0.01)" in Table 1 shows how often it declared significance with p < 0.01. If the test had worked, this figure would be around 1%, as seen in Table 2. Note that this could mislead us into thinking there was a significant impact when there was no change. This shows that a t-test is not reliable when its assumptions are violated.

Figure 1: Gaussian superimposed on the observed delivery time distribution. Both control (blue) and treatment (red) overlap and have a heavy-tailed distribution. By comparison, the Gaussian (green), or normal distribution, with the same mean and standard deviation is much different. The standard t-test approximates our heavy-tailed distribution with this Gaussian distribution.

2.3   A Better Method: Permutation Test (a Resampling Test)
   2.3.1 Permutation Test. Statistical significance tests are intended to determine whether two sample datasets were drawn from different underlying distributions. The t-test is a parametric test that does this by making assumptions about the sample distribution to estimate the probability that the two samples were drawn from the same underlying distribution. There is, however, a nonparametric approach we can take, which calculates these distributions directly from the data and estimates their overlap directly. This is an established methodology known as a permutation test, with various formulations including bootstrapping and Monte Carlo tests [2].
   To understand how permutation tests work, we first assume that the null hypothesis is true [3]. In that case the C and T assignment of a given measurement is interchangeable, because the treatment had no effect. Therefore we can directly calculate the distribution of differences between the C and T distributions under the null hypothesis by randomly sampling C and T groups from the dataset and calculating the difference between the summary statistics of each assignment. We can then compare the true C/T difference to the differences under the null hypothesis to determine whether the true C/T test was significantly different. Pseudocode for the approach follows.

   Given data 𝐷 with original assignment 𝐶_true and 𝑇_true, summary statistic 𝑆, significance level 𝑝, and desired number of resampled permutations 𝑅
   for 𝑖 in 1..𝑅 do
       Select 𝐶_𝑖 and 𝑇_𝑖 from 𝐷;
       Add 𝑆(𝐶_𝑖) − 𝑆(𝑇_𝑖) to null distribution 𝑁
   end
   Calculate the rank 𝑟 of 𝑆(𝐶_true) − 𝑆(𝑇_true) in 𝑁. If 𝑟/𝑅 < 𝑝/2 or 𝑟/𝑅 > 1 − 𝑝/2 (for a two-sided test), then the treatment had a significant effect.
      Algorithm 1: Algorithm for Permutation Test

   2.3.2 Scalability. While this test is simple in theory, historically it was not frequently used due to the expense of performing the permutations. However, due to advances in computational capabilities, we have implemented the permutation test in a distributed library using Dask [6] and have successfully used it on multi-GB datasets involving millions of measurements.
   One step described in the process of selecting a randomization strategy is the choice of key to split on. We index the data by the granularity key to promote fast joins using Dask's data frames. A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index.
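The core computation of Algorithm 1 can be sketched on a single machine with NumPy alone; the function and variable names below are illustrative rather than taken from our distributed library, which parallelizes the same permutation loop with Dask:

```python
import numpy as np

def permutation_test(values, is_treatment, stat=np.mean, r=1000, seed=0):
    """Two-sided permutation test: observed C/T difference and its p-value.

    values: 1-D array of the target metric (e.g., delivery time).
    is_treatment: boolean array, True where the row is in Treatment.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_treatment = np.asarray(is_treatment, dtype=bool)

    observed = stat(values[~is_treatment]) - stat(values[is_treatment])

    null = np.empty(r)
    for i in range(r):
        # Under the null hypothesis, C/T labels are interchangeable:
        # shuffle the labels and recompute the difference in the statistic.
        shuffled = rng.permutation(is_treatment)
        null[i] = stat(values[~shuffled]) - stat(values[shuffled])

    # Fraction of permuted differences at least as extreme as the observed
    # one (two-sided); the +1 terms avoid reporting an impossible p of 0.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (r + 1)
    return observed, p
```

The returned p-value is then compared against the chosen significance level (0.01 in our experiments); no assumption about the shape of the metric's distribution is needed.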
Batching the data into multiple mini-batches, where the number of batches is a multiple of the number of available processor cores, ensures full utilization of processing power, accelerating computation. Data that has been processed is dropped, leaving behind a summary generated by a summarization function, such as the mean. Arbitrary summarization functions are enabled through an abstract class. The code is heavily parallelized but can run in a few minutes on a laptop on a dataset with millions of rows.

Table 2: Permutation test based significance on real delivery data.

   Randomization        Order of Magnitude      % Significant
   Unit (C/T)           of # Distinct Groups    Runs (p<0.01)
   Delivery Station     Hundreds                1.0%
   Postal Code          Thousands               1.0%
   Street               Millions                1.0%
   Buildings            Tens of Millions        1.0%

   2.3.3 Implementation. We randomly split all data points into C/T buckets, allocating 50% to each bucket. We used each group's mean delivery time as our summary statistic. We permuted the treatment assignments of our addresses 1,000 times. We assume that with 1,000 permutations we can effectively approximate the true distribution that we would obtain from testing every possible permutation of the data (which would take a long time, since there are on the order of tens of millions of events in our data).
   As we can see in Table 2, the permutation test performs as expected (by definition), unlike the parametric t-test, whose assumptions are violated by our data.

3   EXPERIMENT DESIGN
3.1   Picking a Randomization Strategy
Controlled experiments, where the experiment set is split between a control and a treatment set and the effect of the treatment on the target variables is then compared, are the gold standard for experimentation. A critical decision for randomization between C and T is the choice of key that is used for splitting. In general, the advice has been to make that key as granular as possible for proper randomization (e.g., every delivery, every page load for ads, every session for a customer visiting a website), keeping in mind customer experience and system limitations. In the next section, we describe a methodology for evaluating different keys to provide a way to select the randomization key in order to obtain a fair split.
   Previously, in order to find a random split, we would choose a level of granularity that seemed intuitively correct and then test for pre-existing differences. If at that point there was a pre-existing difference, we would re-roll the dice, generate a new split, and iterate on this process until a fair split was found. For example, if we are planning an experiment in week 27, we would use week 26 data and repeatedly select a random C/T set until one was found with no statistically significant difference between C and T. Imagine we iterated 6 times before finding this split. If we had used week 25 data, would we have found that this split was fair? If we needed to choose 6 times before finding a good split, it is improbable that this splitting methodology produces splits whose fairness quality is stable. What does that mean for week 27, our experimental period? We need a better methodology for choosing a granularity that is likely to produce fair splits from the start.
   Using the permutation test methodology described above, we apply it to the different potential splitting keys available to us, for an experiment in changing the delivery locations vended for package delivery. We judge the efficacy of the result based on the criteria described in Section 2.1.

Figure 2: A random split of a part of a city visualized with hypothetical (not actual) delivery events colored by Building Id, split into C/T groups. Every colored pixel represents a delivery event. The red group is control and the blue is treatment. As we can see, it is possible for the same delivery stop to contain buildings in both C and T (mixing).

3.2   Choice of Key
We had a choice of randomizing by the following keys, in decreasing order of granularity:
   • Building: The full normalized string address, e.g., 425 106th Ave NE, Anytown, State 12345. It is a grouping context, typically referring to units in a building or complex. Figure 2 shows delivery events colored by Building Id, with a random C/T split. Every pixel represents a hypothetical delivery event. When colored by control and treatment, the figure shows how a split would look; assume the red group is control and the blue group is treatment.
   • Street: The name of the street, e.g., 106th Ave NE. Figure 3 shows delivery events colored by Street, with a random C/T split, rendered the same way.
   • Postal Code: The zip code, e.g., 98004. Imagine a city split by zip codes, with half of them in control and the other half in treatment.
   • Delivery Station: The site where packages are received, sorted, and prepared for delivery. This is the origin for all deliveries planned for a particular route taken for package delivery, e.g., DS-xxx.
Our key metric for this experiment is Delivery Time, defined as the time taken to deliver packages at a particular stop/address. It includes the time to park at the stop, find the packages in the vehicle, walk to and from the delivery point, and hand over the packages.
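The choice-of-key mechanics above can be illustrated with a small sketch: a deterministic 50/50 assignment keyed on whichever granularity we randomize by, so that all deliveries sharing a key land in the same bucket. The salted-hash scheme below is our own simplification for exposition, not the production allocation code:

```python
import hashlib

def assign_group(key_value: str, salt: str = "sim-0") -> str:
    """Deterministically assign one randomization unit (e.g., a street or
    a postal code) to Control ("C") or Treatment ("T") with a 50/50 split.

    Changing the per-simulation salt re-rolls the assignment, so the same
    unit can be Control in one simulation and Treatment in another.
    """
    digest = hashlib.sha256((salt + "|" + key_value).encode()).digest()
    return "C" if digest[0] % 2 == 0 else "T"

# Deliveries on the same street share one assignment within a simulation.
deliveries = [
    {"street": "106th Ave NE", "postal": "98004"},
    {"street": "106th Ave NE", "postal": "98004"},
    {"street": "Main St",      "postal": "98004"},
]
groups = [assign_group(d["street"]) for d in deliveries]
```

Splitting by a coarser key is the same call with `d["postal"]` as the key; every address in the zip code then shares one bucket.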
Pointing delivery agents to the right location to drop off the package is optimized by Delivery Point models, on which this strategy was first applied.

Figure 3: A random split of a part of a city visualized with hypothetical (not actual) delivery events colored by Street Name, split into C/T groups. Every colored pixel represents a delivery event. The red group is control and the blue is treatment. We see it is less likely that a stop would contain both C and T deliveries (mixing).

   When addresses are split between C/T for each of these keys, we need a metric to measure the desirability of the split. Among these strategies, there are two general risks that make experiments ineffective.
   • Mixing: In geospatial experiments, effects are frequently geospatially correlated, so finer granularity results in a larger portion of data with mixed control/treatment effects. When the unit of randomization is fine-grained, say at the address level, neighboring houses could be in different splits, one in C and the other in T. In this scenario, when a delivery agent stops the van to drop off a package for one of these houses, they also drop off a package at the neighboring house in the same stop, thus leaking the benefit of the vehicle stop time for the treatment house into the control house, and vice versa. Intuitively, this should be a frequently occurring problem when the unit of randomization is very fine-grained. In Figures 2 and 3, looking at the overlap between the red and blue pixels, we notice that in the building-based split the overlap is far more frequent, whereas for streets the overlap is mostly at street intersections, which are much fewer in number than buildings.
   • Biased Selection: To avoid the mixing problem, we could consider a coarse granularity like postal codes. The only time mixing would happen would be in the rare cases where a stop is on the edge between multiple postal codes. However, in selecting postal codes, we introduce biases in our business metrics like Delivery Time. For example, in Seattle the neighborhood of Queen Anne is more spread out than the densely populated neighborhood of South Lake Union, which means that on a metric like delivery time we would expect widely varying results even before we apply our treatment, thus polluting the experiment.
   Firstly, we want to choose a randomization strategy that does not have a high probability of significant differences in A/A testing on historical data. In addition, we want to choose a randomization strategy that minimizes the impact of mixing on our detection of effects that are significant to our application. So in our case we want to choose a split that is able to detect important changes in spite of the mixing and bias effects. To estimate this, we simulate it by applying the minimum important change to real historical delivery data and then applying the permutation test to see whether the change is detected as statistically significant.
   Data: To make this concrete, we used actual delivery stops across all of the US. Every row in this joined dataset corresponds to a package delivery event at an address. We aggregated all deliveries made to an address at the same entry time by the same delivery agent into one delivery-time data point. Building Ids were used to determine the house number and street name of the addresses. We then picked one strategy at a time and repeated the process for each of them, producing cumulative metrics for every day of data.
   Control/Treatment Split: Streaming this data, we randomly allocated every address into the control or treatment group with a 50/50 split based on the strategy being evaluated. So if we were splitting by postal codes, we allocated, say, every address within 98101 into C and every address within 98108 into T. The allocation of an address to a group varied per simulation. To account for the variance in this bucketing method, we ran 1,000 simulations of allocating addresses to C/T groups. Therefore, the same address may be in the control group in one simulation and in treatment in another. This is an integral part of using a permutation test, where a large sample of the possible permutations of the allocation is considered.
   Delivery Time Calculation: To calculate the delivery time for a delivery event, we took the total vehicle stop time in seconds and divided it by the number of packages delivered at that stop. If a stop contained addresses that were mixed between control and treatment, we made a note of it and split the delivery time in the appropriate ratio (if 3 out of 5 packages were for a control address, we split the time 60:40 between C and T).
   Distribution of Target Variable: We made no assumptions about the distribution of the target random variable across the two splits.

4   EXPERIMENT RESULTS
After running 1,000 simulations over tens of millions of shipments and calculating the mean delivery time over each of the permuted C/T splits, we analyzed the results based on the criteria for a good randomization strategy listed in Section 2.1.

4.1   Power & Mixing
Walking through Table 3, the first column is the randomization strategy in increasing order of granularity. As there are many more houses than postal codes, the number of distinct groups increases (finer granularity) as we go down the table. Since there was no applied difference between Control and Treatment, any difference in their means is entirely due to chance, as seen in the tiny values in the Mean difference (C − T) column.
Table 3: Experiment Results Table. Using the permutation test, none of the candidate randomization units have a high
probability of a significant difference in A/A testing on historical data. Note that in the finest granularity we can have a
great deal of mixed C/T effects. However, in the two finer granularities, we can also detect smaller differences between C
and T.


  Randomization Unit C/T        Mean difference (C - T)    95% Confidence Interval     % Stops with     Avg. Mixing
                                       (seconds)                  (seconds)             any mixing        per stop
  Delivery Station                       0.002                  [-1.93, 1.94]              0.00%            0.00%
  Postal Code                            0.006                  [-1.01, 1.03]              0.02%            0.00%
  Street                                 0.000                  [-0.13, 0.13]              0.78%            0.30%
  Building                               0.002                  [-0.15, 0.15]              7.03%            3.10%
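The confidence intervals in Table 3 come from repeatedly re-splitting the historical data at the chosen randomization unit. A minimal sketch of that resampling step, on synthetic data (the group counts, sample size, and gamma delivery-time distribution are illustrative assumptions, not production values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for historical data: one delivery time per stop,
# tagged with its randomization group (e.g. a street id).
n, n_groups = 50_000, 2000
groups = rng.integers(0, n_groups, size=n)
delivery_secs = rng.gamma(shape=2.0, scale=150.0, size=n)  # heavy-ish tail

def permutation_interval(values, groups, n_perm=1000, alpha=0.05):
    """95% range of C - T mean differences under random group-level splits."""
    unique = np.unique(groups)
    idx = np.searchsorted(unique, groups)   # map each row to its group slot
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Assign each *group* (not each row) to C or T, mirroring the
        # group-level randomization units of Table 3.
        to_t = rng.random(unique.size) < 0.5
        is_t = to_t[idx]
        diffs[i] = values[~is_t].mean() - values[is_t].mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

lo, hi = permutation_interval(delivery_secs, groups)
print(f"95% null interval for the C - T difference: [{lo:.2f}, {hi:.2f}] seconds")
```

An observed C - T difference falling outside the reported interval would then be declared significant at the 5% level.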



variation due to random assignments to different groups, and the 95% confidence interval in the next column shows how wide this difference can be. Thus we can only declare statistically significant differences that are larger than this interval. However, differences smaller than this interval may be significant to the business, but we may not be able to detect them if we pick the wrong strategy. This is a distribution resampling method [2], which does not make any assumptions about the shape of the distributions.

On the granularity of the split, we note that because the last two rows have many more distinct groups, they are sensitive enough to detect smaller differences, performing better on the sensitivity metric. If we randomize by Street, we could detect an improvement or deterioration of greater than 0.13 seconds.

If there were too much mixing (e.g. 50% of stops), we wouldn't see any difference in the delivery times for C vs. T, limiting our statistical power to detect a real difference. As expected, there was no mixing at the Delivery Station level, while mixing increased with increasing granularity. The 7.03% mixing at the Building id makes sense because delivery agents often deliver to adjacent buildings in one stop, and 7.03% of stops overlap between C and T in our dataset.

Notice that although the Building id has the finest granularity, it does not yield the most sensitive confidence interval in Table 3. This is because 7.03% of the stops experienced mixing of the control and treatment groups.

4.2 Statistical Significance and the Effects of Mixing

We wanted to confirm that we would be able to detect actual differences regardless of mixing. Let's say the treatment reduces delivery time, but because of mixing, that reduction also applies to control addresses in the same stop. To test this pre-experiment, we pretended that the treatment resulted in the smallest important improvement in delivery time. We did this by artificially reducing the delivery time in the treatment set after the split. In order to maximize the effect of mixing, we also extended that improvement to any control address within the same stop (if it was a mixed C/T stop). Then we ran the permutation test as normal to see if we could detect this improvement. We hoped to find that, even under large amounts of simulated mixing:
    (1) we can still detect a sizable difference between the T & C averages;
    (2) we find the difference significant under random divisions of T & C (it exceeds the 95% confidence interval).
We found that we could detect this difference with Street and Building ids, but couldn't detect it with Postal Code and Delivery Station splits. This ruled out the first two options, because even if an experiment worked well, choosing these units wouldn't give us the confidence that it did work. Between Street and Building id, we noted that Street had the highest sensitivity and the least mixing, and could clearly detect a change with a small confidence interval. As a result, we chose Street for this dataset from the US.

5 APPLICATION

5.1 Last Mile at Amazon

The first version of this analysis was done to validate whether we could come up with a methodology that is statistically correct, repeatable, and applicable to the available data. At Last Mile, having found a useful application with Delivery Time estimates, this is now being applied to other business metrics that indicate customer satisfaction.

We have used this approach, with the help of this tool, to rapidly make decisions about experiments in the US and a number of other countries and to measure the effectiveness of those experiments. The next sub-section highlights one example of the application of this method to a recent successful experiment. We can now expand this method to apply to any experiment, find the smallest important difference of significance, and confidently apply it. This approach has been used in various geospatial experiments at Amazon and is extensible to almost any other metric and use case due to its nonparametric nature.

5.2 Application to Production Experiments

While A/A tests on historical data are helpful, the only way to truly tell whether a treatment helped the customer is to launch it and do a controlled test to estimate the treatment effect.

Table 5 shows the results of an experiment launch for a large country. Control and Treatment have an even 50/50 split. The first column is the business metric we care about; here, we only mention the type to preserve business confidentiality. The next column is the difference in Control and Treatment means for each of these metrics. The rank is the rank of the difference in the known range. The "Is Significant?" column shows whether or not the effects observed are statistically significant. As we can see, the nonparametric permutation test has been successfully applied to various metrics with different qualities, including continuous metrics, binary metrics, and rates/percentages, without changes in formulation.

5.3 Generalizability of the Idea

The permutation test and granularity selection approaches we propose here are general and readily extend to other domains and data types. As a nonparametric test, it is straightforward to apply and simplifies the experimental pipeline so that non-scientists
Table 4: Effects of Mixing. Given a simulated important change, the change was detectable in two out of four options.
While the change was detected as significant, we can see that the greater level of mixing in the Building group impairs
our ability to detect fine changes as compared to using the street level of granularity.


 Randomization Unit C/T       % Stops with     95% Confidence Interval of      Real C – T difference     Improvement
                               any mixing      C - T difference (seconds)            (seconds)            Detected?
 Delivery Station                  0.00%             [-1.93, 2.47]                      0.883                 No
 Postal Code                       0.02%             [-1.03, 1.06]                      0.937                 No
 Street                            0.78%             [-0.13, 0.13]                      1.204                 Yes
 Building                          7.03%             [-0.15, 0.15]                      0.957                 Yes
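The simulated-mixing check of Section 4.2 can be sketched the same way: inject a known improvement into the Treatment stops (and into mixed-stop Control addresses), then ask whether the resulting C - T difference lands outside the null interval from random re-splits. All numbers below (a 5-second improvement, a 7% mixed-stop rate, normally distributed delivery times) are hypothetical stand-ins, not production values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 20k stops in 500 randomization groups; delivery
# times are a stand-in normal distribution, not real data.
n, n_groups = 20_000, 500
group = rng.integers(0, n_groups, size=n)
base_secs = rng.normal(300.0, 20.0, size=n)

# Group-level 50/50 split into Control / Treatment.
is_t = (rng.random(n_groups) < 0.5)[group]

# Inject a hypothetical smallest-important improvement into Treatment,
# and (to maximize mixing effects) into a random 7% of Control stops
# standing in for mixed C/T stops.
improvement = 5.0
mixed_c = (~is_t) & (rng.random(n) < 0.07)
observed = base_secs - improvement * (is_t | mixed_c)
real_diff = observed[~is_t].mean() - observed[is_t].mean()  # C - T

# Null distribution: re-split the groups at random many times.
null = np.empty(1000)
for i in range(1000):
    perm_t = (rng.random(n_groups) < 0.5)[group]
    null[i] = observed[~perm_t].mean() - observed[perm_t].mean()
lo, hi = np.quantile(null, [0.025, 0.975])
print(f"C - T difference {real_diff:.2f}s, null 95% interval [{lo:.2f}, {hi:.2f}]")
print("Improvement detected?", real_diff < lo or real_diff > hi)
```

If the injected improvement falls inside the null interval, that randomization unit is too coarse (or too mixed) to certify the change, which is exactly how the coarse options were ruled out above.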


Table 5: Sample 50% dial-up results: shows a statistically significant change in metrics #1, #2 and #4, and shows that metric #3 was not improved significantly. Note that even though some of these changes seem small, they were detected as significant by a t-test. Though these metrics have different types and some are rare events, we are able to run the nonparametric permutation test on all of them to correctly detect significance.


 Business Metric                     Difference T-C      Rank     Is Significant?     Normal Range (95%)
 Metric 1: a continuous number            0.023          98.2           YES            [-0.0212, 0.0223]
 Metric 2: a binary metric             -0.01032           0.0           YES            [-0.0057, 0.0058]
 Metric 3: a percentage metric         0.000005          74.3           NO             [-0.0012, 0.0013]
 Metric 4: a percentage metric            2.31%         100             YES            [-0.0035, 0.0038]
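Because the test only compares an observed mean difference against random relabelings of the pooled data, one code path covers continuous, binary, and percentage metrics alike, as Table 5 illustrates. A toy sketch (the distributions and effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def perm_test(c, t, n_perm=2000, alpha=0.05):
    """One nonparametric test for any metric type: compare the observed
    T - C mean difference against random relabelings of the pooled data."""
    observed = t.mean() - c.mean()
    pooled = np.concatenate([c, t])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)
        diffs[i] = pooled[len(c):].mean() - pooled[:len(c)].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return observed, (lo, hi), bool(observed < lo or observed > hi)

n = 10_000
# A continuous metric with an injected shift; a binary metric with a rate lift.
cont = perm_test(rng.gamma(2.0, 10.0, n), rng.gamma(2.0, 10.0, n) + 1.5)
binary = perm_test(rng.random(n) < 0.10, rng.random(n) < 0.13)
print("continuous metric significant?", cont[2])
print("binary metric significant?", binary[2])
```

Note that nothing in `perm_test` changes between the two calls; the metric's type only affects the data passed in.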



can do experiments easily and count on the robustness of the statistical results.

    5.3.1 Outliers with Outsize Impact. In general, this method can be used for distributions that are heavy-tailed or contain significant outliers, or (especially) where independence assumptions are violated frequently enough for t-tests to fail (which can be determined by A/A tests on random splits prior to the experiment). As an example, a common issue in retail is that a popular item can disproportionately drive outcome metrics like the number of sales or clicks. The same concept applies to many other applications. Because the permutation test will randomly assign this popular item to C or T over multiple permutations, it accounts for the fact that a large portion of the expected difference between C and T is due to only one item, resulting in a more accurate assessment of whether the treatment difference is significant or simply the result of one popular item leading the metrics astray.

    5.3.2 Non-Normal Distributions. Because the permutation test is nonparametric, we make no assumptions about the underlying qualities of the distribution. While we have illustrated a heavy-tailed distribution here, we have also successfully applied this test to other types of distributions, including binary/bimodal and standard normal distributions. This means we do not have to worry about characterizing our underlying distribution before applying a significance testing methodology, which simplifies the experimental process considerably.

6 CONCLUSION

Careful and correct experimentation is key to making the right decisions for systems and organizations. While our computational tools have become more powerful, our significance analysis has generally stagnated with the t-test. As we have shown, appropriately applying the t-test is a nontrivial problem (especially if sophisticated scientific expertise is not available), and the t-test can be shockingly wrong when its assumptions are violated. Our results show that it is time for us to re-evaluate the use of nonparametric methods like the permutation test that were previously computationally intractable for most big data use cases. This is particularly important as we work to simplify the process of experimentation and open experimental tools to a larger audience.

In addition to showing how this nonparametric test can be used on a variety of different metrics and use cases in a big data setting, we also show how it can be used to inform other experimental choices, such as the choice of split granularity. We provide a decision framework for these types of experimental choices and explore the use of this framework in practice.

ACKNOWLEDGEMENTS

We would like to thank our managers, Amber Roy Chowdhury, for doing a thorough review of our paper and suggesting edits, and Sanjay Kumar and Umar Farooq, for their support and guidance.

REFERENCES
[1] Don Davis. 2020. Amazon is the fourth-largest US delivery service and growing fast. Digital Commerce 360. https://www.digitalcommerce360.com/2020/05/26/amazon-is-the-fourth%E2%80%91largest-us-delivery-service-and-growing-fast/
[2] Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC, New York, NY. Monographs on Statistics & Applied Probability.
[3] Ronald Aylmer Fisher. 1936. Design of experiments. Br Med J 1, 3923 (1936), 554–554.
[4] B. Guo and Y. Yuan. 2017. A comparative review of methods for comparing means using partially paired data. Statistical Methods in Medical Research 26 (2017), 1323–1340.
[5] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[6] Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130–136.
[7] Eugene Seneta et al. 2013. A tricentenary history of the law of large numbers. Bernoulli 19, 4 (2013), 1088–1121.