                          What’s Mine is Yours, What’s Yours is Mine
                                              Simplifying Significance Testing With Big Data

                 Karan Matnani                                             Valerie Liptak                              George Forman
             Last Mile Tech, Amazon                                   Last Mile Tech, Amazon                        Last Mile Tech, Amazon
             kmatnani@amazon.com                                        liptav@amazon.com                           ghforman@amazon.com

ABSTRACT
At Amazon Last Mile, we deliver over 3.5 billion [1] packages every year, making us one of the largest delivery companies in the world. At this scale, even small changes can have a big business impact. In general, business impact is assessed using controlled experimentation. A standard approach to evaluating whether a controlled experiment resulted in a significant change has been to use a t-test. However, despite our scale, the law of large numbers fails to produce a normal distribution, and the t-test fails up to 99% of the time. In addition to exhibiting non-normal distributions, our application has restrictions on the granularity of control and treatment group splits, and it also suffers from geospatial correlation, which causes treatment effects to be applied across both control and treatment groups at finer granularities (e.g., delivery of packages to multiple homes by the same delivery agent in one stop; a building falling on different routes on two different days depending on the other stops on those days). This introduces a tradeoff between separability of effects through coarse granularity and detection of smaller treatment effects with fine granularity. In this paper, we solve the t-test dilemma using a resampling test at scale, and we further leverage this test to create a scalable, repeatable methodology for choosing the randomization split granularity under these constraints. We produce a sensitivity-optimized randomization strategy using a data-driven approach that has been applied successfully within multiple real experiments at Amazon Last Mile Tech and is generalizable to any experiment.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1   INTRODUCTION
At the scale of billions of packages delivered annually, small changes can have a substantial effect on customer experience. Principled experimentation drives the business in the right direction. At Amazon Last Mile, changes affecting customers are rolled out based on the results of a controlled experiment. Examples of experiments include changing the experience of the mobile application used by delivery agents, updating routing and navigation, or using new algorithms to pick the places where packages are dropped off.
   To quantify the effect of an experiment, the subjects of the experiment are split into Control (C) and Treatment (T) groups, and the goal is to make these splits as fair as possible to enable unbiased, robust experimentation. The quality of the split is measured on three factors. Bias: whether there is a difference in the target variable between the groups before applying the treatment. Power: the ability to detect small effects of the applied treatment. Mixing: the amount of control instances that experience treatment effects, and vice versa. Ideally, we want to maximize power while minimizing bias and mixing.
   Making mistakes in experimentation is costly because of rollbacks, developer and scientist time spent in deep dives, and lost opportunity cost. Raising the bar with resampling tests adds value through informed decision making and rework prevention. With this context, we present the motivation, then the experiment design, followed by the experiment results, and finally a real-world application with ideas for further applications. Our contribution includes presenting a case for the permutation test and showing how it can be made scalable in a real-world scenario.

2   MOTIVATION: T-TEST SIGNIFICANCE TESTS FAIL AT SCALE
2.1   What Makes a Good Randomization Strategy
In the design of this methodology, we decided to measure the quality of the split using three types of metrics.
   (1) Unbiasedness: There will always be some differences between Control and Treatment, but we don't want to declare them statistically significant unless they stem from the application of our treatment.
   (2) Power: the ability to detect small effects of the applied treatment.
   (3) Mixing: the amount of spill of control into treatment and vice versa.
Ideally, there would be negligible pre-treatment differences in the distributions of a business metric like delivery time across the control and treatment groups. Minute lifts or drops post treatment would be detectable. There would be no spilling of control into treatment or vice versa.

2.2   Why Not Use a t-Test?
The t-test is a parametric statistical test for determining whether two samples were drawn from different underlying distributions [4]. It is the standard default approach for most experimentation. As a parametric test, it has several underlying assumptions that are frequently violated in practice.

Table 1: t-test based significance on real delivery data. Actual figures replaced with order of magnitude.

   Randomization        Order of Magnitude      % Significant
   Unit (C/T)           of # Distinct Groups    Runs (p<0.01)
   Delivery Station     Hundreds                99.90%
   Postal Code          Thousands               97.80%
   Street               Millions                53.00%
   Buildings            Tens of Millions        79.50%

   • Assumption of normal distribution: The t-test uses summary statistics of the two samples to fit a known distribution to each sample. These two distributions are then compared for overlap to determine the probability that the two samples could have been drawn from the same underlying distribution. Standard t-test approaches fit a Gaussian (normal) distribution. However, the actual data may not be normally distributed. For example, Figure 1 shows that the target variable in our experiment exhibits a heavy-tailed distribution, and we observe that the matching Gaussian does not imitate the true distribution well. Many practitioners have asserted that due to the central limit theorem the non-normal distribution will approximate a Gaussian as the number of samples increases [7]. However, this is only true if you first perform certain clever transformations of the data.
     Simply sampling more from a heavy-tailed distribution will not produce normally distributed data.
   • Assumption of independence: The t-test assumes independence of data points. This assumption is frequently violated in real-world scenarios. In our experiment, delivery time per package is not independent for mixed stops: both control and treatment addresses may be delivered to in the same stop, so the attribution of treatment gets mixed into both buckets, clearly violating the independence assumption. Additionally, delivery time in prior stops could affect the delivery time of subsequent stops.
   • Assumption of specific knowledge: Which t-test will you pick? If the end user is not a scientist, they will find it difficult to pick the right type of t-test. Expecting users to have specific knowledge reduces adoption, so this method doesn't scale. For example, in the Python Scikit-learn implementation [5], the t-test requires deciding whether tests are:
      (1) One-sample or two-sample.
      (2) One-sided or two-sided.
      (3) Paired or unpaired (for two-sample tests).
      (4) Homoscedastic (equal-variance assumption) or heteroscedastic (for two-sample tests).
      (5) Fixed significance level (boolean-valued) or returning p-values.
To test how the t-test behaves on our Amazon delivery data, we ran an A/A experiment that tested 1,000 different potential C/T splits to see if there was an a-priori significant difference between the two groups (no treatment was applied, so the two groups should be the same). At the end of our simulation we found that the t-test almost always declared significance when it should not have. The column "% Significant Runs (p<0.01)" in Table 1 shows how often it declared significance with p < 0.01. If the test had worked, this figure would be around 1%, as seen in Table 2. Note that this could mislead us into thinking there was a significant impact when there was no change. This shows that a t-test is not reliable when its assumptions are violated.

Figure 1: Gaussian superimposed on the observed delivery time distribution. Both control (blue) and treatment (red) overlap and have a heavy-tailed distribution. By comparison, the Gaussian (green), or normal distribution, with the same mean and standard deviation is much different. The standard t-test approximates our heavy-tailed distribution with this Gaussian distribution.

2.3   A Better Method: Permutation Test (a Resampling Test)
   2.3.1 Permutation Test. Statistical significance tests are intended to determine whether two sample datasets were drawn from different underlying distributions. The t-test is a parametric test that does this by making assumptions about the sample distribution to estimate the probability that the two samples were drawn from the same underlying distribution. There is, however, a nonparametric approach we can take, which calculates these distributions directly from the data and estimates their overlap directly. This is an established methodology known as a permutation test, with various formulations including bootstrapping and Monte Carlo tests [2].
   To understand how permutation tests work, we first assume that the null hypothesis is true [3]. In that case the C and T assignment of a given measurement is interchangeable, because the treatment had no effect. Therefore we can directly calculate the distribution of differences between the C and T distributions under the null hypothesis by randomly sampling C and T groups from the dataset and calculating the difference between the summary statistics of each assignment. We can then compare the true C/T difference to the differences under the null hypothesis to determine whether the true C/T test was significantly different. Pseudocode for the approach follows.

   Given data 𝐷 with original assignment 𝐶_true and 𝑇_true, summary statistic 𝑆, significance level 𝑝, and desired number of resampled permutations 𝑅
   for 𝑖 in 1..𝑅 do
       Select 𝐶_𝑖 and 𝑇_𝑖 from 𝐷;
       Add 𝑆(𝐶_𝑖) − 𝑆(𝑇_𝑖) to null distribution 𝑁
   end
   Calculate the rank 𝑟 of 𝑆(𝐶_true) − 𝑆(𝑇_true) in 𝑁. If 𝑟/𝑅 < 𝑝/2 or 𝑟/𝑅 > 1 − 𝑝/2 (for a two-sided test), then the treatment had a significant effect.
      Algorithm 1: Algorithm for Permutation Test

   2.3.2 Scalability. While this test is simple in theory, historically it was not frequently used due to the expense of performing the permutations. However, due to advances in computational capabilities, we have implemented the permutation test in a distributed library using Dask [6] and have successfully used it on multi-GB datasets involving millions of measurements.
   One step described in the process of selecting a randomization strategy is the choice of key to split on. We index the data by the granularity key to promote fast joins using Dask's data frames. A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index.
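The core computation of Algorithm 1 can be sketched on a single machine with NumPy alone; the function and variable names below are illustrative rather than taken from our distributed library, which parallelizes the same permutation loop with Dask:

```python
import numpy as np

def permutation_test(values, is_treatment, stat=np.mean, r=1000, seed=0):
    """Two-sided permutation test: observed C/T difference and its p-value.

    values: 1-D array of the target metric (e.g., delivery time).
    is_treatment: boolean array, True where the row is in Treatment.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    is_treatment = np.asarray(is_treatment, dtype=bool)

    observed = stat(values[~is_treatment]) - stat(values[is_treatment])

    null = np.empty(r)
    for i in range(r):
        # Under the null hypothesis, C/T labels are interchangeable:
        # shuffle the labels and recompute the difference in the statistic.
        shuffled = rng.permutation(is_treatment)
        null[i] = stat(values[~shuffled]) - stat(values[shuffled])

    # Fraction of permuted differences at least as extreme as the observed
    # one (two-sided); the +1 terms avoid reporting an impossible p of 0.
    p = (1 + np.sum(np.abs(null) >= abs(observed))) / (r + 1)
    return observed, p
```

The returned p-value is then compared against the chosen significance level (0.01 in our experiments); no assumption about the shape of the metric's distribution is needed.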
Batching the data into multiple mini-batches, where the number of batches is a multiple of the number of available processor cores, ensures full utilization of processing power, accelerating computation. Data that has been processed is dropped, leaving behind a summary generated by a summarization function, such as the mean. Arbitrary summarization functions are enabled through an abstract class. The code is heavily parallelized but can run in a few minutes on a laptop on a dataset with millions of rows.

Table 2: Permutation test based significance on real delivery data.

   Randomization        Order of Magnitude      % Significant
   Unit (C/T)           of # Distinct Groups    Runs (p<0.01)
   Delivery Station     Hundreds                1.0%
   Postal Code          Thousands               1.0%
   Street               Millions                1.0%
   Buildings            Tens of Millions        1.0%

   2.3.3 Implementation. We randomly split all data points into C/T buckets, allocating 50% to each bucket. We used each group's mean delivery time as our summary statistic. We permuted the treatment assignments of our addresses 1,000 times. We assume that with 1,000 permutations we can effectively approximate the true distribution that we would obtain from testing every possible permutation of the data (which would take a long time, since there are on the order of tens of millions of events in our data).
   As we can see in Table 2, the permutation test performs as expected (by definition), unlike the parametric t-test, whose assumptions are violated by our data.

3   EXPERIMENT DESIGN
3.1   Picking a Randomization Strategy
Controlled experiments, where the experiment set is split between a control and a treatment set and the effect of the treatment on the target variables is then compared, are the gold standard for experimentation. A critical decision for randomization between C and T is the choice of key that is used for splitting. In general, the advice has been to make that key as granular as possible for proper randomization (e.g., every delivery, every page load for ads, every session for a customer visiting a website), keeping in mind customer experience and system limitations. In the next section, we describe a methodology for evaluating different keys to provide a way to select the randomization key in order to obtain a fair split.
   Previously, in order to find a random split, we would choose a level of granularity that seemed intuitively correct and then test for pre-existing differences. If at that point there was a pre-existing difference, we would re-roll the dice, generate a new split, and iterate on this process until a fair split was found. For example, if we are planning an experiment in week 27, we would use week 26 data and repeatedly select a random C/T set until one was found with no statistically significant difference between C and T. Imagine we iterated 6 times before finding this split. If we had used week 25 data, would we have found that this split was fair? If we needed to choose 6 times before finding a good split, it is improbable that this splitting methodology produces splits whose fairness quality is stable. What does that mean for week 27, our experimental period? We need a better methodology for choosing a granularity that is likely to produce fair splits from the start.
   Using the permutation test methodology described above, we apply it to the different potential splitting keys available to us, for an experiment in changing the delivery locations vended for package delivery. We judge the efficacy of the result based on the criteria described in Section 2.1.

Figure 2: A random split of a part of a city visualized with hypothetical (not actual) delivery events colored by Building Id, split into C/T groups. Every colored pixel represents a delivery event. The red group is control and the blue is treatment. As we can see, it is possible for the same delivery stop to contain buildings in both C and T (mixing).

3.2   Choice of Key
We had a choice of randomizing by the following keys, in decreasing order of granularity:
   • Building: The full normalized string address, e.g., 425 106th Ave NE, Anytown, State 12345. It is a grouping context, typically referring to units in a building or complex. Figure 2 shows delivery events colored by Building Id, with a random C/T split. Every pixel represents a hypothetical delivery event. When colored by control and treatment, the figure shows how a split would look; assume the red group is control and the blue group is treatment.
   • Street: The name of the street, e.g., 106th Ave NE. Figure 3 shows delivery events colored by Street, with a random C/T split, rendered the same way.
   • Postal Code: The zip code, e.g., 98004. Imagine a city split by zip codes, with half of them in control and the other half in treatment.
   • Delivery Station: The site where packages are received, sorted, and prepared for delivery. This is the origin for all deliveries planned for a particular route taken for package delivery, e.g., DS-xxx.
Our key metric for this experiment is Delivery Time, defined as the time taken to deliver packages at a particular stop/address. It includes the time to park at the stop, find the packages in the vehicle, walk to and from the delivery point, and hand over the packages.
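The choice-of-key mechanics above can be illustrated with a small sketch: a deterministic 50/50 assignment keyed on whichever granularity we randomize by, so that all deliveries sharing a key land in the same bucket. The salted-hash scheme below is our own simplification for exposition, not the production allocation code:

```python
import hashlib

def assign_group(key_value: str, salt: str = "sim-0") -> str:
    """Deterministically assign one randomization unit (e.g., a street or
    a postal code) to Control ("C") or Treatment ("T") with a 50/50 split.

    Changing the per-simulation salt re-rolls the assignment, so the same
    unit can be Control in one simulation and Treatment in another.
    """
    digest = hashlib.sha256((salt + "|" + key_value).encode()).digest()
    return "C" if digest[0] % 2 == 0 else "T"

# Deliveries on the same street share one assignment within a simulation.
deliveries = [
    {"street": "106th Ave NE", "postal": "98004"},
    {"street": "106th Ave NE", "postal": "98004"},
    {"street": "Main St",      "postal": "98004"},
]
groups = [assign_group(d["street"]) for d in deliveries]
```

Splitting by a coarser key is the same call with `d["postal"]` as the key; every address in the zip code then shares one bucket.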
Pointing delivery agents to the right location to drop off the package is optimized by Delivery Point models, on which this strategy was first applied.

Figure 3: A random split of a part of a city visualized with hypothetical (not actual) delivery events colored by Street Name, split into C/T groups. Every colored pixel represents a delivery event. The red group is control and the blue is treatment. We see it is less likely that a stop would contain both C and T deliveries (mixing).

   When addresses are split between C/T for each of these keys, we need a metric to measure the desirability of the split. Among these strategies, there are two general risks that make experiments ineffective.
   • Mixing: In geospatial experiments, effects are frequently geospatially correlated, so finer granularity results in a larger portion of data with mixed control/treatment effects. When the unit of randomization is fine-grained, say at the address level, neighboring houses could be in different splits, one in C and the other in T. In this scenario, when a delivery agent stops the van to drop off a package for one of these houses, they also drop off a package at the neighboring house in the same stop, thus leaking the benefit of the vehicle stop time for the treatment house into the control house, and vice versa. Intuitively, this should be a frequently occurring problem when the unit of randomization is very fine-grained. In Figures 2 and 3, looking at the overlap between the red and blue pixels, we notice that in the building-based split the overlap is far more frequent, whereas for streets the overlap is mostly at street intersections, which are much fewer in number than buildings.
   • Biased Selection: To avoid the mixing problem, we could consider a coarse granularity like postal codes. The only time mixing would happen would be in the rare cases where a stop is on the edge between multiple postal codes. However, in selecting postal codes, we introduce biases in our business metrics like Delivery Time. For example, in Seattle the neighborhood of Queen Anne is more spread out than the densely populated neighborhood of South Lake Union, which means that on a metric like delivery time we would expect widely varying results even before we apply our treatment, thus polluting the experiment.
   Firstly, we want to choose a randomization strategy that does not have a high probability of significant differences in A/A testing on historical data. In addition, we want to choose a randomization strategy that minimizes the impact of mixing on our detection of effects that are significant to our application. So in our case we want to choose a split that is able to detect important changes in spite of the mixing and bias effects. To estimate this, we simulate it by applying the minimum important change to real historical delivery data and then applying the permutation test to see whether the change is detected as statistically significant.
   Data: To make this concrete, we used actual delivery stops across all of the US. Every row in this joined dataset corresponds to a package delivery event at an address. We aggregated all deliveries made to an address at the same entry time by the same delivery agent into one delivery-time data point. Building Ids were used to determine the house number and street name of the addresses. We then picked one strategy at a time and repeated the process for each of them, producing cumulative metrics for every day of data.
   Control/Treatment Split: Streaming this data, we randomly allocated every address into the control or treatment group with a 50/50 split based on the strategy being evaluated. So if we were splitting by postal codes, we allocated, say, every address within 98101 into C and every address within 98108 into T. The allocation of an address to a group varied per simulation. To account for the variance in this bucketing method, we ran 1,000 simulations of allocating addresses to C/T groups. Therefore, the same address may be in the control group in one simulation and in treatment in another. This is an integral part of using a permutation test, where a large sample of the possible permutations of the allocation is considered.
   Delivery Time Calculation: To calculate the delivery time for a delivery event, we took the total vehicle stop time in seconds and divided it by the number of packages delivered at that stop. If a stop contained addresses that were mixed between control and treatment, we made a note of it and split the delivery time in the appropriate ratio (if 3 out of 5 packages were for a control address, we split the time 60:40 between C and T).
   Distribution of Target Variable: We made no assumptions about the distribution of the target random variable across the two splits.

4   EXPERIMENT RESULTS
After running 1,000 simulations over tens of millions of shipments and calculating the mean delivery time over each of the permuted C/T splits, we analyzed the results based on the criteria for a good randomization strategy listed in Section 2.1.

4.1   Power & Mixing
Walking through Table 3, the first column is the randomization strategy in increasing order of granularity. As there are many more houses than postal codes, the number of distinct groups increases (finer granularity) as we go down the table. Since there was no applied difference between Control and Treatment, any difference in their means is entirely due to chance, as seen in the tiny values in the Mean difference (C − T) column.
Table 3: Experiment Results Table. Using the permutation test, none of the candidate randomization units have a high
probability of a significant difference in A/A testing on historical data. Note that in the finest granularity we can have a
great deal of mixed C/T effects. However, in the two finer granularities, we can also detect smaller differences between C
and T.


  Randomization Unit C/T        Mean difference (C - T)    95% Confidence Interval     % Stops with     Avg. Mixing
                                       (seconds)                  (seconds)             any mixing        per stop
  Delivery Station                       0.002                  [-1.93, 1.94]              0.00%            0.00%
  Postal Code                            0.006                  [-1.01, 1.03]              0.02%            0.00%
  Street                                 0.000                  [-0.13, 0.13]              0.78%            0.30%
  Building                               0.002                  [-0.15, 0.15]              7.03%            3.10%
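The confidence intervals in Table 3 come from repeatedly re-splitting the historical data at the chosen randomization unit. A minimal sketch of that resampling step, on synthetic data (the group counts, sample size, and gamma delivery-time distribution are illustrative assumptions, not production values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for historical data: one delivery time per stop,
# tagged with its randomization group (e.g. a street id).
n, n_groups = 50_000, 2000
groups = rng.integers(0, n_groups, size=n)
delivery_secs = rng.gamma(shape=2.0, scale=150.0, size=n)  # heavy-ish tail

def permutation_interval(values, groups, n_perm=1000, alpha=0.05):
    """95% range of C - T mean differences under random group-level splits."""
    unique = np.unique(groups)
    idx = np.searchsorted(unique, groups)   # map each row to its group slot
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        # Assign each *group* (not each row) to C or T, mirroring the
        # group-level randomization units of Table 3.
        to_t = rng.random(unique.size) < 0.5
        is_t = to_t[idx]
        diffs[i] = values[~is_t].mean() - values[is_t].mean()
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

lo, hi = permutation_interval(delivery_secs, groups)
print(f"95% null interval for the C - T difference: [{lo:.2f}, {hi:.2f}] seconds")
```

An observed C - T difference falling outside the reported interval would then be declared significant at the 5% level.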



variation due to random assignments to different groups, and the 95% confidence interval in the next column shows how wide this difference can be. Thus we can only declare statistically significant differences that are larger than this interval. However, differences smaller than this interval may be significant to the business, but we may not be able to detect them if we pick the wrong strategy. This is a distribution resampling method [2], which does not make any assumptions about the shape of the distributions.

On the granularity of the split, we note that because the last two rows have many more distinct groups, they are sensitive enough to detect smaller differences, performing better on the sensitivity metric. If we randomize by Street, we could detect an improvement or deterioration of greater than 0.13 seconds.

If there were too much mixing (e.g. 50% of stops), we wouldn't see any difference in the delivery times for C vs. T, limiting our statistical power to detect a real difference. As expected, there was no mixing at the Delivery Station level, while mixing increased with increasing granularity. The 7.03% mixing at the Building id makes sense because delivery agents often deliver to adjacent buildings in one stop, and 7.03% of stops overlap between C and T in our dataset.

Notice that although the Building id has the finest granularity, it does not yield the most sensitive confidence interval in Table 3. This is because 7.03% of the stops experienced mixing of the control and treatment groups.

4.2 Statistical Significance and the Effects of Mixing

We wanted to confirm that we would be able to detect actual differences regardless of mixing. Let's say the treatment reduces delivery time, but because of mixing, that reduction also applies to control addresses in the same stop. To test this pre-experiment, we pretended that the treatment resulted in the smallest important improvement in delivery time. We did this by artificially reducing the delivery time in the treatment set after the split. In order to maximize the effect of mixing, we also extended that improvement to any control address within the same stop (if it was a mixed C/T stop). Then we ran the permutation test as normal to see if we could detect this improvement. We hoped to find that, even under large amounts of simulated mixing:
    (1) we can still detect a sizable difference between the T & C averages;
    (2) we find the difference significant under random divisions of T & C (it exceeds the 95% confidence interval).
We found that we could detect this difference with Street and Building ids, but couldn't detect it with Postal Code and Delivery Station splits. This ruled out the first two options, because even if an experiment worked well, choosing these units wouldn't give us the confidence that it did work. Between Street and Building id, we noted that Street had the highest sensitivity and the least mixing, and could clearly detect a change with a small confidence interval. As a result, we chose Street for this dataset from the US.

5 APPLICATION

5.1 Last Mile at Amazon

The first version of this analysis was done to validate whether we could come up with a methodology that is statistically correct, repeatable, and applicable to the available data. At Last Mile, having found a useful application with Delivery Time estimates, this is now being applied to other business metrics that indicate customer satisfaction.

We have used this approach, with the help of this tool, to rapidly make decisions about experiments in the US and a number of other countries and to measure the effectiveness of those experiments. The next sub-section highlights one example of the application of this method to a recent successful experiment. We can now expand this method to apply to any experiment, find the smallest important difference of significance, and confidently apply it. This approach has been used in various geospatial experiments at Amazon and is extensible to almost any other metric and use case due to its nonparametric nature.

5.2 Application to Production Experiments

While A/A tests on historical data are helpful, the only way to truly tell whether a treatment helped the customer is to launch it and do a controlled test to estimate the treatment effect.

Table 5 shows the results of an experiment launch for a large country. Control and Treatment have an even 50/50 split. The first column is the business metric we care about; here, we only mention the type to preserve business confidentiality. The next column is the difference in Control and Treatment means for each of these metrics. The rank is the rank of the difference in the known range. The "Is Significant?" column shows whether or not the effects observed are statistically significant. As we can see, the nonparametric permutation test has been successfully applied to various metrics with different qualities, including continuous metrics, binary metrics, and rates/percentages, without changes in formulation.

5.3 Generalizability of the Idea

The permutation test and granularity selection approaches we propose here are general and readily extend to other domains and data types. As a nonparametric test, it is straightforward to apply and simplifies the experimental pipeline so that non-scientists
Table 4: Effects of Mixing. Given a simulated important change, the change was detectable in two out of four options.
While the change was detected as significant, we can see that the greater level of mixing in the Building group impairs
our ability to detect fine changes as compared to using the street level of granularity.


 Randomization Unit C/T       % Stops with     95% Confidence Interval of      Real C – T difference     Improvement
                               any mixing      C - T difference (seconds)            (seconds)            Detected?
 Delivery Station                  0.00%             [-1.93, 2.47]                      0.883                 No
 Postal Code                       0.02%             [-1.03, 1.06]                      0.937                 No
 Street                            0.78%             [-0.13, 0.13]                      1.204                 Yes
 Building                          7.03%             [-0.15, 0.15]                      0.957                 Yes
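The simulated-mixing check of Section 4.2 can be sketched the same way: inject a known improvement into the Treatment stops (and into mixed-stop Control addresses), then ask whether the resulting C - T difference lands outside the null interval from random re-splits. All numbers below (a 5-second improvement, a 7% mixed-stop rate, normally distributed delivery times) are hypothetical stand-ins, not production values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 20k stops in 500 randomization groups; delivery
# times are a stand-in normal distribution, not real data.
n, n_groups = 20_000, 500
group = rng.integers(0, n_groups, size=n)
base_secs = rng.normal(300.0, 20.0, size=n)

# Group-level 50/50 split into Control / Treatment.
is_t = (rng.random(n_groups) < 0.5)[group]

# Inject a hypothetical smallest-important improvement into Treatment,
# and (to maximize mixing effects) into a random 7% of Control stops
# standing in for mixed C/T stops.
improvement = 5.0
mixed_c = (~is_t) & (rng.random(n) < 0.07)
observed = base_secs - improvement * (is_t | mixed_c)
real_diff = observed[~is_t].mean() - observed[is_t].mean()  # C - T

# Null distribution: re-split the groups at random many times.
null = np.empty(1000)
for i in range(1000):
    perm_t = (rng.random(n_groups) < 0.5)[group]
    null[i] = observed[~perm_t].mean() - observed[perm_t].mean()
lo, hi = np.quantile(null, [0.025, 0.975])
print(f"C - T difference {real_diff:.2f}s, null 95% interval [{lo:.2f}, {hi:.2f}]")
print("Improvement detected?", real_diff < lo or real_diff > hi)
```

If the injected improvement falls inside the null interval, that randomization unit is too coarse (or too mixed) to certify the change, which is exactly how the coarse options were ruled out above.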


Table 5: Sample 50% dial-up results: shows a statistically significant change in metrics #1, #2 and #4, and shows that metric #3 was not improved significantly. Note that even though some of these changes seem small, they were detected as significant by a t-test. Though these metrics have different types and some are rare events, we are able to run the nonparametric permutation test on all of them to correctly detect significance.


 Business Metric                     Difference T-C      Rank     Is Significant?     Normal Range (95%)
 Metric 1: a continuous number            0.023          98.2           YES            [-0.0212, 0.0223]
 Metric 2: a binary metric             -0.01032           0.0           YES            [-0.0057, 0.0058]
 Metric 3: a percentage metric         0.000005          74.3           NO             [-0.0012, 0.0013]
 Metric 4: a percentage metric            2.31%         100             YES            [-0.0035, 0.0038]
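Because the test only compares an observed mean difference against random relabelings of the pooled data, one code path covers continuous, binary, and percentage metrics alike, as Table 5 illustrates. A toy sketch (the distributions and effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

def perm_test(c, t, n_perm=2000, alpha=0.05):
    """One nonparametric test for any metric type: compare the observed
    T - C mean difference against random relabelings of the pooled data."""
    observed = t.mean() - c.mean()
    pooled = np.concatenate([c, t])
    diffs = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)
        diffs[i] = pooled[len(c):].mean() - pooled[:len(c)].mean()
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return observed, (lo, hi), bool(observed < lo or observed > hi)

n = 10_000
# A continuous metric with an injected shift; a binary metric with a rate lift.
cont = perm_test(rng.gamma(2.0, 10.0, n), rng.gamma(2.0, 10.0, n) + 1.5)
binary = perm_test(rng.random(n) < 0.10, rng.random(n) < 0.13)
print("continuous metric significant?", cont[2])
print("binary metric significant?", binary[2])
```

Note that nothing in `perm_test` changes between the two calls; the metric's type only affects the data passed in.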



can do experiments easily and count on the robustness of the statistical results.

    5.3.1 Outliers with Outsize Impact. In general, this method can be used for distributions that are heavy-tailed or contain significant outliers, or (especially) where independence assumptions are violated frequently enough for t-tests to fail (which can be determined by A/A tests on random splits prior to the experiment). As an example, a common issue in retail is that a popular item can disproportionately drive outcome metrics like the number of sales or clicks. The same concept applies to many other applications. Because the permutation test will randomly assign this popular item to C or T over multiple permutations, it accounts for the fact that a large portion of the expected difference between C and T is due to only one item, resulting in a more accurate assessment of whether the treatment difference is significant or simply the result of one popular item leading the metrics astray.

    5.3.2 Non-Normal Distributions. Because the permutation test is nonparametric, we make no assumptions about the underlying qualities of the distribution. While we have illustrated a heavy-tailed distribution here, we have also successfully applied this test to other types of distributions, including binary/bimodal and standard normal distributions. This means we do not have to worry about characterizing our underlying distribution before applying a significance testing methodology, which simplifies the experimental process considerably.

6 CONCLUSION

Careful and correct experimentation is key to making the right decisions for systems and organizations. While our computational tools have become more powerful, our significance analysis has generally stagnated with the t-test. As we have shown, appropriately applying the t-test is a nontrivial problem (especially if sophisticated scientific expertise is not available), and the t-test can be shockingly wrong when its assumptions are violated. Our results show that it is time for us to re-evaluate the use of nonparametric methods like the permutation test that were previously computationally intractable for most big data use cases. This is particularly important as we work to simplify the process of experimentation and open experimental tools to a larger audience.

In addition to showing how this nonparametric test can be used on a variety of different metrics and use cases in a big data setting, we also show how it can be used to inform other experimental choices, such as the choice of split granularity. We provide a decision framework for these types of experimental choices and explore the use of this framework in practice.

ACKNOWLEDGEMENTS

We would like to thank our managers, Amber Roy Chowdhury, for doing a thorough review of our paper and suggesting edits, and Sanjay Kumar and Umar Farooq, for their support and guidance.

REFERENCES
[1] Don Davis. 2020. Amazon is the fourth-largest US delivery service and growing fast. Digital Commerce 360. https://www.digitalcommerce360.com/2020/05/26/amazon-is-the-fourth%E2%80%91largest-us-delivery-service-and-growing-fast/
[2] Bradley Efron and Robert J. Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC, New York, NY. Monographs on Statistics & Applied Probability.
[3] Ronald Aylmer Fisher. 1936. Design of experiments. Br Med J 1, 3923 (1936), 554–554.
[4] B. Guo and Y. Yuan. 2017. A comparative review of methods for comparing means using partially paired data. Statistical Methods in Medical Research 26 (2017), 1323–1340.
[5] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, Oct (2011), 2825–2830.
[6] Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130–136.
[7] Eugene Seneta et al. 2013. A tricentenary history of the law of large numbers. Bernoulli 19, 4 (2013), 1088–1121.