=Paper=
{{Paper
|id=Vol-2841/SIMPLIFY_2
|storemode=property
|title=What’s Mine is Yours, What’s Yours is Mine: Simplifying Significance Testing With Big Data
|pdfUrl=https://ceur-ws.org/Vol-2841/SIMPLIFY_2.pdf
|volume=Vol-2841
|authors=Karan Matnani,Valerie Liptak,George Forman
|dblpUrl=https://dblp.org/rec/conf/edbt/MatnaniLF21
}}
==What’s Mine is Yours, What’s Yours is Mine: Simplifying Significance Testing With Big Data==
Karan Matnani (Last Mile Tech, Amazon), kmatnani@amazon.com
Valerie Liptak (Last Mile Tech, Amazon), liptav@amazon.com
George Forman (Last Mile Tech, Amazon), ghforman@amazon.com

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021 Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT

At Amazon Last Mile, we deliver over 3.5 billion [1] packages every year, making us one of the largest delivery companies in the world. At this scale, even small changes can have a big business impact. In general, business impact is assessed using controlled experimentation. A standard approach to evaluating whether a controlled experiment resulted in a significant change has been to use a t-test. However, despite our scale, the law of large numbers fails to produce a normal distribution, and the t-test fails up to 99% of the time. In addition to exhibiting non-normal distributions, our application has restrictions on the granularity of control and treatment group splits and also suffers from geospatial correlation, which causes treatment effects to be applied across both control and treatment groups at finer granularities (e.g., delivery of packages to multiple homes by the same delivery agent in one stop; a building falling on different routes on two different days depending on the other stops that day). This introduces a tradeoff between separability of effects through coarse granularity and detection of smaller treatment effects with fine granularity. In this paper we solve the t-test dilemma using a resampling test at scale, and further leverage this test to create a scalable, repeatable methodology for choosing the randomization split granularity under these constraints. We produce a sensitivity-optimized randomization strategy using a data-driven approach that has been applied successfully within multiple real experiments at Amazon Last Mile Tech and is generalizable to any experiment.

1 INTRODUCTION

At the scale of billions of packages delivered annually, small changes can have a substantial effect on customer experience. Principled experimentation drives the business in the right direction. At Amazon Last Mile, changes affecting customers are rolled out based on the results of a controlled experiment. Examples of experiments are: changing the experience of the mobile application used by delivery agents, updating routing and navigation, or using new algorithms to pick the places where packages are dropped off.

To quantify the effect of an experiment, the subjects of the experiment are split into Control (C) and Treatment (T) groups, and the goal is to make these splits as fair as possible so that experimentation is unbiased and robust. The quality of the split is measured on three factors. Bias: whether there is a difference in the target variable between the groups before applying the treatment. Power: the ability to detect small effects of the applied treatment. Mixing: the amount of control instances that experience treatment effects, and vice versa. Ideally, we want to maximize power while minimizing bias and mixing.

Making mistakes in experimentation is costly because of rollbacks, developer and scientist time spent in deep dives, and the lost opportunity cost. Raising the bar with resampling tests adds value through informed decision making and rework prevention. With this context, we present the motivation, then the experiment design, followed by the experiment results, and finally a real-world application with ideas for more applications. Our contribution includes presenting a case for the permutation test and showing how it can be made scalable in a real-world scenario.
2 MOTIVATION: T-TEST SIGNIFICANCE TESTS FAIL AT SCALE

2.1 What makes a good Randomization Strategy

In the design of this methodology, we decided to measure the quality of the split using three types of metrics.

(1) Unbiasedness: there will always be some differences between Control and Treatment, but we don’t want to declare them statistically significant unless they arise from the effect of applying our treatment.
(2) Power: the ability to detect small effects of the applied treatment.
(3) Mixing: the amount of spill of control into treatment and vice versa.

Ideally, there would be negligible pre-treatment differences in the distributions of a business metric like delivery time across control and treatment groups. Minute lifts or drops post treatment would be detectable. There would be no spilling of control into treatment or vice versa.

2.2 Why not use a t-test?

The t-test is a parametric statistical test for determining whether two samples were drawn from different underlying distributions [4]. It is the standard default approach for most experimentation. As a parametric test it has several underlying assumptions, which are violated frequently in practice.

• Assumption of normal distribution: The t-test uses summary statistics of the two samples to fit a known distribution to each sample. These two distributions are then compared for overlap to determine the probability that the two samples could have been drawn from the same underlying distribution. Standard t-test approaches fit a Gaussian (normal) distribution. However, the actual data may not be normally distributed. For example, Figure 1 shows that the target variable in our experiment exhibits a heavy-tailed distribution, and we observe that the matching Gaussian does not imitate the true distribution well. Many practitioners have asserted that, due to the central limit theorem, a non-normal distribution will approximate a Gaussian as the number of samples increases [7]. However, this is only true if you perform certain clever transformations of the data. Sampling more from a heavy-tailed distribution will not produce data that is normally distributed.

• Assumption of independence: The t-test assumes independence of data points. This assumption is frequently violated in real-world scenarios. In our experiment, delivery time per package is not independent for mixed stops: both control and treatment addresses may be delivered to in the same stop, so the attribution of treatment gets mixed into both buckets, clearly violating the independence assumption. Additionally, delivery time in prior stops could affect the delivery time of subsequent stops.

• Assumption of specific knowledge: Which t-test will you pick? If the end user is not a scientist, they will find it difficult to pick the right type of t-test. Expecting users to have this specific knowledge reduces adoption, so the method doesn’t scale. For example, in the Python Scikit-learn implementation [5], the t-test requires deciding whether the test is:
(1) one-sample or two-sample;
(2) one-sided or two-sided;
(3) paired or unpaired (for two-sample tests);
(4) homoscedastic (equal-variance assumption) or heteroscedastic (for two-sample tests);
(5) at a fixed significance level (boolean-valued) or returning p-values.

Figure 1: Gaussian superimposed on the observed delivery time distribution. Both control (blue) and treatment (red) overlap and have a heavy-tailed distribution. By comparison, the Gaussian (green), a normal distribution with the same mean and standard deviation, is much different. The standard t-test approximates our heavy-tailed distribution with this Gaussian distribution.

To test how the t-test works with our Amazon delivery data, we ran an A/A experiment that tested 1,000 different potential C/T splits to see if there was an a-priori significant difference between the two groups (no treatment was applied, so the two groups should be the same). At the end of our simulation we found that the t-test almost always declared significance when it should not have. The column “% Significant Runs (p < 0.01)” in Table 1 shows how often it declared significance with p < 0.01. If the test had worked, this would be around 1%, as seen in Table 2. Note that this could mislead us into thinking there was a significant impact when there was no change. This shows that a t-test is not reliable when its assumptions are violated.

Table 1: t-test based significance on real delivery data. Actual figures replaced with order of magnitude.

Randomization Unit (C/T) | Order of Magnitude of # Distinct Groups | % Significant Runs (p < 0.01)
Delivery Station | Hundreds | 99.90%
Postal Code | Thousands | 97.80%
Street | Millions | 53.00%
Building | Tens of Millions | 79.50%
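To make this A/A failure mode concrete, here is a small, self-contained sketch of such an A/A simulation. The data is synthetic (per-street shared effects plus heavy-tailed noise standing in for real delivery times), and all names and parameters (`street_id`, `delivery_time`, the distribution choices) are our illustrative assumptions, not the paper's pipeline. An event-level Welch t-test that ignores the grouping tends to declare significance far more often than the nominal 1%.

```python
# Hedged sketch of an A/A experiment: split data 50/50 by a randomization
# key many times and count how often a t-test declares significance even
# though no treatment was applied.
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Synthetic stand-in for delivery data: events share a per-street effect
# (violating independence) and carry heavy-tailed noise (violating normality).
streets = np.arange(1_000)
events = pd.DataFrame({"street_id": rng.choice(streets, size=50_000)})
street_effect = rng.normal(60, 15, size=streets.size)  # per-street mean shift
events["delivery_time"] = (
    street_effect[events["street_id"]] + rng.lognormal(2.0, 1.0, size=len(events))
)

def aa_false_positive_rate(df, key, n_splits=1_000, alpha=0.01):
    """Fraction of random A/A splits that a t-test calls significant."""
    keys = df[key].unique()
    hits = 0
    for _ in range(n_splits):
        control_keys = rng.choice(keys, size=keys.size // 2, replace=False)
        in_control = df[key].isin(control_keys)
        c = df.loc[in_control, "delivery_time"]
        t = df.loc[~in_control, "delivery_time"]
        _, p = ttest_ind(c, t, equal_var=False)  # Welch's two-sample t-test
        hits += p < alpha
    return hits / n_splits

# On data like this, the rate is typically far above the nominal 1%.
print(aa_false_positive_rate(events, "street_id"))
```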
2.3 A better method: Permutation Test (a Resampling Test)

2.3.1 Permutation Test. Statistical significance tests are intended to determine whether two sample datasets were drawn from different underlying distributions. The t-test is a parametric test that does this by making assumptions about the sample distribution in order to estimate the probability that the two samples were drawn from the same underlying distribution. In fact, there is another, nonparametric, approach we can take, which calculates these distributions directly from the data and estimates their overlap directly. This is an established methodology known as a permutation test, with various formulations including bootstrapping and Monte Carlo tests [2].

To understand how permutation tests work, we first assume that the null hypothesis is true [3]. In that case the C and T assignment of a given measurement is interchangeable, because the treatment had no effect. Therefore we can directly calculate the distribution of differences between C and T under the null hypothesis by repeatedly drawing random C and T groups from the dataset and computing the difference between the summary statistics of each assignment. We can then compare the true C/T difference to the differences under the null hypothesis to determine whether the true C/T split was significantly different. Pseudocode for the approach follows.

Algorithm 1: Permutation Test
Given data D with original assignment C_true and T_true, summary statistic S, significance level p, and desired number of resampled permutations R:
  for i in 1..R do
    randomly select C_i and T_i from D
    add S(C_i) − S(T_i) to the null distribution N
  end
  Calculate the percentile rank r of S(C_true) − S(T_true) in N. If r/100 < p/2 or r/100 > 1 − p/2 (for a two-sided test), then the treatment had a significant effect.
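Below is a minimal NumPy sketch of Algorithm 1, assuming the standard pooled-shuffle formulation of the permutation test; the function and variable names are ours, not the authors' library. The summary statistic defaults to the mean, and the two-sided decision uses the percentile rank of the observed difference within the null distribution (expressed here as a fraction in [0, 1] rather than r/100).

```python
# Minimal sketch of Algorithm 1 (pooled-shuffle permutation test).
import numpy as np

def permutation_test(control, treatment, summary=np.mean, R=1_000, p=0.01, seed=0):
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([control, treatment])
    n_c = len(control)

    # Null distribution: re-split the pooled data R times, as if the C/T
    # labels were interchangeable (i.e., under the null hypothesis).
    null = np.empty(R)
    for i in range(R):
        perm = rng.permutation(pooled)
        null[i] = summary(perm[:n_c]) - summary(perm[n_c:])

    observed = summary(control) - summary(treatment)
    r = (null < observed).mean()  # percentile rank of the observed difference
    significant = r < p / 2 or r > 1 - p / 2  # two-sided decision
    return observed, r, significant

# Example: a 1-second shift on heavy-tailed data is typically detected
# without any normality assumption.
rng = np.random.default_rng(1)
c = rng.lognormal(2.0, 1.0, 10_000)
t = rng.lognormal(2.0, 1.0, 10_000) - 1.0
print(permutation_test(c, t))
```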
2.3.2 Scalability. While this test is simple in theory, historically it was not frequently used because of the expense of performing the permutations. However, owing to advances in computational capabilities, we have implemented the permutation test in a distributed library using Dask [6] and have successfully used it on multi-GB datasets involving millions of measurements.

One step described in the process of selecting a randomization strategy is the choice of key to split on. We index the data by the granularity key to promote fast joins using Dask’s DataFrames. A Dask DataFrame is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index. Batching the data into multiple mini-batches, sized according to the number of available processor cores, ensures complete utilization of processing power, accelerating computation. Data that has been processed gets dropped, leaving behind a summary generated by a summarization function, such as the mean. Arbitrary summarization functions are enabled through an abstract class. The code is heavily parallelized but can run in a few minutes on a laptop for a dataset with millions of rows.
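The following is a sketch of the distributed pattern described above using the public Dask DataFrame API [6]; it is not the authors' internal library, and the `summary` parameter stands in for their abstract summarization class. Raw events are partitioned into many smaller Pandas DataFrames and reduced to one small summary row per randomization key, so the bulky raw partitions can be discarded after processing.

```python
# Sketch of the per-key summarization pattern with Dask DataFrames.
import pandas as pd
import dask.dataframe as dd

def per_key_summary(events, key, metric, summary="mean", npartitions=8):
    """Collapse raw events into one small summary row per randomization key."""
    ddf = dd.from_pandas(events, npartitions=npartitions)  # many pandas chunks
    # Each partition is reduced in parallel; only the per-key summaries
    # survive, so processed raw data no longer needs to stay in memory.
    return ddf.groupby(key)[metric].agg(summary).compute()

# Usage with hypothetical column names:
events = pd.DataFrame({
    "street_id": ["s1", "s1", "s2", "s2", "s3"],
    "delivery_time": [55.0, 65.0, 40.0, 44.0, 70.0],
})
print(per_key_summary(events, "street_id", "delivery_time"))
```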
2.3.3 Implementation. We randomly split all data points into C/T buckets, allocating 50% to each bucket. We used each group’s mean delivery time as our summary statistic. We permuted the treatment assignments of our addresses 1,000 times. We assume that with 1,000 permutations we can effectively approximate the true distribution that we would obtain from testing every possible permutation of the data (which would take a very long time, since there are on the order of tens of millions of events in our data).

As we can see in Table 2, the permutation test performs as expected (by definition), unlike the parametric t-test, whose assumptions are violated by our data.

Table 2: Permutation test based significance on real delivery data.

Randomization Unit (C/T) | Order of Magnitude of # Distinct Groups | % Significant Runs (p < 0.01)
Delivery Station | Hundreds | 1.0%
Postal Code | Thousands | 1.0%
Street | Millions | 1.0%
Building | Tens of Millions | 1.0%

3 EXPERIMENT DESIGN

3.1 Picking a randomization strategy

Controlled experiments, where the experiment set is split between a control and a treatment set and the effect of the treatment on the target variables is then compared, are the gold standard for experimentation. A critical decision for randomization between C and T is the choice of the key used for splitting. In general, the advice has been to make that key as granular as possible for proper randomization (e.g., every delivery, every page load for ads, every session for a customer visiting a website), keeping in mind customer experience and system limitations. In the next section, we describe a methodology for evaluating different keys to provide a way to select the randomization key that yields a fair split.

Previously, in order to find a random split, we would choose a level of granularity that seemed intuitively correct and then test for pre-existing differences. If there was a pre-existing difference, we would re-roll the dice, generate a new split, and iterate on this process until a fair split was found. So, for example, if we were planning an experiment in week 27, we would use week 26 data and repeatedly select a random C/T set until one was found with no statistically significant difference between C and T. Imagine we iterated 6 times before finding this split. If we had used week 25 data, would we have found that this split was fair? If we needed to choose 6 times before finding a good split, it is improbable that this splitting methodology produces splits whose fairness is stable. What does that mean for week 27, our experimental period? We need a better methodology for choosing a granularity that is likely to produce fair splits from the start.

Using the permutation test methodology described above, we apply this evaluation to the different potential splitting keys available to us for an experiment in changing the delivery locations vended for package delivery. We judge the efficacy of the result based on the criteria described in Section 2.1.
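As a sketch of that evaluation loop, the snippet below builds, for each candidate key, the A/A null distribution of C − T mean differences over many random 50/50 splits of the key's groups, and reports the 95% interval, whose half-width approximates the smallest detectable effect at that granularity. Function names, defaults, and the schema are our illustrative assumptions.

```python
# Sketch: compare candidate randomization keys by their A/A null spread.
import numpy as np
import pandas as pd

def null_distribution(df, key, metric, n_splits=1_000, seed=0):
    """A/A null: C - T mean differences over random 50/50 splits of key groups."""
    rng = np.random.default_rng(seed)
    keys = df[key].unique()
    diffs = np.empty(n_splits)
    for i in range(n_splits):
        control_keys = rng.choice(keys, size=keys.size // 2, replace=False)
        in_c = df[key].isin(control_keys)
        diffs[i] = df.loc[in_c, metric].mean() - df.loc[~in_c, metric].mean()
    return diffs

def evaluate_keys(df, candidate_keys, metric="delivery_time"):
    """One row per candidate key, mirroring Table 3's confidence-interval column."""
    rows = []
    for key in candidate_keys:
        null = null_distribution(df, key, metric)
        lo, hi = np.percentile(null, [2.5, 97.5])
        rows.append({"key": key, "ci_95": (lo, hi),
                     "min_detectable_effect": max(abs(lo), hi)})
    return pd.DataFrame(rows)

# Usage: evaluate_keys(events, ["building_id", "street_id", "postal_code"])
```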
3.2 Choice of key

We had a choice of randomizing by the following keys, in decreasing order of granularity:

• Building: the full normalized string address, e.g., 425 106th Ave NE, Anytown, State 12345. It is a grouping context, typically referring to units in a building or complex. Figure 2 shows delivery events colored by Building Id, with a random C/T split. Every pixel represents a hypothetical delivery event. When colored by control and treatment, the figure shows how a split would look; assume the red group is control and the blue group is treatment.
• Street: the name of the street, e.g., 106th Ave NE. Figure 3 shows delivery events colored by Street, with a random C/T split. Every pixel represents a hypothetical delivery event. When colored by control and treatment, the figure shows how a split would look; assume the red group is control and the blue group is treatment.
• Postal Code: the zip code, e.g., 98004. Imagine a city split by zip codes with half of them in control and the other half in treatment.
• Delivery Station: the site where packages are received, sorted, and prepared for delivery, e.g., DS-xxx. This is the origin for all deliveries planned for a particular delivery route.

Figure 2: A random split of a part of a city visualized with hypothetical (not actual) delivery events colored by Building Id, split into C/T groups. Every colored pixel represents a delivery event. The red group is control and the blue is treatment. As we can see, it is possible that the same delivery stop could contain buildings in both C and T (mixing).

Figure 3: A random split of a part of a city visualized with hypothetical (not actual) delivery events colored by Street Name, split into C/T groups. Every colored pixel represents a delivery event. The red group is control and the blue is treatment. We see it is less likely that a stop would contain both C and T deliveries (mixing).

Our key metric for this experiment is Delivery Time. Delivery time is defined as the time taken to deliver packages at a particular stop/address. It includes the time to park at the stop, find the packages in the vehicle, walk to and from the delivery point, and hand over the packages. Pointing delivery agents to the right location to drop off the package is optimized by the Delivery Point models on which this strategy was first applied.

When addresses are split between C/T for each of these keys, we need a metric to measure the desirability of the split. Among these strategies, there are two general risks that make experiments ineffective.

• Mixing: In geospatial experiments, effects are frequently geospatially correlated, so finer granularity results in a larger portion of data with mixed control/treatment effects. When the unit of randomization is fine-grained, say at the address level, neighboring houses could land in different splits, one in C and the other in T. In this scenario, when a delivery agent stops the van to drop off a package for one of these houses, they also drop off a package at the neighboring house in the same stop, thus leaking the benefit of the vehicle stop time for the treatment house into the control house, and vice versa. Intuitively, this should be a frequently occurring problem when the unit of randomization is very fine-grained. In Figures 2 and 3, looking at the overlap between the red and blue pixels, notice that in the building-based split the overlap is much more frequent, whereas for streets the overlap occurs mostly at street intersections, which are far fewer in number than buildings.

• Biased Selection: To avoid the mixing problem, we might consider a coarse granularity like postal codes. The only time mixing would happen would be the rare cases where a stop is on the edge between multiple postal codes. However, in selecting postal codes, we introduce biases in our business metrics like Delivery Time. For example, in Seattle the neighborhood of Queen Anne is more spread out than the densely populated neighborhood of South Lake Union, which means that on a metric like delivery time we would expect widely varying results even before we apply our treatment, thus polluting the experiment.

Firstly, we want to choose a randomization strategy that does not have a high probability of significant differences in A/A testing on historical data. In addition, we want to choose a randomization strategy that minimizes the impact of mixing on our detection of effects that are significant to our application. So in our case we want to choose a split that is able to detect important changes in spite of the mixing and bias effects. To estimate this, we simulate it by applying the minimum important change to real historical delivery data and then applying the permutation test to see whether the change is detected as statistically significant.

Data: To make this concrete, we used actual delivery stops across all of the US. Every row in this joined dataset corresponds to a package delivery event to an address. We aggregated all deliveries made to an address at the same entry time by the same delivery agent into one delivery time data point. Building ids were used to determine the house number and street name of the addresses. We then picked one strategy at a time and repeated the process for each of them, producing cumulative metrics for every day of data.

Control Treatment Split: Streaming this data, we randomly allocated every address into the control or treatment group with a 50/50 split based on the strategy being evaluated. So if we were splitting by postal codes, we allocated, say, every address within 98101 into C and every address within 98108 into T. The allocation of an address to a group varied per simulation. To account for the variance in this bucketing method, we ran 1,000 simulations of allocating an address to a C/T group. Therefore, the same address may be in the control group in one simulation and in treatment in another. This is an integral part of using a permutation test, where a large sample of possible permutations of the allocation is considered.
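One simple way to realize such a key-based 50/50 split is to hash the key value with a per-simulation salt: every address sharing a key lands in the same bucket within a simulation, while the allocation varies across the 1,000 simulations. The paper does not specify its splitting mechanism, so this hashing scheme is purely illustrative.

```python
# Illustrative (not the authors') deterministic key-based bucketing.
import hashlib

def assign_bucket(key_value: str, salt: int) -> str:
    """Same key -> same bucket within a simulation; varies across salts."""
    digest = hashlib.sha256(f"{salt}:{key_value}".encode()).digest()
    return "C" if digest[0] % 2 == 0 else "T"

# e.g., splitting by postal code: all of 98101 lands together per simulation,
# but may flip between C and T from one simulation to the next.
for sim in range(3):
    print(sim, assign_bucket("98101", sim), assign_bucket("98108", sim))
```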
Delivery Time Calculation: To calculate the delivery time for a delivery event, we took the total vehicle stop time in seconds and divided it by the number of packages delivered at that stop. If a stop contained addresses that were mixed between control and treatment, we made a note of it and split the delivery time in the appropriate ratio (if 3 out of 5 packages were for a control address, then we split the time 60:40 between C and T).
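A minimal sketch of this attribution rule follows, with hypothetical field names: stop time is divided evenly per package, mixed stops are flagged, and the total stop time is split between C and T in the ratio of their package counts.

```python
# Sketch of per-stop delivery-time attribution with mixed C/T stops.
def split_stop_time(stop_seconds: float, packages: list[str]) -> dict:
    """packages is a list of 'C'/'T' labels, one per package at the stop."""
    n = len(packages)
    n_c = packages.count("C")
    return {
        "per_package_seconds": stop_seconds / n,
        "mixed": 0 < n_c < n,                       # note mixed C/T stops
        "control_seconds": stop_seconds * n_c / n,  # C share of stop time
        "treatment_seconds": stop_seconds * (n - n_c) / n,
    }

# 3 of 5 packages for control -> the 100s of stop time split 60:40.
print(split_stop_time(100.0, ["C", "C", "C", "T", "T"]))
```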
Distribution of Target Variable: We worked assumption-free about the distribution of the target random variable across both splits.

4 EXPERIMENT RESULTS

After running 1,000 simulations over tens of millions of shipments and calculating the Mean Delivery Time over each of the permuted C/T splits, we analyzed the results based on the criteria for a good randomization strategy listed in Section 2.1.

4.1 Power & Mixing

Walking through Table 3, the first column is the randomization strategy in increasing order of granularity. As there are many more houses than postal codes, the number of distinct groups increases (finer granularity) as we go downwards. Since there was no applied difference between Control and Treatment, any difference in their means is entirely due to chance, as seen in the tiny values in the Mean difference (C − T) column. But there can be some variation due to random assignments to different groups, and the 95% confidence interval in the next column shows how wide this difference can be. Thus we can only declare statistically significant differences that are larger than this interval. Differences smaller than this interval may still be significant to the business, but we may not be able to detect that they happened if we pick the wrong strategy. This is a distribution resampling method [2], which does not make any assumptions about the shape of the distributions.

Table 3: Experiment results. Using the permutation test, none of the candidate randomization units have a high probability of a significant difference in A/A testing on historical data. Note that at the finest granularity we can have a great deal of mixed C/T effects. However, at the two finer granularities, we can also detect smaller differences between C and T.

Randomization Unit | Mean difference (C − T) in seconds | 95% Confidence Interval | % Stops with any C/T mixing | Avg. mixing per stop
Delivery Station | 0.002 | [−1.93, 1.94] | 0.00% | 0.00%
Postal Code | 0.006 | [−1.01, 1.03] | 0.02% | 0.00%
Street | 0.000 | [−0.13, 0.13] | 0.78% | 0.30%
Building | 0.002 | [−0.15, 0.15] | 7.03% | 3.10%

On the granularity of the split, we note that because the last two rows have many more distinct groups, they are sensitive enough to detect smaller differences, performing better on the sensitivity metric. If we randomize by Street, we can detect an improvement or deterioration of greater than 0.13 seconds.

If there were too much mixing (e.g., 50% of stops), we wouldn’t see any difference in the delivery times for C vs. T, limiting our statistical power to detect a real difference. As expected, there was no mixing at the Delivery Station level, while mixing increased with increasing granularity. The 7.03% mixing at the Building id level makes sense because delivery agents often deliver to adjacent buildings in one stop, and 7.03% of stops overlap between C and T in our dataset.

Notice that although the Building id has the finest granularity, it does not yield the most sensitive confidence interval in Table 3. This is because 7.03% of the stops experienced mixing of control and treatment groups.
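For concreteness, here is a sketch of how the two mixing columns of Table 3 could be computed, assuming a schema of one row per (stop, address) with a C/T bucket label; the schema and names are our assumptions, not the paper's.

```python
# Sketch of the mixing metrics: a stop "has mixing" if it contains both
# buckets, and per-stop mixing is the minority bucket's share of the stop.
import pandas as pd

def mixing_metrics(df: pd.DataFrame) -> tuple[float, float]:
    def minority_share(buckets: pd.Series) -> float:
        frac_c = (buckets == "C").mean()
        return min(frac_c, 1.0 - frac_c)

    per_stop = df.groupby("stop_id")["bucket"].apply(minority_share)
    pct_mixed = (per_stop > 0).mean()  # % of stops with any C/T mixing
    avg_mixing = per_stop.mean()       # average mixing per stop
    return pct_mixed, avg_mixing

deliveries = pd.DataFrame({
    "stop_id": [1, 1, 2, 2, 3],
    "bucket":  ["C", "T", "C", "C", "T"],
})
print(mixing_metrics(deliveries))  # stop 1 is mixed -> 1/3 of stops
```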
4.2 Statistical Significance, and the effects of Mixing

We wanted to confirm that we would be able to detect actual differences regardless of mixing. Say the treatment reduces delivery time, but because of mixing, that reduction also applies to control addresses in the same stop. To test this pre-experiment, we pretended that the treatment resulted in the smallest important improvement in delivery time. We did this by artificially reducing the delivery time in the treatment set after the split. In order to maximize the effect of mixing, we also extended that improvement to any control address within the same stop (if it was a mixed C/T stop). Then we ran the permutation test as normal to see if we could detect this improvement. We hoped to find that, even under large amounts of simulated mixing:

(1) we can still detect a sizable difference between the T and C averages;
(2) we find a significant difference under random divisions of T and C (one that exceeds the 95% confidence interval).

We found that we could detect this difference with the Street and Building ids, but couldn’t detect it with the Postal Code and Delivery Station splits. This ruled out those two options, because even if an experiment worked well, choosing those units wouldn’t give us the confidence that it did work. Between Street and Building id, we noted that Street had the highest sensitivity and the least mixing, and could clearly detect a change with a small confidence interval. As a result, we chose Street for this dataset from the US.

Table 4: Effects of Mixing. Given a simulated important change, the change was detectable in two out of four options. While the change was detected as significant, we can see that the greater level of mixing in the Building group impairs our ability to detect fine changes compared to using the street level of granularity.

Randomization Unit | % Stops with any C/T mixing | 95% Confidence Interval of C − T difference in seconds | Real C − T difference in seconds | Improvement Detected?
Delivery Station | 0.00% | [−1.93, 2.47] | 0.883 | No
Postal Code | 0.02% | [−1.03, 1.06] | 0.937 | No
Street | 0.78% | [−0.13, 0.13] | 1.204 | Yes
Building | 7.03% | [−0.15, 0.15] | 0.957 | Yes
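The pre-experiment simulation of Section 4.2 can be sketched as below: inject the smallest important improvement into all treatment rows and, to maximize the effect of mixing, into control rows that share a mixed stop, then run the permutation test as normal (e.g., the `permutation_test` sketch after Algorithm 1). Column names are illustrative, not the authors' schema.

```python
# Sketch: inject a simulated treatment effect, extended to mixed-stop controls.
import pandas as pd

def inject_improvement(df: pd.DataFrame, seconds: float) -> pd.DataFrame:
    df = df.copy()
    # A stop is "mixed" if it contains more than one bucket label.
    mixed_stops = df.groupby("stop_id")["bucket"].transform("nunique") > 1
    affected = (df["bucket"] == "T") | mixed_stops  # treatment + mixed-stop control
    df.loc[affected, "delivery_time"] -= seconds
    return df

deliveries = pd.DataFrame({
    "stop_id": [1, 1, 2, 3],
    "bucket":  ["C", "T", "C", "T"],
    "delivery_time": [60.0, 60.0, 45.0, 80.0],
})
# Stop 1 is mixed, so both its rows improve; the pure-control stop 2 does not.
print(inject_improvement(deliveries, 1.0))
```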
5 APPLICATION

5.1 Last Mile at Amazon

The first version of this analysis was done to validate whether we could come up with a good methodology that is statistically correct, repeatable, and applicable to the available data. At Last Mile, having found a useful application with Delivery Time estimates, this is now being applied to other business metrics that indicate customer satisfaction.

We have used this approach for making decisions about experiments in the US and a number of other countries, measuring the effectiveness of those experiments rapidly with the help of this tool. The next subsection highlights one example of the application of this method to a recent successful experiment. We can now expand this method to any experiment, find the smallest important difference of significance, and confidently apply it. This approach has been used in various geospatial experiments at Amazon and is extensible to almost any other metric and use case due to its nonparametric nature.

5.2 Application to Production Experiments

While A/A tests on historical data are helpful, the only way to truly tell whether a treatment helped the customer is to launch it and do a controlled test to estimate the treatment effect.

Table 5 shows the results of an experiment launch for a large country. Control and Treatment have an even 50/50 split. The first column is the business metric we care about; here we only mention its type, to preserve business confidentiality. The next column is the difference in Control and Treatment means for each of these metrics. The rank is the rank of the difference in the known range. The “Is Significant?” column shows whether or not the effects observed are statistically significant. As we can see, the nonparametric permutation test has been successfully applied to various metrics with different qualities, including continuous metrics, binary metrics, and rates/percentages, without changes in formulation.

Table 5: Sample 50% dial-up results, showing a statistically significant change in metrics #1, #2, and #4, and that metric #3 was not improved significantly. Note that even though some of these changes seem small, they were being detected as significant by a t-test. Though all of these metrics have different types and some are rare events, we are able to run the nonparametric permutation test on all of them to correctly detect significance.

Business Metric | Difference T − C | Rank | Is Significant? | Normal Range (95%)
Metric 1: a continuous number | 0.023 | 98.2 | YES | [−0.0212, 0.0223]
Metric 2: a binary metric | (0.01032) | 0.0 | YES | [−0.0057, 0.0058]
Metric 3: a percentage metric | 0.000005 | 74.3 | NO | [−0.0012, 0.0013]
Metric 4: a percentage metric | 2.31% | 100 | YES | [−0.0035, 0.0038]

5.3 Generalizability of the Idea

The permutation test and granularity selection approaches we propose here are general and readily extend to other domains and data types. As a nonparametric test, it is straightforward to apply and simplifies the experimental pipeline, so non-scientists can run experiments easily and count on the robustness of the statistical results.

5.3.1 Outliers with Outsize Impact. In general, this method can be used for distributions that are heavy-tailed or contain significant outliers, or (especially) where independence assumptions are violated frequently enough for t-tests to fail (which can be determined by A/A tests on random splits prior to the experiment). As an example, a common issue in retail is that a popular item can disproportionately drive outcome metrics like the number of sales or clicks. The same concept applies to many other applications. Because the permutation test will randomly assign this popular item to C or T over multiple permutations, it will account for the fact that a large portion of the expected difference between C and T is due to only one item, resulting in a more accurate assessment of whether the treatment difference is significant or simply the result of one popular item leading the metrics astray.

5.3.2 Non-Normal Distributions. Because the permutation test is nonparametric, we make no assumptions about the underlying qualities of the distribution. While we have illustrated a heavy-tailed distribution here, we have also successfully applied this test to other types of distributions, including binary/bimodal and standard normal distributions. This means we do not have to worry about characterizing our underlying distribution before applying a significance testing methodology, which simplifies the experimental process considerably.

6 CONCLUSION

Careful and correct experimentation is key to making the right decisions for systems and organizations. While our computational tools have become more powerful, our significance analysis has generally stagnated with the t-test. As we have shown, appropriately applying the t-test is a nontrivial problem (especially if sophisticated scientific expertise is not available), and the t-test can be shockingly wrong when its assumptions are violated. Our results show that it is time to re-evaluate the use of nonparametric methods like the permutation test that were previously computationally intractable for most big-data use cases. This is particularly important as we work to simplify the process of experimentation and open experimental tools to a larger audience.

In addition to showing how this nonparametric test can be used on a variety of different metrics and use cases in a big-data setting, we also show how it can be used to inform other experimental choices, such as the choice of split granularity. We provide a decision framework for these types of experimental choices and explore the use of this framework in practice.

ACKNOWLEDGEMENTS

We would like to thank our managers: Amber Roy Chowdhury, for doing a thorough review of our paper and suggesting edits, and Sanjay Kumar and Umar Farooq, for their support and guidance.

REFERENCES

[1] Don Davis. 2020. Amazon is the fourth-largest US delivery service and growing fast. Digital Commerce 360 (May 26, 2020). https://www.digitalcommerce360.com/2020/05/26/amazon-is-the-fourth%E2%80%91largest-us-delivery-service-and-growing-fast/
[2] B. Efron and R. J. Tibshirani. 1994. An Introduction to the Bootstrap. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Chapman & Hall, New York, NY.
[3] Ronald Aylmer Fisher. 1936. Design of experiments. Br Med J 1, 3923 (1936), 554.
[4] B. Guo and Y. Yuan. 2017. A comparative review of methods for comparing means using partially paired data. Statistical Methods in Medical Research 26 (2017), 1323–1340.
[5] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[6] Matthew Rocklin. 2015. Dask: Parallel Computation with Blocked Algorithms and Task Scheduling. In Proceedings of the 14th Python in Science Conference, Kathryn Huff and James Bergstra (Eds.). 130–136.
[7] Eugene Seneta. 2013. A tricentenary history of the law of large numbers. Bernoulli 19, 4 (2013), 1088–1121.