Improving Methodology in Spreadsheet Error Research

Raymond R. Panko
Shidler College of Business
University of Hawai`i
2404 Maile Way
Honolulu, HI 96821
001.808.377.1149
Ray@Panko.com

ABSTRACT

Too much spreadsheet research is unpublishable in high-quality journals due to poor methodology. This is especially a problem for computer science researchers, who often are untrained in behavioral research methodology. This position paper reflects the author's experiences in reviewing submissions to information systems and computer science journals.

Categories and Subject Descriptors
K.8.1: Spreadsheets. D.2.5 Testing and Debugging.

General Terms
Experimentation, Verification.

Keywords
Methodology, Spreadsheet Experiments, Experiments, Inspection, Sampling, Statistics.

1. INTRODUCTION

For a number of years, computer science journal editors have taken to sending me articles to review that involve experimental and other methodology. It is frustrating to review these studies because they often show a weak understanding of methodology. Fatal methodological errors are too common, and errors that hobble the use of results are even more frequent. In spreadsheet error research, methodological issues have been particularly common in papers by computer scientists. Based on my experience, this paper presents some prescriptions for improving spreadsheet error research. We will look at issues in inspections (audits) of operational spreadsheets, spreadsheet development experiments, and spreadsheet inspection experiments.

2. INSPECTIONS (AUDITS) OF OPERATIONAL SPREADSHEETS

Several studies have inspected corpuses of operational spreadsheets to look for errors. Many studies call this auditing, but auditing is a sample-driven statistical analysis method for developing an opinion about quality in development. Audits are not comprehensive error detection tools.

2.1 Respect Human Error Research

Inspection methodologies often fail to reflect the fact that software and spreadsheet error rates are similar. Consequently, spreadsheet methodologies tend to ignore the rather vast literature on code inspection. By code inspection standards, most spreadsheet inspection methodologies do look like mere audits. They lack the required initial understanding of the spreadsheet, are undertaken on whole spreadsheets instead of modules, use single inspectors, and so forth.

2.2 Don't Trust. Verify.

Spreadsheet inspection methodologies are rarely verified. Instead, they tend to be refined until the researchers "feel good" about them. To verify the effectiveness of a methodology, it is important to have multiple inspectors independently use the same methodology to inspect the same spreadsheets. Comparing errors from multiple inspectors can indicate relative effectiveness in finding different types of errors. If the methodology is strong, cross-analysis can even give an estimate of errors remaining.

2.3 Report Time Spent

Time spent in testing is important in assessing human error research. It is important to reveal inspection rates for individual spreadsheets: both time in total and time relative to size, with size expressed in multiple ways, such as all cells, all formula cells, unique formulas, and so forth. If a spreadsheet inspection method has multiple phases, time in each phase should be reported.

2.4 Understanding the Spreadsheet First

Spreadsheets are not self-documenting. It is important for inspectors to be given a thorough explanation of the spreadsheet's detailed logic before they begin testing.

2.5 Report Error Seriousness

The seriousness of errors, or at least of the most serious error found, should be assessed. Seriousness should be reported by size of each error on monetary or other scales, percentage size of the error relative to the size of the correct value, seriousness of the error in its context, and risk created for the organization. Context must be understood well. In annual budgeting, small errors can be very damaging, while in major one-off projects such as the purchasing of another company, errors would have to be large compared to the results variance caused by uncertainties in input numbers.

3. DEVELOPMENT EXPERIMENTS

In development experiments, participants create spreadsheet models based on requirements in a word problem. To date, we have done well in estimating cell error rate ranges during development. However, there is much more we need to do.

3.1 Use New Tasks

Spreadsheet development experiments have only used a few tasks. We need to do development experiments with more tasks to be confident about typical cell error rates. The widely used Wall and Galumpke tasks have different error patterns. We need to try new tasks to see if new patterns emerge. The Wall task is especially problematic because it was designed to be extremely simple and almost free of domain knowledge requirements. Participants make very few errors on the Wall task.

3.2 Have Adequate Task Length

Errors are rare in spreadsheet development. Tasks need to be relatively long or there will be too few errors to analyze. One way to address this is to have subjects do multiple tasks in a balanced design and to analyze errors in the total multitask sample.

3.3 Go Beyond Student Samples

We also need to do studies on people with different levels of experience in spreadsheet development to ensure that spreadsheet research does not suffer from being the science of sophomores.

3.4 Test Prescriptions for Safety and Effectiveness

We need to move beyond simply claiming that certain prescriptions (such as having a separate assumptions section) and certain tools are good ideas. We must test them to see if they really are "safe and effective." We cannot just build tools and make claims about why they will save the world. Prove it.

3.5 Go All the Way to Error Reduction

Showing that users like it or showing that a tool can help point to earlier cells is not enough. Does it reduce errors? If not, who cares?

3.6 Use Ample Sample Sizes

Sample sizes must be large: at least around 30 to 50 participants per condition. Otherwise, statistical analysis is unreliable. The minimum number should be determined empirically, by a power test.

3.7 Avoid Friends and Family Samples

We also need clean samples. Mixing highly experienced professionals with rank novices in the sample requires far larger samples for statistical validity.

3.8 Do Rigorous Random Assignment to Conditions

Doing rigorous random assignment to the control and treatment groups is mandatory and critical. This must be done on the basis of individuals. We cannot assign whole class sections to different treatments. Nor can we place earlier arrivers in one condition and later arrivers in another condition.

3.9 Use Nonparametric Statistics

It is important to use nonparametric statistics because errors do not follow the normal distribution even roughly. Transforming data so that they are pseudonormal and then applying traditional parametric statistics is not acceptable today.

3.10 Be Generous in Presenting Statistical Results

When giving results, do not just give bare minimum result numbers like means, medians, and standard deviations. Show the full results matrix generated by statistical analysis programs. Also, in comparisons, give overall numerical differences. Do not just say that a difference was statistically significant without giving the numerical differences or correlations.

4. INSPECTION EXPERIMENTS

Inspection experiments should follow the advice in both previous sections. It is wise to avoid seeded errors and go with data from actual development experiments. (The author has such a corpus.)

4.1 Higher Error Rates

One good thing is that human error detection rates are worse than error commission rates, so samples can be a little smaller and still generate enough errors. However, statistical analysis is misleading with fewer than about 30 subjects per group or without rigorous subject randomization.

4.2 Test for Safety and Effectiveness

Again, we need to go beyond simply measuring error detection rates and move to testing alternative methods for finding errors. If we test only two methods, such as doing nothing and using a particular method, then we double the required sample size and must be extremely careful about random treatment assignment. Effect size is also critical in selecting sample sizes.

5. CONCLUSION

We need to stop touting untested prescriptions and tools if we are to put our field on a scientific footing. We must scrutinize prescriptions for safety and effectiveness, and we must do so with exemplary methodology. We also should be balanced in our presentation of results. Everything has strengths and weaknesses. Our results should be honest about weaknesses. Obscuring methodology is a professional sin.
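
As an illustration of the prescriptions on random assignment (Section 3.8) and nonparametric statistics (Section 3.9), the following is a minimal sketch and not a procedure taken from any particular study. It assumes Python with only the standard library. The function names `assign_conditions` and `permutation_test` are hypothetical, and the permutation test on mean error counts is one possible nonparametric choice among several (a rank-based test would be another). It assigns individuals, rather than class sections or arrival cohorts, to conditions, and it reports the numerical difference together with the p-value.

```python
import random


def assign_conditions(participant_ids, conditions=("control", "treatment"), seed=None):
    """Randomly assign individual participants to conditions.

    Participants are shuffled and then dealt round-robin, so group sizes
    differ by at most one. (Hypothetical helper, for illustration only.)
    """
    rng = random.Random(seed)
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return {pid: conditions[i % len(conditions)] for i, pid in enumerate(shuffled)}


def permutation_test(group_a, group_b, n_resamples=10_000, seed=None):
    """Two-sided permutation test on the difference in mean error counts.

    Returns (observed_difference, p_value). The test makes no normality
    assumption; it only assumes that group labels are exchangeable under
    the null hypothesis of no treatment effect.
    """
    rng = random.Random(seed)
    n_a, n_b = len(group_a), len(group_b)
    observed = sum(group_a) / n_a - sum(group_b) / n_b
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b
        # Small tolerance guards against floating-point ties.
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return observed, (extreme + 1) / (n_resamples + 1)


if __name__ == "__main__":
    # Invented participant IDs and error counts, for demonstration only.
    groups = assign_conditions(range(1, 61), seed=42)
    print(sum(1 for g in groups.values() if g == "control"), "in control")
    control_errors = [3, 5, 2, 4, 6, 3, 4, 5, 2, 7]
    treatment_errors = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]
    diff, p = permutation_test(control_errors, treatment_errors, seed=42)
    print(f"difference in mean errors = {diff:.2f}, p = {p:.4f}")
```

Fixing the seed makes the assignment reproducible and auditable, which helps when the assignment procedure has to be described in the paper. Printing the difference alongside the p-value follows the advice in Section 3.10 to report numerical differences and not only statistical significance.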