=Paper=
{{Paper
|id=Vol-2884/paper_107
|storemode=property
|title=Lessons Learned from Problem Gambling Classification: Indirect Discrimination and Algorithmic Fairness
|pdfUrl=https://ceur-ws.org/Vol-2884/paper_107.pdf
|volume=Vol-2884
|authors=Christian Percy,Artur Garcez,Simo Dragicevic,Sanjoy Sarkar
}}
==Lessons Learned from Problem Gambling Classification: Indirect Discrimination and Algorithmic Fairness==
Christian Percy (Playtech Plc), Artur d'Avila Garcez (City, University of London), Simo Dragicevic (Playtech Plc), Sanjoy Sarkar (Playtech Plc)
chris@cspres.co.uk, a.garcez@city.ac.uk, Simo.Dragicevic@playtech.com, Sanjoy.Sarkar@playtech.com

The authors would like to thank Charmaine Hogan and Lauren Iannarone at Playtech for their constructive challenge and feedback. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Problem gambling is a public health issue with approximately 300,000 individuals suffering harm in England and 1.5 million at risk. Many gambling operators rely on Machine Learning (ML) algorithms to identify online players at risk. Models are typically gender-blind (gender not included as an input), reflecting the sensitivity of protected characteristic data. However, some stakeholders worry that gender continues to influence the model via other variables (indirect identification) and worry about differential model performance by gender (algorithmic fairness). In this paper, we investigate these concerns using real-world data from 22,500 players across two gambling operators. We propose a method for testing the indirect identification of a protected variable. We identify near-zero levels of indirect identification of gender. Regarding algorithmic fairness, a slight pro-female bias is found in the first ML model and a moderate pro-female bias is found in the second ML model. The challenge is to mitigate such bias without the intrusion of compulsory gender data collection. We propose a new approach which uses gender data for training only, constructing separate models for each gender and combining the trained models into an ensemble that does not require gender data once deployed. Since harm identification adopts a precautionary principle, if any one model indicates potential harm, the player is flagged as at risk. This approach is shown to reduce the difference per gender in the True Positive Rate (TPR) of the models from 7.2% points to 4.0% points, which is better than what can be achieved by simply altering the models' classification thresholds. Both the indirect identification and the algorithmic fairness approaches are part of a wider framework and taxonomy being proposed towards the ethical use of Artificial Intelligence (AI) in the gambling industry.
===Introduction===
Problem gambling is a public health issue with approximately 300,000 individuals suffering harm in England and 1.5 million at risk. Many gambling operators rely on Machine Learning (ML) algorithms to identify online players at risk. Models are typically gender-blind (gender not included as an input), reflecting the sensitivity of protected variables. However, protected variables may continue to influence an ML model outcome by proxy (via other variables) in ways that can make the identification of bias even harder, and make bias correction towards algorithmic fairness impossible.

In this paper, we report the lessons learned from work by Playtech plc, a provider of B2B and B2C gambling software services, investigating the role of gender in Playtech's gambling harm identification algorithms. The paper connects cross-sector concerns around algorithmic fairness with the specific public health work on problem gambling mitigation. In online gambling, there is a need to balance harm protection against e-consumers' desires and rights to privacy, with a default that minimises the use of sensitive data. As such, online gambling is a relevant and challenging test case for exploring issues around algorithmic bias. The main contributions of this paper are:
• A technique for identifying when indirect discrimination exists in ML models where a potentially problematic variable has been dropped ('model-matched indirect identification').
• An approach for incorporating insight based on protected variables into the classification algorithms without requiring such data to be collected compulsorily from individuals ('blind-separate models').
• The evaluation of the above technique and approach using gender in a real-world use case in the gambling sector, which reveals near-zero indirect identification and yet the potential to improve algorithmic fairness, as defined by a reduction in the difference between the model's true positive rate for men and women.

We see this work as part of a broader project, both within the gambling sector and on the ethics of AI more generally. In the gambling sector, this paper offers the basis for the formation of a sector-wide working group to study algorithmic fairness and to consider how to address multiple objectives, such as overall true positive rates versus true negative rates, model performance disparities by key customer groups, model aggregation policies, parsimony and transferability.

On the ethics of AI more generally, a framework and taxonomy are being developed, which include the concerns of algorithmic fairness and bias identification and other use cases to improve fairness. Our approach recognises that the diversity and complexity of ML models and data sets, use cases and stakeholders' priorities are such that no single technique can be recommended universally, but also that certain principles and a framework should be developed and applied both specifically and across the gambling industry, given the importance of the issues being discussed. (In this we seek to respond to calls by AI researchers, such as Goodman (2016) of the Oxford Internet Institute, to develop frameworks for so-called 'algorithm audits', and by sector bodies, such as the EC's Advisory Committee on Equal Opportunities for Women and Men (2020), which recommends monitoring of algorithms for discrimination and calls for further work to develop and share good practices. The UK Government's CDEI further describes a lack of clear regulatory standards and quality assurance, e.g. around algorithmic bias, as one of the five key trust-related barriers holding back AI (CDEI, 2020:4).)

The remainder of the paper is organised as follows: in Section 2, we position the paper in the context of the related work; in Section 3, we describe the proposed technique and fairness approach within the problem gambling use case; in Section 4, we evaluate the influence of gender on indirect identification and options to enhance algorithmic fairness; in Section 5 we conclude and discuss directions for future work.
===Related Work===
====Algorithms for problem gambling harm reduction====
Problem gambling is a public health issue with 300,000 estimated individuals in England self-reporting as experiencing harm (0.7% prevalence, Gambling Commission, 2018) and a further 1.5 million thought to be at risk. Sector organisations take a range of steps to mitigate, identify and intervene to reduce gambling-related harm, including the development of Machine Learning algorithms to identify players at risk of harm. Some examples of early interventions that operators might take, having identified someone above a certain threshold of risk or possible harm, include tailored responsible gambling messages or reduced marketing activity.

Playtech plc has an in-house suite of ML algorithms trained to identify players with similar characteristics as players who have self-identified as experiencing harm - see Percy et al (2016) for background on these supervised ML models. Several of Playtech's operations fall under the purview of the European Union's General Data Protection Regulation (GDPR; officially adopted in April 2016). Adopting a precautionary interpretation of GDPR principles of data minimisation and protections for special category data, the default decision on algorithms implemented by Playtech was not to incorporate the player's gender, achieving typical cross-validation AUROC rates of 95%+ on balanced data sets by using behaviour and transaction data alone.

However, the adoption of gender-blind algorithms is under review. Regulatory advice from the UK Gambling Commission in 2018 suggests that demographic data can be used as part of satisfying regulatory requirements (https://www.gamblingcommission.gov.uk/for-gambling-businesses/Compliance/General-compliance/General-Data-Protection-Regulation-GDPR.aspx). There is also increasing awareness of the role of gender in problem gambling. A UK gambling charity has reported that the rate of problem gambling amongst women increased by a third in the five years to 2019, a faster rate of increase than among men in the same period (15%) (www.telegraph.co.uk/news/2020/01/15/female-gambling-addicts-growing-faster-men-amid-rise-online, accessed August 2020). Therefore, as the industry becomes increasingly reliant on ML algorithms to detect problematic play, the changes in the demographic profiles of problem gamblers raise questions about the suitability of historic gender-blind data sets, their role in training models, and their potential impact on model efficacy. This can be seen as part of a broader trend arguing that the traditional gender-blind approach in gambling research is inappropriate (Baggio et al, 2018) and implicitly male-biased to the detriment of female gamblers (McCarthy et al, 2019; Venne et al, 2019).

The research reported in this paper was initiated to address two possible stakeholder concerns that point towards opposite modelling responses. The first is that gender remains an (unwanted) influence on the gender-blind model via its indirect associations with other variables (indirect discrimination). Here, the goal is to remove as much of this influence as possible. The second concern is whether there is a missed opportunity for using gender data in a way that enhances algorithmic performance and fairness by identifying and mitigating differences in model performance by gender group (algorithmic fairness). The first concern is motivated directly by an awareness of the sensitivity of gender data, both to consumers and in legislation (GDPR). The second concern reflects an awareness of average structural differences between men and women that may be relevant for predicting gambling risk. For instance, research has related testosterone levels to risk-taking and pathological gambling (Stenstrom and Saad, 2011), identified gendered behavioural patterns in gambling problems (Wong et al, 2013), and observed gendered patterns in the types of online behaviour that can be addictive (Su et al, 2020).
====Bias in AI algorithms and mitigation====
Concerns about bias in AI algorithms in relation to socio-demographic traits have now become widespread. The second of Google's seven principles for AI is to avoid creating or reinforcing unfair bias (https://ai.google/principles/, accessed August 2020; published 2018 at https://www.blog.google/technology/ai/ai-principles/). Organisations and researchers are responding to this concern in different ways, which can be grouped based on whether they seek to intervene at the input level, at the model level or at the output level.

At the input level, one approach, already discussed, is to exclude the variable corresponding to the socio-demographic trait in question. For instance, Goldman Sachs in its operation of Apple Card deliberately avoids collecting and using data on sensitive characteristics such as gender, race or age, using this approach to defend against concerns of gender bias (reported by Kevin Peachey for BBC News, 18 November 2019, 'Sexist and biased? How credit firms make decisions', https://www.bbc.co.uk/news/business-50432634). However, this practice typically proves an insufficient defence in the face of evidence of gender bias in the outcomes: New York's Department of Financial Services opened an investigation into Apple Card in late 2019 given different credit limits provided to men and women despite apparently similar financial circumstances ('Apple's sexist credit card investigated by US regulator', BBC News, 11 November 2019, https://www.bbc.co.uk/news/business-50365609). In another example, analysis by Obermeyer et al (2019) revealed that the use, in a widely-used commercial algorithm, of health costs as a proxy for healthcare needs resulted in anti-Black racial bias; the authors recommend removing health costs as an input variable acting as a proxy for needs. Another approach is to increase the availability and diversity of training data relating to the input variable in question, which was part of Microsoft's 2018 strategy for reducing the error rate discrepancy between men and women and between lighter and darker skin tones in its image classification tool Face API (https://blogs.microsoft.com/ai/gender-skin-tone-facial-recognition-improvement/).

At the model level, algorithms or their parameters can be adjusted to reduce the extent to which a model draws on certain patterns in the input data. One example of this is the gender-debiasing techniques developed for word embedding solutions (Bolukbasi et al, 2016), noting that the authors describe a mixture of adjusting inputs and model-level adjustments.
At the output level, Moerel (2018) describes LinkedIn's recruitment tool as a way of enforcing quotas using the rankings produced by an algorithm in order to match a pre-defined desirable ratio. The tool can subdivide candidates by gender, rank each candidate within each gender using its algorithm, and then put forward an equal number of men and women to the hiring manager for consideration.

Some of the techniques above have come under challenge. For instance, the simplistic approach of dropping socio-demographic input variables (blinding an algorithm) has come under challenge for inadvertently distracting from fairness by reducing visibility of the issue, by ignoring possible proxy variables for socio-demographic traits and by ignoring opportunities to implement other solutions - see, e.g., the analysis of US college admissions by Kleinberg et al (2018), which argues for data-led proactive intervention at the output level.

Focusing particularly on identifying output-level bias, new tools are being developed to identify whether ML algorithms are biased in terms of having systematically worse performance (e.g. lower accuracy) for particular groups. Facebook announced the testing of an internal tool to do this in 2018, Fairness Flow, which was discussed further in its July 2020 Civil Rights Report as part of efforts to tackle algorithmic discrimination (https://about.fb.com/wp-content/uploads/2020/07/Civil-Rights-Audit-Final-Report.pdf). Google's open-source What If Tool in TensorBoard was launched in 2018 to help ML developers visualise differences in classification from key variables, identify borderline cases for particular classifications and explore the impact of counterfactuals as part of assessing whether an inappropriate social bias might have been absorbed from the training data or otherwise reflected in the model (https://ai.googleblog.com/2018/09/the-what-if-tool-code-free-probing-of.html). The use of counterfactuals for explainable AI, e.g. White and Garcez (2020), has become increasingly associated with the goals of fairness in ML. Various other methods addressing fairness have been proposed recently, each adopting its own measure of fairness. Notably, Dwork et al (2012) introduce a framework for fair classification with a task-specific metric, maximizing utility subject to a fairness constraint. Agarwal et al (2018) propose a cost-sensitive classifier, also in an attempt to model a specific loss function subject to fairness constraints; results are evaluated empirically on a variety of data sets. Choi et al (2019) focus on a specific family of classifiers, naive Bayes, and introduce the notion of a discrimination pattern alongside an algorithm for mining discrimination patterns in a naive Bayes classifier. The approach is iterative and seeks to eliminate such patterns until a fair model is obtained. An overview of the various notions of fairness can be found in Zemel et al (2013) and Dwork et al (2012). More comprehensive surveys are available in Friedler et al (2019) and Mehrabi et al (2019).
===Problem Gambling Use Case===
====Data available====
We work with two real-world data sets used to train the Random Forest harm prediction algorithms currently deployed by Playtech. The two gambling operators have different brands (one Bingo-focused and one Slot-Machine-focused), which will help demonstrate the diversity of circumstances even in a narrow ML domain.

The binary classification algorithms use data on whether players voluntarily self-excluded from the gambling platform during the analysis period, as an approximate proxy for experiencing harm. Only regular players who have been live on the platform for at least 1-2 months are included in the training data sets, given the focus of the algorithm on regular players. The open source Weka tool was used to replicate a comparable model to the deployed model with the same (approx. 40) behavioural input variables. A Random Forest model was trained on the raw, unbalanced training data, resulting in a maximum accuracy along the ROC curve for the two trained models within 1%pt of the deployed models. The two trained models (one for each operator) that are generated by this process are referred to in this paper as the baseline models.

These training data sets are enriched for the purpose of this study with gender data voluntarily supplied by players during the sign-up process. The gender variable can take three values: male (M), female (F) or unspecified/undeclared (U). Caveats remain with the quality of the available gender data, including possible gender bias in this voluntary supply of self-identification data as well as possible simplifications and distortions in having only two explicit categories for gender for players to select. Table 1 includes the summary descriptive data.

{| class="wikitable"
|+ Table 1: Gender metrics from two gambling operators.
|-
! Metric by gender (F/M/U) !! Operator 1 (slots-focused brand, n = 4,340) !! Operator 2 (bingo-focused brand, n = 18,275)
|-
| Gender balance in training data || F: 20.6%, M: 32.6%, U: 46.8% || F: 36.5%, M: 10.4%, U: 53.1%
|-
| Self-exclusion outcome || F: 20.4%, M: 24.4%, U: 16.8% || F: 17.1%, M: 18.7%, U: 22.3%
|-
| Baseline RF model TPR || F: 67.0%, M: 65.3%, U: 66.5% || F: 53.7%, M: 46.5%, U: 52.9%
|-
| Baseline RF model TNR || F: 94.4%, M: 95.1%, U: 95.0% || F: 96.7%, M: 98.1%, U: 97.2%
|-
| Baseline RF model accuracy || F: 88.8%, M: 87.9%, U: 90.2% || F: 89.3%, M: 88.5%, U: 87.4%
|}
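The replication just described was carried out in Weka on Playtech's data. Purely to illustrate the shape of that baseline setup, the following minimal sketch uses Python/scikit-learn as an assumed stand-in; the file name, the `behaviour_cols` list and the `self_excluded` label are hypothetical, not the paper's actual schema.

```python
# Illustrative sketch only: a baseline harm-identification Random Forest trained
# on raw, unbalanced player data, mirroring the replication described above.
# File/column names are hypothetical; the paper itself used Weka, not scikit-learn.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

players = pd.read_csv("operator_players.csv")        # hypothetical player-level table
behaviour_cols = [c for c in players.columns
                  if c not in ("player_id", "gender", "self_excluded")]

X = players[behaviour_cols]                          # ~40 behavioural input variables
y = players["self_excluded"]                         # self-exclusion as a proxy for harm

baseline_rf = RandomForestClassifier(max_depth=10, random_state=0)
print("Mean CV accuracy:", cross_val_score(baseline_rf, X, y, cv=5).mean())
```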
====Bias definition====
We identify three initial areas of analysis where metrics can usefully be analysed by gender: the gender balance in the training data, the self-exclusion rates in the training data, and the performance of the models for separate genders. All three sets of metrics are identified for reporting purposes, but only the model performance data is proposed as a metric for assessing potential problematic bias. Despite government-commissioned population surveys providing more detail by gender, population-wide surveys cannot be related to the gender ratios in an individual operator platform, as customer bases attracted by a particular brand are not representative of the overall gambling community.

Focusing on model performance, we are interested in a model that performs similarly well for each gender in terms of true positives (as a proxy for spotting those who are likely to be at risk) and true negatives (reducing any disruption or false alerts for players unlikely to be at risk). Since negative examples dominate in all populations and given the precautionary emphasis on identifying possible harm, the True Positive Rate (TPR) is the chosen primary performance metric for comparisons used in this paper. Given that we would not expect exact equality of performance by gender even in a perfectly fair algorithm, we also identify a tolerance threshold by which model performance might be identified as insufficient in that it should prompt action. For the exploratory purposes of this paper, we use a 2%pt difference in TPR performance among gender categories as such a threshold, noting that such a threshold must ultimately be informed by stakeholder consensus.
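As a small, hedged illustration of this metric, the sketch below computes the TPR separately for each gender group and compares the female-male gap against the 2%pt tolerance threshold; the arrays are illustrative stand-ins for per-player labels, model predictions and declared gender on an evaluation set.

```python
# Sketch: per-gender True Positive Rate and the F-M delta compared against the
# 2%pt tolerance threshold discussed above. The arrays below are illustrative.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # 1 = self-excluded (proxy for harm)
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])   # baseline model predictions
gender = np.array(["F", "F", "F", "M", "M", "M", "U", "U"])

def tpr(y_t, y_p):
    """True Positive Rate = TP / (TP + FN)."""
    pos = y_t == 1
    return (y_p[pos] == 1).mean()

rates = {g: tpr(y_true[gender == g], y_pred[gender == g]) for g in ("F", "M", "U")}
delta = abs(rates["F"] - rates["M"])
print(rates, "| F-M delta: %.3f | exceeds 2%%pt tolerance: %s" % (delta, delta > 0.02))
```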
===Experimental Results===
====Assessment of the influence of gender on the model (indirect discrimination)====
Since gender is not included in the original algorithm there is no potential for direct use of gender in the model. However, gender may still be indirectly identified in the model via the correlations between other input variables and gender or other patterns in the data.

The standard initial investigation of relationships between variables is the correlation coefficient. For Operator 2, 8 of the 40 input variables have a correlation coefficient statistically significant at the 5% Bonferroni-adjusted level or better for male-reported gender, and 5 of the 40 input variables have the same correlation in the case of female-reported gender. However, this pattern is near-trivial by nature, in that the statistical significance reflects the large sample size rather than the meaningfulness of the co-variance. The r-squared from a linear regression using all statistically significant variables reveals that such variables only explain 0.6% of the linear variation in the male-reported gender dummy variable (RMSE of 0.30; RMSE across five-fold cross-validation varies from 0.30 to 0.31) and 1.1% of the linear variation in the female-reported gender dummy variable (RMSE of 0.48, with an RMSE of 0.48 in each of the five folds too). For Operator 1, there are no such statistically significant variables for the male-reported gender dummy and only two for the female-reported gender dummy, accounting for 2.2% of the linear variation (RMSE of 0.40, varying from 0.39 to 0.41 across five folds).
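The linear screen described above can be sketched as follows. This is an illustrative reconstruction under assumptions: the `players` table and `behaviour_cols` list are the hypothetical names from the earlier baseline sketch, and scipy/scikit-learn stand in for whatever tooling was actually used by the authors.

```python
# Sketch of the linear screen above: Bonferroni-adjusted significance tests of
# each behavioural input against a gender dummy, then R^2 / RMSE from a linear
# regression on the significant inputs. Reuses the hypothetical `players` table
# and `behaviour_cols` list from the earlier baseline sketch.
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

male_dummy = (players["gender"] == "M").astype(int)
alpha = 0.05 / len(behaviour_cols)                # Bonferroni adjustment over ~40 tests

significant = [c for c in behaviour_cols
               if pearsonr(players[c], male_dummy)[1] < alpha]
print(len(significant), "inputs significant at the Bonferroni-adjusted 5% level")

if significant:
    r2 = cross_val_score(LinearRegression(), players[significant], male_dummy,
                         cv=5, scoring="r2")
    rmse = -cross_val_score(LinearRegression(), players[significant], male_dummy,
                            cv=5, scoring="neg_root_mean_squared_error")
    print("Mean R^2 %.3f, RMSE range %.2f-%.2f" % (r2.mean(), rmse.min(), rmse.max()))
```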
The correlation coefficient only identifies linear relationships, whereas the model in question is a Random Forest of depth 10 which is likely to identify non-linear patterns. While many common polynomial relationships in real-world gambling data might still be hinted at in a significant linear relationship, other relevant patterns would not. For instance, the Random Forest models used in this research were seen to identify relationships based on common values after the decimal point in a data set in which linear variation with gender had been removed by decomposition. Subtracting the average value of a particular variable for each gender artificially results in zero linear correlation, but can leave gender-driven patterns in the values after the decimal point, which remain exploitable by a Random Forest model.

Instead of linear correlation coefficients as a generic technique, we propose identifying the maximum possible level of indirect identification in a model-dependent manner, using a model with the same structure and parametrization as the original baseline model, an approach we call 'model-matched indirect identification'.

If the model were linear, with no interaction terms, bivariate linear correlations would reflect the model structure and would be appropriate to capture possible indirect identification. In our case, we train a new model using the same ML method (Random Forests) with the same model parameter selection as the baseline model and the same set of predictor variables, but this time using gender as the target classification variable. The target variable from the baseline model, self-exclusion, does not appear in this new model.

The accuracy of this new model is seen as a bound on how well gender can be indirectly identified in the baseline model, since the new model is optimised to predict gender explicitly, whereas the prediction of gender would only have been an indirect goal of the baseline model (motivated only insofar as implicitly predicting gender midway through the model may later help to predict self-exclusion), which is optimised to predict self-exclusion only. By using the same parametrization, we seek to find a maximum bound for the given use case (i.e. model + data) and to avoid the ambiguity of a possibly exponential variety of implicit interaction terms.

The gender-classification models produced by this approach have an out-of-bag (OOB) error for Operator 1 of 0.5205 and for Operator 2 of 0.4637. (The OOB error is a metric generated internally by Weka's Random Forest algorithm. It is equivalent to a validation-set performance measured for a fold of cross-validation, in that the RF algorithm deliberately excludes a set of observations in the construction of each tree; the OOB error measures the classification error rate for such excluded observations, taking the majority classification for each observation that has been excluded from various trees.) Collectively, this suggests that there is little indirect identification of gender beyond a random guess based on the majority class.
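A minimal sketch of model-matched indirect identification follows, assuming the hypothetical `players` table and `behaviour_cols` list from the earlier sketches and treating scikit-learn's out-of-bag score as a stand-in for Weka's OOB error.

```python
# Sketch of 'model-matched indirect identification': a Random Forest with the
# same parametrization as the baseline model, retargeted to predict declared
# gender (M/F/U) from the same behavioural inputs. An OOB error close to that
# of a majority-class guess suggests little indirect identification of gender.
from sklearn.ensemble import RandomForestClassifier

gender_rf = RandomForestClassifier(max_depth=10, oob_score=True, random_state=0)
gender_rf.fit(players[behaviour_cols], players["gender"])

oob_error = 1.0 - gender_rf.oob_score_
majority_error = 1.0 - players["gender"].value_counts(normalize=True).max()
print("OOB error %.4f vs majority-class error %.4f" % (oob_error, majority_error))
```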
====Assessment of performance bias in baseline models (algorithmic fairness)====
Table 1 reveals that male players outweigh female players on the slots-focused brand (1.6x prevalence) and male players are outweighed on the bingo-focused brand (3.5x prevalence). In both cases, undeclared gender is the most common group. Men tend to see higher levels of self-exclusion than women.

For Operator 1, there is little clear distinction in model performance by gender. The model is slightly better, based on TPR, at identifying women at risk than men, but stays within the 2%pt tolerance threshold. However, for Operator 2 there is a more marked higher model performance among female players than male players, with much higher TPR (+7.2%pts) and slightly higher overall accuracy (+0.8%pts). This gender delta by TPR is higher than the specified 2%pt threshold, prompting an exercise to see how it might be mitigated, as follows.

====Options to enhance algorithmic fairness====
In the online gambling use case, similar to other e-retail use cases, there is a strong sector preference for not compelling users to share sensitive data in order to use the services, both recognising the potential intrusiveness of such questions and the ease with which they can be inaccurately answered by those who would prefer not to be asked. For this reason, gender is a voluntary data point shared by players.

We test two mitigation methods for Operator 2 that do not require compulsory gender data: first, the inclusion of gender as an additional input variable (allowing Unspecified (U) to be one of its values); secondly, an ensemble method which is gender-blind at its deployment and which uses multiple gender-separated models, aggregated to form an overall view on a player's risk. Naturally, if accurate gender data were assumed available for all players, other methods exist for reducing performance bias, provided stakeholders tolerate modelling structural differences by gender. For instance, separate classification thresholds could be set for men and women (potentially as part of gender-separated models), thus weighting false positives differently by gender, or output quotas could be set such that the top X highest-scoring male players and top Y highest-scoring female players are classified as at risk to meet a benchmark quota (potentially balanced against a decision rule that does not allow the quota to apply below a certain classification probability or to clash with the above mentioned precautionary approach).

The first option above proved ineffective. Gender has little impact on the model: the original 0.1412 OOB error worsens marginally to 0.1440 with gender included. Male gender ranks 39 out of 42 input variables in terms of feature frequency in the Random Forest model, and female gender ranks 38 out of 42. The gender delta on TPR improves to 4.9%pts (reduced from 7.2%pts), but only with a worse TPR performance among women and no improvement among men.
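For illustration only, the first mitigation option can be sketched as below, reusing the hypothetical `players` table and `behaviour_cols` list; scikit-learn's feature importances are used merely as a rough stand-in for the Weka feature-frequency ranking reported in the paper.

```python
# Sketch of the first mitigation option: re-train the baseline model with the
# declared gender (M/F/U) added as one-hot input features, then inspect the OOB
# error and where the gender features rank. feature_importances_ is only a
# rough proxy for Weka's feature-frequency ranking used in the paper.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_plus = pd.concat([players[behaviour_cols],
                    pd.get_dummies(players["gender"], prefix="gender")], axis=1)

rf_with_gender = RandomForestClassifier(max_depth=10, oob_score=True, random_state=0)
rf_with_gender.fit(X_plus, players["self_excluded"])

importance = pd.Series(rf_with_gender.feature_importances_, index=X_plus.columns)
rank = importance.rank(ascending=False)
print("OOB error with gender included: %.4f" % (1.0 - rf_with_gender.oob_score_))
print(rank[[c for c in X_plus.columns if c.startswith("gender_")]])
```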
In the second option, the blind-separate model, we train three separate models for confirmed male players, confirmed female players and gender-unspecified or undisclosed players. If any one model identifies a player as a likely self-excluder, the player is predicted to be at possible risk, reflecting the precautionary approach applied across many problem gambling identification strategies. As such, the overall classification approach is gendered but does not draw on gender as an explicit input variable once deployed. In this way, opt-in privacy of customers is preserved without a loss of access by customers to the best performing algorithms. A loss of access might happen if, for instance, one model were trained with gender data, while another (less accurate) model were trained without gender data, with the latter model used whenever a customer chooses not to share gender data. In our approach, gender data is only required for a sample of the players, which might be developed from voluntarily provided data (as done here) or via an ad-hoc collection for the sole purpose of such a model. This approach reduces the gender disparity in TPR, but at the cost of the true negative rate (TNR) and accuracy. Male TPR increases from 46.5% to 54.7%, reducing the gender delta from 7.2%pts to 4.0%pts. It is important to note that this improvement in TPR and reduction in delta cannot be achieved by simply altering the classification thresholds in the baseline model: the delta increases to 7.3%pts in the baseline model if its classification threshold is adjusted until the male TPR matches the male TPR from the blind-separate model. This provides confidence that the blind-separate model, in its use of gender insights, is providing additional value in the identification of players at possible harm.

Nonetheless, as mentioned, this reduction in gender disparity by TPR comes at the cost of accuracy and TNR. The OOB error is higher for men, who form the smaller sample of the two confirmed genders (0.1452 vs 0.1361 for women). TNR decreases from 96.7% to 95.3% for women, from 98.1% to 91.9% for men and from 97.2% to 94.7% for unspecified gender.

This may be an acceptable loss of performance in exchange for reduced gender disparity, given the gambling industry focus on the precautionary principle, but would require exploration with sector stakeholders. It is also possible that a larger exercise may result in model choices that entail other forms of compromise: one disadvantage of the blind-separate model is that it reduces the sample size available for training in each gender group. Improvements might be expected with larger training data sets and the application of data set balancing techniques (in Playtech's deployed algorithms, the SMOTE technique is used to generate balanced data; it is not used here - see Percy et al (2016) for details).
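A minimal sketch of the blind-separate ensemble, under the same assumptions as the earlier sketches (hypothetical `players` table and `behaviour_cols` list), showing per-group training and the precautionary "any model flags" rule at deployment:

```python
# Sketch of the 'blind-separate' ensemble: one model per declared-gender group
# at training time, combined with a precautionary OR rule at deployment so that
# no gender input is needed once the ensemble is live. Reuses the hypothetical
# `players` table and `behaviour_cols` list from the earlier sketches.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

group_models = {}
for g in ("M", "F", "U"):
    subset = players[players["gender"] == g]
    model = RandomForestClassifier(max_depth=10, random_state=0)
    model.fit(subset[behaviour_cols], subset["self_excluded"])
    group_models[g] = model

def flag_at_risk(new_players):
    """Flag a player if ANY group model predicts likely self-exclusion
    (precautionary principle); declared gender is not required at this point."""
    votes = np.column_stack([m.predict(new_players[behaviour_cols])
                             for m in group_models.values()])
    return votes.any(axis=1)

print("Share of players flagged:", flag_at_risk(players).mean())
```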
===Lessons Learned, Conclusion and Future Work===
The purpose of this paper has been to report work by Playtech, a provider of B2B and B2C gambling services, to investigate the role of gender in its gambling harm identification algorithms. We identify five key lessons learned to date as part of an ongoing project to improve practice:
• The diversity of ML use cases, data sets and stakeholder priorities is such that there is no single stance on what algorithmic fairness should be prioritised or how it should be enhanced. For the two models in the gambling harm identification use cases, we have found negligible levels of indirect gender identification in gender-blind models. Focusing on gender disparities in true positive rates, we found a meaningful disparity in one model but not the other. The same technique that reduced gender disparity on the target model would have increased it on the other model in the other data set, so we should not assume consistency from one context to another.
• Analysis of bias requires investing resources in the definition and defence of unbiased benchmarks and the specification of a tolerance threshold. Since bias can exist either above or below any given benchmark, random variation makes it impossible to achieve an exact ongoing fit. The margin of error which can be tolerated depends on what stakeholders find material and worth the apportionment of resources, and on the level of variation in the values of a protected variable as measured over time and over different data set samples.
• Exercises to improve algorithmic fairness need to be incorporated into overall business priorities, most likely engaging appropriately balanced stakeholder groups, rather than treated as a separable analytical exercise. This is both because judgement calls need to be made as part of the exercise and because adjusting practice based on insight may require the balancing of multiple objectives, some of which may be competing objectives.
• Indirect discrimination needs to be analysed as a feature of a specific model rather than a feature of the data set. For instance, a target variable such as gender may be mapped in diverse ways against other variables in the data set depending on the complexity of the model (e.g. linear, polynomial, interaction-dependent, integer/decimal structure, etc). Indirect discrimination is driven by whether your model can exploit a particular pattern, rather than by other patterns that might exist.
• Any analysis of fairness is inevitably limited, both because of changing expectations and the potential breadth of the topic. As such, it is important to treat it as a process rather than a one-off exercise and to recognise the limits in any one exercise. In this initial exploratory work, for instance, it is unclear what biases a voluntary provision of gender data might introduce. Gender bias is also likely to exist elsewhere in the technical and cultural institutions surrounding gambling, the self-identification of problem gambling, and the socio-economic system in which gambling is embedded; it is unclear how such biases might influence training data and the resulting AI algorithms. More specifically, this exploratory analysis has focused on a sample of regular players and two operators, and it may not be reflective of early-stage players or players with other operators.

Concerning future work in the gambling sector, our next step is convening a working group to apply this analysis at a larger scale and to discuss compromises among competing objectives. Such a group might comprise domain experts (e.g. ML experts and data scientists, legal counsel, experts in the target variable, experts in the use case), managers and external representatives who provide challenge and validity as part of the overall exercise, ensuring representation of individuals from different groups in the target socio-demographic variables. In doing so, the objective is to improve safer gambling outcomes across all cohorts, and the scope can be expanded to include the design and evaluation of industry-level interventions as well as risk identification algorithms.

On the ethics of AI more generally, we shall develop a general framework out of our approach to investigating algorithmic fairness in other use cases in the sector, supported by a taxonomy of the diverse techniques available to improve fairness. We invite comment, engagement and challenge on this paper as part of the broader project to improve practice and to develop relevant and industry-specific AI principles.
===References===
Advisory Committee on Equal Opportunities for Women and Men for the European Commission. (2020). Opinion on Artificial Intelligence - opportunities and challenges for gender equality (published 18 March 2020).

Agarwal, Alekh; Beygelzimer, Alina; Dudík, Miroslav; Langford, John and Wallach, Hanna. A Reductions Approach to Fair Classification. 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, July 2018.

Baggio, S., Gainsbury, S., Starcevic, V., Richard, J., Beck, F., Billieux, J. (2018). Gender differences in gambling preferences and problem gambling: a network-level analysis. International Gambling Studies, 18:3, 512-525.

Bolukbasi, T., Chang, K., Zou, J., Saligrama, V., Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. Available via arXiv:1607.06520v1 [cs.CL], 21 Jul 2016.

CDEI. (2020). AI Barometer Report: June 2020. London: Centre for Data Ethics and Innovation, UK.

Choi, YooJung; Farnadi, Golnoosh; Babaki, Behrouz and Broeck, Guy Van den. Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns. In Proc. AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, February 2020.

Dragicevic, S., Garcez, A., Percy, C., Sarkar, S. (2019). Understanding the Risk Profile of Gambling Behaviour through Machine Learning Predictive Modelling and Explanation. KR2ML 2019, Workshop at the 33rd NeurIPS Conference, Vancouver, Canada, December 2019 (available via https://kr2ml.github.io/2019/papers/).

Dwork, Cynthia; Hardt, Moritz; Pitassi, Toniann; Reingold, Omer and Zemel, Richard. Fairness through awareness. Innovations in Theoretical Computer Science Conference, ITCS 2012, MIT CSAIL, Cambridge MA, January 2012.

Friedler, Sorelle A.; Choudhary, Sonam; Scheidegger, Carlos; Hamilton, Evan P.; Venkatasubramanian, Suresh and Roth, Derek. A Comparative Study of Fairness-Enhancing Interventions in Machine Learning. In Proc. 2019 ACM Conference on Fairness, Accountability and Transparency, Atlanta, GA, January 2019.

Gambling Commission (2018). Participation in gambling and rates of problem gambling – England 2016: Statistical report. Birmingham, GC, UK.

Goodman, B. (2016). A Step Towards Accountable Algorithms? Algorithmic Discrimination and the European Union General Data Protection. 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 2016.

Kleinberg, J., Ludwig, J., Mullainathan, S., Rambachan, A. (2018). Advances in big data research in economics: Algorithmic fairness. AEA Papers and Proceedings 2018, 108: 22–27. https://doi.org/10.1257/pandp.20181018.

Lundberg, S., Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, December 2017.

McCarthy, S., Thomas, S.L., Bellringer, M.E. et al. (2019). Women and gambling-related harm: a narrative literature review and implications for research, policy, and practice. BMC Harm Reduction Journal, 16-18, 2019.

Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina and Galstyan, Aram. A Survey on Bias and Fairness in Machine Learning. KR2ML Workshop at the NeurIPS 2019 Conference, Vancouver, Canada, December 2019 (available via https://kr2ml.github.io/2019/papers/).

Moerel, L. (2018). Algorithms can reduce discrimination, but only with proper data. Published 16 Nov 2018 by IAPP.

Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 25 Oct 2019: 447-453. https://science.sciencemag.org/content/366/6464/447.

Percy, C., França, M., Dragičević, S., Garcez, A. (2016). Predicting online gambling self-exclusion: an analysis of the performance of supervised machine learning models. International Gambling Studies, 2016.

Sarkar, S., Weyde, T., Garcez, A., Slabaugh, G., Dragicevic, S., Percy, C. (2016). Accuracy and interpretability trade-offs in machine learning applied to safer gambling. CEUR Workshop Proceedings, 1773, Dec. Available via http://ceur-ws.org/Vol-1773/CoCoNIPS 2016 paper10.pdf.

Stenstrom, E., Saad, G. (2011). Testosterone, financial risk-taking, and pathological gambling. Journal of Neuroscience, Psychology, and Economics, 4(4), 254–266.

Su, W., Han, X., Yu, H., Wu, Y., Potenza, M. (2020). Do men become addicted to internet gaming and women to social media? A meta-analysis examining gender-related differences in specific internet addiction. Computers in Human Behavior, Volume 113, 2020.

Suresh, H., Guttag, J. (2020). A Framework for Understanding Unintended Consequences of Machine Learning. Available via arXiv:1901.10002v3 [cs.LG], 2020.

Venne, D., Mazar, A., Volberg, R. (2019). Gender and Gambling Behaviors: A Comprehensive Analysis of (Dis)Similarities. Int J Ment Health Addiction, 2019.

White, A., Garcez, A. (2020). Measurable Counterfactual Local Explanations for Any Classifier. In Proc. 24th European Conference on Artificial Intelligence, ECAI 2020, Santiago de Compostela, Spain, Aug 2020.

Wong, G., Zane, N., Saw, A., Chan, A. K. (2013). Examining gender differences for gambling engagement and gambling problems among emerging adults. Journal of Gambling Studies, 29(2), 171–189.

Zemel, Richard; Wu, Yu; Swersky, Kevin; Pitassi, Toniann and Dwork, Cynthia. Learning Fair Representations. 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, June 2013.