                       Lessons Learned from Problem Gambling Classification:
                          Indirect Discrimination and Algorithmic Fairness∗

                Christian Percy 1, Artur d'Avila Garcez 2, Simo Dragicevic 1, Sanjoy Sarkar 1
                                              1 Playtech Plc
                                      2 City, University of London
             chris@cspres.co.uk, a.garcez@city.ac.uk, Simo.Dragicevic@playtech.com, Sanjoy.Sarkar@playtech.com

                             Abstract

  Problem gambling is a public health issue with approximately 300,000 individuals suffering harm in England and 1.5 million at risk. Many gambling operators rely on Machine Learning (ML) algorithms to identify online players at risk. Models are typically gender-blind (gender not included as an input), reflecting the sensitivity of protected characteristic data. However, some stakeholders worry that gender continues to influence the model via other variables (indirect identification) and worry about differential model performance by gender (algorithmic fairness). In this paper, we investigate these concerns using real-world data from 22,500 players across two gambling operators. We propose a method for testing the indirect identification of a protected variable. We identify near-zero levels of indirect identification of gender. Regarding algorithmic fairness, a slight pro-female bias is found in the first ML model and a moderate pro-female bias is found in the second ML model. The challenge is to mitigate such bias without the intrusion of compulsory gender data collection. We propose a new approach which uses gender data for training only, constructing separate models for each gender and combining the trained models into an ensemble that does not require gender data once deployed. Since harm identification adopts a precautionary principle, if any one model indicates potential harm, the player is flagged as at risk. This approach is shown to reduce the difference in the True Positive Rate (TPR) of the models between genders from 7.2 percentage points to 4.0 percentage points. This is shown to be better than what can be achieved by simply altering the models' classification thresholds. Both the indirect identification and the algorithmic fairness approaches are part of a wider framework and taxonomy being proposed towards the ethical use of Artificial Intelligence (AI) in the gambling industry.

∗ The authors would like to thank Charmaine Hogan and Lauren Iannarone at Playtech for their constructive challenge and feedback.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

                          Introduction

Problem gambling is a public health issue with approximately 300,000 individuals suffering harm in England and 1.5 million at risk. Many gambling operators rely on Machine Learning (ML) algorithms to identify online players at risk. Models are typically gender-blind (gender not included as an input), reflecting the sensitivity of protected variables. However, protected variables may continue to influence an ML model outcome by proxy (via other variables) in ways that can make the identification of bias even harder, and make the bias correction towards algorithmic fairness impossible.
   In this paper, we report the lessons learned from work by Playtech plc, a provider of B2B and B2C gambling software services, investigating the role of gender in Playtech's gambling harm identification algorithms. The paper connects cross-sector concerns around algorithmic fairness with the specific public health work on problem gambling mitigation. In online gambling, there is a need to balance harm protection against e-consumers' desires and rights to privacy, with a default that minimises the use of sensitive data. As such, online gambling is a relevant and challenging test case for exploring issues around algorithmic bias. The main contributions of this paper are:
• A technique for identifying when indirect discrimination exists in ML models where a potentially problematic variable has been dropped ('model-matched indirect identification').
• An approach for incorporating insight based on protected variables into the classification algorithms without requiring such data to be collected compulsorily from individuals ('blind-separate models').
• The evaluation of the above technique and approach using gender in a real-world use case in the gambling sector, which reveals near-zero indirect identification and yet the potential to improve algorithmic fairness, as defined by a reduction in the difference between the model's true positive rates for men and women.
   We see this work as part of a broader project, both within the gambling sector and on the ethics of AI more generally. In the gambling sector, this paper offers the basis for the formation of a sector-wide working group to study algorithmic fairness and to consider how to address multiple objectives, such as overall true positive rates versus true negative rates, model performance disparities by key customer groups, model aggregation policies, parsimony and transferability.
   On the ethics of AI more generally, a framework and taxonomy are being developed, which include the concerns of algorithmic fairness and bias identification and other use
cases to improve fairness. Our approach recognises that the diversity and complexity of ML models and data sets, use cases and stakeholders' priorities are such that no single technique can be recommended universally, but also that certain principles and a framework should be developed and applied both to specific use cases and across the gambling industry, given the importance of the issues being discussed1.
   The remainder of the paper is organised as follows: in Section 2, we position the paper in the context of the related work; in Section 3, we describe the proposed technique and fairness approach within the problem gambling use case; in Section 4, we evaluate the influence of gender on indirect identification and options to enhance algorithmic fairness; in Section 5 we conclude and discuss directions for future work.

                         Related Work

Algorithms for problem gambling harm reduction
Problem gambling is a public health issue with 300,000 estimated individuals in England self-reporting as experiencing harm (0.7% prevalence, Gambling Commission, 2018) and a further 1.5 million thought to be at risk. Sector organisations take a range of steps to mitigate, identify and intervene to reduce gambling-related harm, including the development of Machine Learning algorithms to identify players at risk of harm. Some examples of early interventions that operators might take, having identified someone above a certain threshold of risk or possible harm, include tailored responsible gambling messages or reduced marketing activity.
   Playtech plc has an in-house suite of ML algorithms trained to identify players with similar characteristics to players who have self-identified as experiencing harm - see Percy et al (2016) for background on these supervised ML models. Several of Playtech's operations fall under the purview of the European Union's General Data Protection Regulation (GDPR; officially adopted in April 2016). Adopting a precautionary interpretation of GDPR principles of data minimisation and protections for special category data, the default decision on algorithms implemented by Playtech was not to incorporate the player's gender, achieving typical cross-validation AUROC rates of 95%+ on balanced data sets by using behaviour and transaction data alone.
   However, the adoption of gender-blind algorithms is under review. Regulatory advice from the UK Gambling Commission in 2018 suggests that demographic data can be used as part of satisfying regulatory requirements2. There is also increasing awareness of the role of gender in problem gambling. A UK gambling charity has reported that the rate of problem gambling amongst women increased by a third in the five years to 2019, a faster rate of increase than among men in the same period (15%)3. Therefore, as the industry becomes increasingly reliant on ML algorithms to detect problematic play, the changes in the demographic profiles of problem gamblers raise questions on the suitability of historic gender-blind data sets, their role in training models, and their potential impact on model efficacy. This can be seen as part of a broader trend arguing that the traditional gender-blind approach in gambling research is inappropriate (Baggio et al, 2018) and implicitly male-biased to the detriment of female gamblers (McCarthy et al, 2019; Venne et al, 2019).
   The research reported in this paper was initiated to address two possible stakeholder concerns that point towards opposite modelling responses. The first is that gender remains an (unwanted) influence on the gender-blind model via its indirect associations with other variables (indirect discrimination). Here, the goal is to remove as much of this influence as possible. The second concern is whether there is a missed opportunity for using gender data in a way that enhances algorithmic performance and fairness by identifying and mitigating differences in model performance by gender group (algorithmic fairness). The first concern is motivated directly by an awareness of the sensitivity of gender data, both to consumers and in legislation (GDPR). The second concern reflects an awareness of average structural differences between men and women that may be relevant for predicting gambling risk. For instance, research has related testosterone levels to risk-taking and pathological gambling (Stenstrom and Saad, 2011), identified gendered behavioural patterns in gambling problems (Wong et al, 2013), and observed gendered patterns in the types of online behaviour that can be addictive (Su et al, 2020).

Bias in AI algorithms and mitigation
Concerns about bias in AI algorithms in relation to socio-demographic traits have now become widespread. The second of Google's seven principles for AI is to avoid creating or reinforcing unfair bias4. Organisations and researchers are responding to this concern in different ways, which can be grouped based on whether they seek to intervene at the input level, at the model level or at the output level.
   At the input level, one approach adopted, already discussed, is to exclude the variable corresponding to the socio-demographic trait in question. For instance, Goldman Sachs in its operation of Apple Card deliberately avoids collecting and using data on sensitive characteristics such as gender, race or age, using this approach to defend against concerns of gender bias5.

1 We seek to respond to calls by AI researchers, such as Goodman (2016) of the Oxford Internet Institute, to develop frameworks for so-called 'algorithm audits', and by sector bodies, such as the EC's Advisory Committee on Equal Opportunities for Women and Men (2020), which recommends monitoring of algorithms for discrimination and calls for further work to develop and share good practices. The UK Government's CDEI further describes a lack of clear regulatory standards and quality assurance (e.g. around algorithmic bias) as one of the five key trust-related barriers holding back AI (CDEI, 2020:4).
2 https://www.gamblingcommission.gov.uk/for-gambling-businesses/Compliance/General-compliance/General-Data-Protection-Regulation-GDPR.aspx
3 www.telegraph.co.uk/news/2020/01/15/female-gambling-addicts-growing-faster-men-amid-rise-online (accessed August 2020)
4 https://ai.google/principles/ (accessed August 2020; published 2018 at https://www.blog.google/technology/ai/ai-principles/)
However, this practice typically proves an insufficient defence in the face of evidence of gender bias in the outcomes - New York's Department of Financial Services opened an investigation into Apple Card in late 2019 given different credit limits provided to men and women despite apparently similar financial circumstances6. In another example, analysis by Obermeyer et al (2019) revealed that the use, in a widely-used commercial algorithm, of health costs as a proxy for healthcare needs resulted in anti-Black racial bias; the authors recommend removing health costs as an input variable serving as a proxy for needs. Another approach is to increase the availability and diversity of training data relating to the input variable in question, which was part of Microsoft's 2018 strategy for reducing the error rate discrepancy between men and women and between lighter skin tones and darker skin tones in its image classification tool Face API7.
   At the model level, algorithms or their parameters can be adjusted to reduce the extent to which a model draws on certain patterns in the input data. One example of this is the gender-debiasing techniques developed for word embedding solutions (Bolukbasi et al, 2016), noting that the authors describe a mixture of adjusting inputs and model-level adjustments.
   At the output level, Moerel (2018) describes LinkedIn's recruitment tool as a way of enforcing quotas using the rankings produced by an algorithm in order to match a pre-defined desirable ratio. The tool can subdivide candidates by gender, rank each candidate within each gender using its algorithm and then put forward an equal number of men and women to the hiring manager for consideration.
   Some of the techniques above have come under challenge. For instance, the simplistic approach of dropping socio-demographic input variables (blinding an algorithm) has come under challenge for inadvertently distracting from fairness by reducing visibility of the issue, by ignoring possible proxy variables for socio-demographic traits and by ignoring opportunities to implement other solutions - see, e.g., the analysis of US college admissions by Kleinberg et al (2018), which argues for data-led proactive intervention at the output level.
   Focusing particularly on identifying output-level bias, new tools are being developed to identify whether ML algorithms are biased in terms of having systematically worse performance (e.g. lower accuracy) for particular groups. Facebook announced the testing of an internal tool to do this in 2018, Fairness Flow, which was discussed further in its July 2020 Civil Rights Report as part of efforts to tackle algorithmic discrimination8. Google's open-source What If Tool in TensorBoard, launched in 2018, helps ML developers to visualise differences in classification from key variables, identify borderline cases for particular classifications and explore the impact of counterfactuals as part of assessing whether an inappropriate social bias might have been absorbed from the training data or otherwise reflected in the model9. The use of counterfactuals for explainable AI, e.g. White and Garcez (2020), has become increasingly associated with the goals of fairness in ML. Various other methods addressing fairness have been proposed recently, each adopting its own measure of fairness. Notably, Dwork et al (2012) introduce a framework for fair classification that maximises utility subject to a fairness constraint defined by a task-specific metric. Agarwal et al (2018) propose a cost-sensitive classifier, also optimising a specific loss function subject to fairness constraints; results are evaluated empirically on a variety of data sets. Choi et al (2019) focus on a specific family of classifiers, naive Bayes, and introduce the notion of a discrimination pattern alongside an algorithm for mining discrimination patterns in a naive Bayes classifier. The approach is iterative and seeks to eliminate such patterns until a fair model is obtained. An overview of the various notions of fairness can be found in Zemel et al (2013) and Dwork et al (2012). More comprehensive surveys are available in Friedler et al (2019) and Mehrabi et al (2019).

5 Reported by Kevin Peachey for BBC News on 18 November 2019, Sexist and biased? How credit firms make decisions. https://www.bbc.co.uk/news/business-50432634
6 11 Nov 2019, Apple's sexist credit card investigated by US regulator. BBC News, https://www.bbc.co.uk/news/business-50365609
7 https://blogs.microsoft.com/ai/gender-skin-tone-facial-recognition-improvement/
8 https://about.fb.com/wp-content/uploads/2020/07/Civil-Rights-Audit-Final-Report.pdf
9 https://ai.googleblog.com/2018/09/the-what-if-tool-code-free-probing-of.html

                   Problem Gambling Use Case

Data available
We work with two real-world data sets used to train the Random Forest harm prediction algorithms currently deployed by Playtech. The two gambling operators have different brands (one Bingo-focused and one Slot-Machine-focused), which will help demonstrate the diversity of circumstances even in a narrow ML domain.
   The binary classification algorithms use, as the target, whether players voluntarily self-excluded from the gambling platform during the analysis period, an approximate proxy for experiencing harm. Only regular players who have been live on the platform for at least 1-2 months are included in the training data sets, given the focus of the algorithm on regular players. The open source Weka tool was used to replicate a comparable model to the deployed model with the same (approx. 40) behavioural input variables. A Random Forest model was trained on the raw, unbalanced training data, resulting in accuracy (at the maximum-accuracy point of the ROC curve) for the two trained models within 1%pt of the deployed models. The two trained models (one for each operator) that are generated by this process are referred to in this paper as the baseline models.
   These training data sets are enriched for the purpose of this study with gender data voluntarily supplied by players during the sign-up process. The gender variable can take three values: male (M), female (F) or unspecified/undeclared (U). Caveats remain with the quality of the available gender
data, including possible gender bias in this voluntary supply of self-identification data, as well as possible simplifications and distortions in having only two explicit categories of gender for players to select. Table 1 includes the summary descriptive data.
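For illustration, the following is a minimal sketch of how a training set and baseline model of this kind might be assembled, assuming pandas and scikit-learn rather than the Weka pipeline actually used; the column names, the exact tenure cut-off and the forest size (other than the depth-10 trees mentioned later) are assumptions, not the operators' specifications.

```python
# Sketch only: assembling a training set and baseline model as described above.
# Column names ("months_on_platform", "self_excluded_in_period", "gender") and
# hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def build_training_set(players: pd.DataFrame) -> pd.DataFrame:
    # Regular players only: live on the platform for at least 1-2 months.
    regular = players[players["months_on_platform"] >= 2].copy()
    # Label: voluntary self-exclusion during the analysis period,
    # used as an approximate proxy for experiencing harm.
    regular["self_excluded"] = regular["self_excluded_in_period"].astype(int)
    return regular

def train_baseline(train_df: pd.DataFrame, feature_cols) -> RandomForestClassifier:
    # Trained on the raw, unbalanced data, mirroring the baseline models.
    model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
    model.fit(train_df[feature_cols], train_df["self_excluded"])
    return model
```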
Bias definition
We identify three initial areas of analysis where metrics can usefully be analysed by gender: the gender balance in the training data, the self-exclusion rates in the training data, and the performance of the models for separate genders. All three sets of metrics are identified for reporting purposes, but only the model performance data is proposed as a metric for assessing potentially problematic bias. Despite government-commissioned population surveys providing more detail by gender, population-wide surveys cannot be related to the gender ratios in an individual operator platform, as the customer bases attracted by a particular brand are not representative of the overall gambling community.
   Focusing on model performance, we are interested in a model that performs similarly well for each gender in terms of true positives (as a proxy for spotting those who are likely to be at risk) and true negatives (reducing any disruption or false alerts for players unlikely to be at risk). Since negative examples dominate in all populations and given the precautionary emphasis on identifying possible harm, the True Positive Rate (TPR) is the chosen primary performance metric for the comparisons used in this paper. Given that we would not expect exact equality of performance by gender even in a perfectly fair algorithm, we also identify a tolerance threshold beyond which model performance is considered sufficiently unequal that it should prompt action. For the exploratory purposes of this paper, we use a 2%pt difference in TPR performance among gender categories as such a threshold, noting that such a threshold must ultimately be informed by stakeholder consensus.
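As a concrete illustration of this check, the sketch below computes the TPR for each gender group and flags a breach of the tolerance threshold. It assumes Python and pandas rather than the Weka-based pipeline used in this work, and the column names are illustrative.

```python
# Sketch: TPR by gender group with a tolerance-threshold check.
# Assumes a DataFrame with illustrative columns "gender" (M/F/U),
# "self_excluded" (true label, 0/1) and "predicted_risk" (model output, 0/1).
import pandas as pd

TOLERANCE = 0.02  # the exploratory 2 percentage-point threshold used here

def tpr_by_gender(scored: pd.DataFrame) -> pd.Series:
    # TPR = share of actual self-excluders correctly flagged, per gender group.
    positives = scored[scored["self_excluded"] == 1]
    return positives.groupby("gender")["predicted_risk"].mean()

def exceeds_tolerance(tpr: pd.Series, tolerance: float = TOLERANCE) -> bool:
    # Flag if the gap between the best- and worst-served group is too large.
    return (tpr.max() - tpr.min()) > tolerance
```

A breach of the threshold would prompt the kind of mitigation exercise described in the following sections.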
                 Experimental Results

Assessment of the influence of gender on the model (indirect discrimination)
Since gender is not included in the original algorithm, there is no potential for direct use of gender in the model. However, gender may still be indirectly identified in the model via the correlations between other input variables and gender, or via other patterns in the data.
   The standard initial investigation of relationships between variables is the correlation coefficient. For Operator 2, 8 of the 40 input variables have a correlation coefficient with male-reported gender that is statistically significant at the 5% Bonferroni-adjusted level or better, and 5 of the 40 input variables have a statistically significant correlation with female-reported gender. However, this pattern is near-trivial by nature, in that the statistical significance reflects the large sample size rather than the meaningfulness of the co-variance. The r-squared from a linear regression using all statistically significant variables reveals that such variables only explain 0.6% of the linear variation in the male-reported gender dummy variable (RMSE of 0.30, with RMSE across five-fold cross-validation varying from 0.30 to 0.31) and 1.1% of the linear variation in the female-reported gender dummy variable (RMSE of 0.48, with an RMSE of 0.48 in each of the five folds too). For Operator 1, there are no such statistically significant variables for the male-reported gender dummy and only two for the female-reported gender dummy, accounting for 2.2% of the linear variation (RMSE of 0.40, varying from 0.39 to 0.41 across five folds).
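The linear screening just described can be expressed compactly; the following is a sketch under the assumption of a pandas data frame, SciPy and scikit-learn rather than the statistical tooling actually used, with illustrative column names.

```python
# Sketch: linear screening for indirect identification of gender.
# Bonferroni-adjusted correlation tests per input variable, followed by a
# regression of the gender dummy on the significant variables (R^2 and
# five-fold cross-validated RMSE). Illustrative only.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def linear_screen(df: pd.DataFrame, feature_cols, gender_value="M", alpha=0.05):
    y = (df["gender"] == gender_value).astype(float)   # gender dummy
    threshold = alpha / len(feature_cols)               # Bonferroni adjustment
    significant = [c for c in feature_cols if pearsonr(df[c], y)[1] < threshold]
    if not significant:
        return significant, 0.0, None
    reg = LinearRegression().fit(df[significant], y)
    r2 = reg.score(df[significant], y)
    rmse_folds = np.sqrt(-cross_val_score(LinearRegression(), df[significant], y,
                                          scoring="neg_mean_squared_error", cv=5))
    return significant, r2, rmse_folds
```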
   The correlation coefficient only identifies linear relationships, whereas the model in question is a Random Forest of depth 10, which is likely to identify non-linear patterns. While many common polynomial relationships in real-world gambling data might still be hinted at in a significant linear relationship, other relevant patterns would not be. For instance, the Random Forest models used in this research were seen to identify relationships based on common values after the decimal point in a data set in which linear variation with gender had been removed by decomposition. Subtracting the average value of a particular variable for each gender artificially results in zero linear correlation, but can leave such gender-driven patterns in the values after the decimal point, which remain exploitable by a Random Forest model.
   Instead of linear correlation coefficients as a generic technique, we propose identifying the maximum possible level of indirect identification in a model-dependent manner, using a model with the same structure and parametrization as the original baseline model, an approach we call "model-matched indirect identification".
   If the model were linear, with no interaction terms, bivariate linear correlations would reflect the model structure and would be appropriate to capture possible indirect identification. In our case, we train a new model using the same ML method (Random Forests) with the same model parameter selection as the baseline model and the same set of predictor variables, but this time using gender as the target classification variable. The target variable from the baseline model, self-exclusion, does not appear in this new model.
   The accuracy of this new model is seen as a bound on how well gender can be indirectly identified in the baseline model, since the new model is optimised to predict gender explicitly, whereas the prediction of gender would only have been an indirect goal10 of the baseline model, which is optimised to predict self-exclusion only. By using the same parametrization, we seek to find an upper bound for the given use case (i.e. model + data) and to avoid the ambiguity of a possibly exponential variety of implicit interaction terms.
   The gender-classification models produced by this approach have an out-of-bag (OOB) error11 for Operator 1 of 0.5205 and for Operator 2 of 0.4637. Collectively, this suggests that there is little indirect identification of gender beyond a random guess based on the majority class.

10 Motivated only insofar as implicitly predicting gender midway through the model may later help to predict self-exclusion.
11 Metric generated internally by Weka's Random Forest algorithm. It is equivalent to validation-set performance measured on a fold from cross-validation, in that the RF algorithm deliberately excludes a set of observations in the construction of each tree. The OOB error measures the classification error rate for such excluded observations, taking the majority classification for each observation that has been excluded from various trees.
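To make the procedure concrete, the following is a minimal sketch of the model-matched indirect identification test described above, assuming a scikit-learn RandomForestClassifier in place of the Weka implementation used here; the parameter dictionary and column names are illustrative assumptions, with the parameters intended to match the baseline self-exclusion model.

```python
# Sketch: model-matched indirect identification. A classifier with the same
# structure and parametrization as the baseline model is retrained with gender
# as the target; its out-of-bag (OOB) error bounds how well gender could be
# indirectly identified from the behavioural inputs. Illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def indirect_identification_bound(df: pd.DataFrame, feature_cols, baseline_params):
    X, y = df[feature_cols], df["gender"]              # gender (M/F/U) as target
    clf = RandomForestClassifier(oob_score=True, **baseline_params)
    clf.fit(X, y)
    oob_error = 1.0 - clf.oob_score_
    majority_error = 1.0 - y.value_counts(normalize=True).max()
    return oob_error, majority_error

# Example: baseline_params mirroring the baseline model, e.g.
# dict(n_estimators=100, max_depth=10). Little indirect identification is
# indicated when oob_error is close to majority_error (OOB errors of roughly
# 0.52 and 0.46 were observed for the two operators here).
```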
Assessment of performance bias in baseline models (algorithmic fairness)
Table 1 reveals that male players outnumber female players on the slots-focused brand (1.6x prevalence) and are outnumbered on the bingo-focused brand (3.5x prevalence). In both cases, undeclared gender is the most common group. Men tend to see higher levels of self-exclusion than women.

 Possible metric by gender (F/M/U)     Operator 1 (slots-focused brand, n = 4,340)   Operator 2 (Bingo-focused brand, n = 18,275)
 Gender balance in training data             F: 20.6% M: 32.6% U: 46.8%                     F: 36.5% M: 10.4% U: 53.1%
 Self-exclusion outcome                      F: 20.4% M: 24.4% U: 16.8%                     F: 17.1% M: 18.7% U: 22.3%
 Baseline RF model TPR                       F: 67.0% M: 65.3% U: 66.5%                     F: 53.7% M: 46.5% U: 52.9%
 Baseline RF model TNR                       F: 94.4% M: 95.1% U: 95.0%                     F: 96.7% M: 98.1% U: 97.2%
 Baseline RF model accuracy                  F: 88.8% M: 87.9% U: 90.2%                     F: 89.3% M: 88.5% U: 87.4%

                                  Table 1: Gender metrics from two gambling operators.

   For Operator 1, there is little clear distinction in model performance by gender. The model is slightly better, based on TPR, at identifying women at risk than men, but stays within the 2%pt tolerance threshold. However, for Operator 2 there is markedly higher model performance among female players than male players, with a much higher TPR (+7.2%pts) and slightly higher overall accuracy (+0.8%pts). This gender delta in TPR is higher than the specified 2%pt threshold, prompting an exercise to see how it might be mitigated, as follows.

Options to enhance algorithmic fairness
In the online gambling use case, similar to other e-retail use cases, there is a strong sector preference for not compelling users to share sensitive data in order to use the services, both recognising the potential intrusiveness of such questions and the ease with which they can be inaccurately answered by those who would prefer not to be asked. For this reason, gender is a voluntary data point shared by players.
   We test two mitigation methods for Operator 2 that do not require compulsory gender data: first, the inclusion of gender as an additional input variable (allowing Unspecified (U) to be one of its values); secondly, a proposed ensemble method which is gender-blind at deployment and which aggregates multiple gender-separated models to form an overall view of a player's risk. Naturally, if accurate gender data were assumed available for all players, other methods would exist for reducing performance bias, provided stakeholders tolerate modelling structural differences by gender. For instance, separate classification thresholds could be set for men and women (potentially as part of gender-separated models), thus weighting false positives differently by gender, or output quotas could be set such that the top X highest-scoring male players and top Y highest-scoring female players are classified as at risk to meet a benchmark quota (potentially balanced against a decision rule that does not allow the quota to apply below a certain classification probability or to clash with the above-mentioned precautionary approach).
   The first option above proved ineffective. Gender has little impact on the model: the original 0.1412 OOB error worsens marginally to 0.1440 with gender included. Male gender ranks 39 out of 42 input variables in terms of feature frequency in the Random Forest model, and female gender ranks 38 out of 42. The gender delta on TPR improves to 4.9%pts (reduced from 7.2%pts), but only with a worse TPR performance among women and with no improvement among men.
   In the second option, the blind-separate model, we train three separate models for confirmed male players, confirmed female players and gender-unspecified or undisclosed players. If any one model identifies a player as a likely self-excluder, the player is predicted to be at possible risk, reflecting the precautionary approach applied across many problem gambling identification strategies. As such, the overall classification approach is gendered but does not draw on gender as an explicit input variable once deployed. In this way, the opt-in privacy of customers is preserved without a loss of access by customers to the best performing algorithms. A loss of access might happen if, for instance, one model were trained with gender data while another (less accurate) model were trained without gender data, with the latter model used whenever a customer chooses not to share gender data. In our approach, gender data is only required for a sample of the players, which might be developed from voluntarily provided data (as done here) or via an ad-hoc collection for the sole purpose of such a model. This approach reduces the gender disparity in TPR, but at the cost of the true negative rate (TNR) and accuracy. Male TPR increases from 46.5% to 54.7%, reducing the gender delta from 7.2%pts to 4.0%pts. It is important to note that this improvement in TPR and reduction in delta cannot be achieved by simply altering the classification thresholds in the baseline model: the delta increases to 7.3%pts in the baseline model if its classification threshold is adjusted until the male TPR matches the male TPR from the blind-separate model. This provides confidence that the blind-separate model, in its use of gender insights, is providing additional value in the identification of players at possible risk of harm.
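The following is a minimal sketch of the blind-separate ensemble logic, assuming scikit-learn rather than the Weka pipeline used in this work; gender is consumed only when partitioning the training data, and the column names are illustrative.

```python
# Sketch: "blind-separate" ensemble. Gender (M/F/U) is used only to partition
# the training data; at deployment every player is scored by all three models
# and flagged if ANY model predicts likely self-exclusion (precautionary rule).
# Illustrative only; the models in this work were built in Weka.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

class BlindSeparateEnsemble:
    def __init__(self, feature_cols, rf_params):
        self.feature_cols = feature_cols
        self.rf_params = rf_params
        self.models = {}

    def fit(self, train_df: pd.DataFrame) -> "BlindSeparateEnsemble":
        # One model per gender group present in the training sample.
        for gender, group in train_df.groupby("gender"):
            model = RandomForestClassifier(**self.rf_params)
            model.fit(group[self.feature_cols], group["self_excluded"])
            self.models[gender] = model
        return self

    def predict(self, players: pd.DataFrame) -> np.ndarray:
        # No gender required at deployment: all models score every player.
        X = players[self.feature_cols]
        flags = [m.predict(X).astype(bool) for m in self.models.values()]
        return np.vstack(flags).any(axis=0)
```

The any-model-flags aggregation encodes the precautionary principle described above; it is what allows the ensemble to remain gender-blind once deployed while still exploiting gender-specific structure learned at training time.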
   Nonetheless, as mentioned, this reduction in gender disparity in TPR comes at the cost of accuracy and TNR. The OOB error is higher for men, who form the smaller sample of the two confirmed genders (0.1452 vs 0.1361 for women). TNR decreases from 96.7% to 95.3% for women, from 98.1% to 91.9% for men and from 97.2% to 94.7% for unspecified gender.
   This may be an acceptable loss of performance in exchange for reduced gender disparity, given the gambling industry's focus on the precautionary principle, but would require exploration with sector stakeholders. It is also possible that a larger exercise may result in model choices that entail other forms of compromise: one disadvantage of the blind-separate model is that it reduces the sample size available for training in each gender group. Improvements might be expected with larger training data sets and the application of data set balancing techniques (in Playtech's deployed algorithms, the SMOTE technique is used to generate balanced data; it is not used here; see Percy et al (2016) for details).

          Lessons Learned, Conclusion and Future Work

The purpose of this paper has been to report work by Playtech, a provider of B2B and B2C gambling services, to investigate the role of gender in its gambling harm identification algorithms. We identify five key lessons learned to


date as part of an ongoing project to improve practice:
• The diversity of ML use cases, data sets and stakeholder priorities is such that there is no single stance on what algorithmic fairness should be prioritised or how it should be enhanced. For the two models in the gambling harm identification use cases, we have found negligible levels of indirect gender identification in gender-blind models. Focusing on gender disparities in true positive rates, we found a meaningful disparity in one model but not the other. The same technique that reduced gender disparity on the target model would have increased it on the other model in the other data set, so we should not assume consistency from one context to another.
• Analysis of bias requires investing resources in the definition and defence of unbiased benchmarks and the specification of a tolerance threshold. Since bias can exist either above or below any given benchmark, random variation makes it impossible to achieve an exact ongoing fit. The margin of error which can be tolerated depends on what stakeholders find material and worth the allocation of resources, and on the level of variation in the values of a protected variable as measured over time and over different data set samples.
• Exercises to improve algorithmic fairness need to be incorporated into overall business priorities, most likely engaging appropriately balanced stakeholder groups, rather than treated as a separable analytical exercise. This is both because judgement calls need to be made as part of the exercise and because adjusting practice based on insight may require the balancing of multiple objectives, some of which may be competing objectives.
• Indirect discrimination needs to be analysed as a feature of a specific model rather than a feature of the data set. For instance, a target variable such as gender may be mapped in diverse ways against other variables in the data set depending on the complexity of the model (e.g. linear, polynomial, interaction-dependent, integer/decimal structure, etc). Indirect discrimination is driven by whether your model can exploit a particular pattern, rather than by other patterns that might exist.
• Any analysis of fairness is inevitably limited, both because of changing expectations and the potential breadth of the topic. As such, it is important to treat it as a process rather than a one-off exercise and to recognise the limits of any one exercise. In this initial exploratory work, for instance, it is unclear what biases a voluntary provision of gender data might introduce. Gender bias is also likely to exist elsewhere in the technical and cultural institutions surrounding gambling, the self-identification of problem gambling, and the socio-economic system in which gambling is embedded; it is unclear how such biases might influence training data and the resulting AI algorithms. More specifically, this exploratory analysis has focused on a sample of regular players and two operators, and it may not be reflective of early-stage players or players with other operators.
   Regarding future work in the gambling sector, our next step is convening a working group to apply this at a larger scale and to discuss compromises among competing objectives. Such a group might comprise domain experts (e.g. ML experts and data scientists, legal counsel, experts in the target variable, experts in the use case), managers and external representatives who provide challenge and validation as part of the overall exercise, ensuring representation of individuals from different groups in the target socio-demographic variables. In doing so, the objective is to improve safer gambling outcomes across all cohorts, and the scope can be expanded to include the design and evaluation of industry-level interventions as well as risk identification algorithms.
   On the ethics of AI more generally, we shall develop a general framework out of our approach to investigating algorithmic fairness in other use cases in the sector, supported by a taxonomy of the diverse techniques available to improve fairness. We invite comment, engagement and challenge on this paper as part of the broader project to improve practice and to develop relevant and industry-specific AI principles.

                          References

   Advisory Committee on Equal Opportunities for Women and Men for the European Commission. (2020). Opinion on Artificial Intelligence - opportunities and challenges for gender equality (published 18 March 2020).

   Agarwal, Alekh; Beygelzimer, Alina; Dudík, Miroslav; Langford, John and Wallach, Hanna. A Reductions Approach to Fair Classification, 35th International Conference on Machine Learning, ICML 2018, Stockholm, Sweden, July 2018.

   Baggio, S., Gainsbury, S., Starcevic, V., Richard, J., Beck, F., Billieux, J. (2018). Gender differences in gambling preferences and problem gambling: a network-level analysis, International Gambling Studies, 18:3, 512-525.
   Bolukbasi, T., Chang, K., Zou, J., Saligrama, V., Kalai, A. (2016). Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings. Available via arXiv:1607.06520v1 [cs.CL], 21 Jul 2016.

   CDEI. (2020). AI Barometer Report: June 2020. London: Centre for Data Ethics and Innovation, UK.

   Choi, YooJung; Farnadi, Golnoosh; Babaki, Behrouz and Van den Broeck, Guy. Learning Fair Naive Bayes Classifiers by Discovering and Eliminating Discrimination Patterns. In Proc. AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, February 2020.

   Dragicevic, S., Garcez, A., Percy, C., Sarkar, S. (2019). Understanding the Risk Profile of Gambling Behaviour through Machine Learning Predictive Modelling and Explanation. KR2ML 2019, Workshop at 33rd NeurIPS Conference, Vancouver, Canada, December 2019 (available via https://kr2ml.github.io/2019/papers/).

   Dwork, Cynthia; Hardt, Moritz; Pitassi, Toniann; Reingold, Omer and Zemel, Richard. Fairness through Awareness, Innovations in Theoretical Computer Science Conference, ITCS 2012, MIT CSAIL, Cambridge MA, January 2012.

   Friedler, Sorelle A.; Choudhary, Sonam; Scheidegger, Carlos; Hamilton, Evan P.; Venkatasubramanian, Suresh and Roth, Derek. A Comparative Study of Fairness-Enhancing Interventions in Machine Learning, In Proc. 2019 ACM Conference on Fairness, Accountability and Transparency, Atlanta, GA, January 2019.

   Gambling Commission (2018). Participation in gambling and rates of problem gambling – England 2016: Statistical report. Birmingham, GC, UK.

   Goodman, B. (2016). A Step Towards Accountable Algorithms? Algorithmic Discrimination and the European Union General Data Protection. 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 2016.

   Kleinberg, J., Ludwig, J., Mullainathan, S., Rambachan, A. (2018). Advances in big data research in economics: Algorithmic fairness. AEA Papers and Proceedings 2018, 108: 22–27. https://doi.org/10.1257/pandp.20181018, 2018.

   Lundberg, S., Lee, S. (2017). A Unified Approach to Interpreting Model Predictions. Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, December 2017.

   McCarthy, S., Thomas, S.L., Bellringer, M.E. et al. (2019). Women and gambling-related harm: a narrative literature review and implications for research, policy, and practice. Harm Reduction Journal (BMC), 16:18, 2019.

   Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina and Galstyan, Aram. A Survey on Bias and Fairness in Machine Learning, KR2ML Workshop at NeurIPS'19 Conference, Vancouver, Canada, December 2019 (available via https://kr2ml.github.io/2019/papers/).

   Moerel, L. (2018). Algorithms can reduce discrimination, but only with proper data. Published 16 Nov 2018 by IAPP, 2018.

   Obermeyer, Z., Powers, B., Vogeli, C., Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 25 Oct 2019: 447-453. https://science.sciencemag.org/content/366/6464/447.

   Percy, C., França, M., Dragičević, S., Garcez, A. (2016). Predicting online gambling self-exclusion: an analysis of the performance of supervised machine learning models, International Gambling Studies, 2016.

   Sarkar, S., Weyde, T., Garcez, A., Slabaugh, G., Dragicevic, S., Percy, C. (2016). Accuracy and interpretability trade-offs in machine learning applied to safer gambling. CEUR Workshop Proceedings, 1773, Dec 2016. Available via http://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper10.pdf.

   Stenstrom, E., Saad, G. (2011). Testosterone, financial risk-taking, and pathological gambling. Journal of Neuroscience, Psychology, and Economics, 4(4), 254–266.

   Su, W., Han, X., Yu, H., Wu, Y., Potenza, M. (2020). Do men become addicted to internet gaming and women to social media? A meta-analysis examining gender-related differences in specific internet addiction. Computers in Human Behavior, Volume 113, 2020.

   Suresh, H., Guttag, J. (2020). A Framework for Understanding Unintended Consequences of Machine Learning. Available via arXiv:1901.10002v3 [cs.LG], 2020.

   Venne, D., Mazar, A., Volberg, R. (2019). Gender and Gambling Behaviors: A Comprehensive Analysis of (Dis)Similarities. Int J Ment Health Addiction, 2019.

   White, A., Garcez, A. (2020). Measurable Counterfactual Local Explanations for Any Classifier. In Proc. 24th European Conference on Artificial Intelligence, ECAI 2020, Santiago de Compostela, Spain, Aug 2020.

   Wong, G., Zane, N., Saw, A., Chan, A. K. (2013). Examining gender differences for gambling engagement and gambling problems among emerging adults. Journal of Gambling Studies, 29(2), 171–189.

   Zemel, Richard; Wu, Yu; Swersky, Kevin; Pitassi, Toniann and Dwork, Cynthia. Learning Fair Representations, 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, June 2013.