Bayesian Models to Assess Risk of Corruption of Federal Management Units

Ricardo S. Carvalho(1) and Rommel N. Carvalho(1,2)
(1) Department of Research and Strategic Information, Brazilian Office of the Comptroller General, SAS, Quadra 01, Bloco A, Edificio Darcy Ribeiro, Brasília, DF, Brazil
(2) Department of Computer Science, University of Brasília, Campus Darcy Ribeiro, Brasília, DF, Brazil
{ricardo.carvalho,rommel.carvalho}@cgu.gov.br

Abstract

This paper presents a data mining project that generated Bayesian models to assess the risk of corruption of federal management units. With thousands of extracted features related to corruptibility, the data were processed using techniques such as correlation analysis and per-class variance. We also compared two discretization methods: Minimum Description Length Principle (MDLP) and Class-Attribute Contingency Coefficient (CACC). The feature selection process used Adaptive Lasso. To choose our final model we evaluated three different algorithms: Naïve Bayes, Tree Augmented Naïve Bayes, and Attribute Weighted Naïve Bayes. Finally, we analyzed the rules generated by the model in order to support knowledge discovery.

1 INTRODUCTION

Corruption is a recurrent and central subject on the Brazilian government agenda, one that demands ostensive and efficient combat. Public corruption can be defined – following Brazilian Law no. 8,429, of June 1992 (http://www.planalto.gov.br/ccivil_03/leis/l8429.htm) – as misconduct or improper use of public office that leads to illicit enrichment, causes injury to the public treasury, or infringes the principles of public administration.

The Brazilian Office of the Comptroller General (CGU), an agency within the Presidency structure, has among its competences the role of assisting the President directly and immediately on matters and measures related to preventing and fighting corruption. Through its strategic information production activities, the Department of Research and Strategic Information (DIE) is the area responsible for investigating possible irregularities involving federal civil servants working in management units.

Nowadays, there are more than thirty thousand active federal management units (dataset: http://www.tesourotransparente.gov.br/ckan/dataset/siafi-relatorio-unidades-gestoras), all subject to investigation. Due to this large number of units, DIE is mostly limited to investigating those involved in large federal operations or recurrent complaints, often restricting its activities to cases triggered externally. It is therefore important to prioritize activities based on the risk of involvement in corruption, so that DIE can act more effectively and proactively.

This work has two main objectives and contributions. The first is to build a Bayesian model to assess the risk of corruption of federal management units. To this end, we apply state-of-the-art data mining techniques, along with a practical study of the information related to corruption. We thereby aim to contribute to CGU's anti-corruption activities by building a model that is useful for prioritizing its work. The step-by-step account of this data mining project may also interest other practitioners, since it combines several different methods: we show how we applied correlation analysis and two discretization methods to process features, Adaptive Lasso for feature selection, and a comparison of three different algorithms to choose our final Bayesian model.
Hence, this work contributes to practitioners by describing the application of data mining techniques with a practical objective and a singular combination of techniques.

The second objective is knowledge discovery over the information about the corruptibility of federal management units, seeking to extract new rules in this domain. To this end, the available information on management units – as well as their direct and indirect relationships with the federal civil servants working there – is analyzed with the support of DIE's experts in fighting corruption. After building our final model, we analyzed its derived rules. We thus aim to contribute to the enrichment of the experts' knowledge in fighting corruption.

In Section 2, we discuss the works most closely related to fighting corruption and how data mining has been used there, while in Section 3 we give an overview of the information selected by DIE experts to build our models. Section 4 describes the steps taken to pre-process the data, such as correlation analysis, discretization, and feature selection. Section 5 shows how we used machine learning to build several models, and Section 6 describes our evaluation strategies. Section 7 discusses our deployment efforts, and we conclude in Section 8.

2 RELATED WORK

Observing the research areas of the last decade, a topic closely related to risk of corruption is fraud detection.
The main objective of fraud detection is to reveal trends of suspicious acts. For example, an emerging theme is the use of data mining to detect financial fraud. A review of the academic literature on this application (Ngai et al., 2011) shows its successful use in detecting credit card fraud, money laundering, and bankruptcy, among others. The review also identifies the data mining techniques commonly used in fraud detection, including Artificial Neural Networks, Decision Trees, Logistic Regression, and Naïve Bayes.

In this context, a recent survey on data mining-based fraud detection (Phua et al., 2010) summarizes and reviews the published technical articles. This survey, as well as other works (Kou et al., 2004), comments on similar applications. Also, an individual-oriented corruption analysis (Carvalho et al., 2014) built a corruption risk model for politically affiliated civil servants using algorithms such as Random Forest and Bayesian networks.

Regarding corruption itself, research on public bidding and contracting processes has also been carried out, though not as widely as fraud detection. The use of clustering and association rules on the problem of cartels in public bidding processes (Silva and Ralha, 2010) found results that corroborate the application of data mining to the prevention of corruption. Another paper (Balaniuk et al., 2012) shows the use of Naïve Bayes to evaluate the risk of corruption in public procurement; the authors applied the natural logarithm to discretize attributes and based their assessment on conditional probabilities defined by experts. In addition, a recent paper (Carvalho et al., 2013) presents the use of probabilistic ontologies to design and test a model that fuses information in order to detect possible fraud in bidding processes involving federal money in Brazil.

With respect to discretization, it has recently received much attention as a pre-processing technique, mostly because many machine learning algorithms are known to produce better models when continuous attributes are discretized (Garcia et al., 2013). Two algorithms have shown generally strong performance, namely CACC (Class-Attribute Contingency Coefficient) (Tsai et al., 2008) and MDLP (Minimum Description Length Principle) (Irani, 1993). In this work we compare these algorithms after feature selection by building models on each discretized dataset, allowing us to choose the best results.

For feature selection, a recent review (Tang et al., 2014) covers several widely used techniques, such as Adaptive Lasso (Zou, 2006). The Adaptive Lasso has basically two steps. First, an initial estimator is obtained, usually via Ridge Regression (Zou, 2006). Then an optimization problem with a weighted L1 penalty is solved. The initial estimator generally puts more penalty weight on the zero coefficients and less on the nonzero ones, improving upon its predecessor, the Lasso (Zou, 2006). Compared to the Lasso, the Adaptive Lasso enjoys the oracle property (Zou, 2006), performing as well as if the true underlying model were given in advance. Compared to the SCAD and bridge methods (Tang et al., 2014), which also have the oracle property, the advantage of the Adaptive Lasso is its computational efficiency.

3 DATA UNDERSTANDING

Seeking to analyze the corruptibility of federal management units, various databases to which DIE has access were identified as useful for this work. For a better understanding of the data, the available information was divided into four dimensions: Corruption, Employment, Sanctions, and Political.

Some of the information treated in this work concerns the federal civil servants working in the management units. This information can give an idea of how much power a given unit concentrates or how much influence its civil servants bring to the unit environment.

Due to the limited size of this paper, for each dimension we give only an overview of the existing databases and of the relevant information identified by DIE experts as possibly related to corruptibility.

3.1 CORRUPTION DIMENSION

CGU maintains the Federal Administration Registry of Expelled (CEAF, http://www.portaldatransparencia.gov.br/expulsoes/entrada), a database gathering the expulsion penalties (expulsion, retirement abrogation, and dismissal) applied to federal civil servants since 2003.

This database is used to define which management units are corrupt, i.e., the positive class for our machine learning algorithms. The first paragraph of Section 4 describes how this is done.
3.2 EMPLOYMENT DIMENSION

The employment dimension covers information about the federal civil servants who work in each management unit. It includes basic information, such as time in office and income, as well as data exposing the importance of the unit a servant works in – such as the number of coordination roles or of critical public offices, like those dealing directly with public resources or financial assets.

Most of this information comes from the Human Resources Integrated System (SIAPE) of the Brazilian Federal Government (http://www.siapenet.gov.br).

For this dimension, DIE's experts in fighting corruption selected 16 different pieces of information, which can later be transformed into 16 or more features in the data preparation phase. Examples are: mean, maximum, and minimum monthly income; number of coordination roles dealing with public contracts; and number of roles for specific activities, such as head of a regional agency.

3.3 SANCTIONS DIMENSION

The sanctions dimension covers information about management units sanctioned for bad management of public money. We used the sanctions recorded in the Accounts Judged Irregular database (CADIRREG, http://contas.tcu.gov.br/cadirreg/CadirregConsultaNome) of the Federal Court of Accounts (TCU), which judges the accounts of each management unit and decides on their regularity under Brazilian law. Similarly, we used CGU's certificates of management irregularity (http://sistemas.cgu.gov.br/relats/relatorios.php).

From these sources, DIE's experts selected four different pieces of information, which can later be transformed into four or more features in the data preparation phase. Examples are: number of accounts judged irregular by TCU, and number of regularity certificates from CGU.

3.4 POLITICAL DIMENSION

The political dimension covers data on federal civil servants' political activities, namely their affiliation to political parties. By identifying the affiliated servants of each management unit, we can measure how much each political party influences the units and whether this relates to corruption. The main database comes from the Superior Electoral Court (TSE, http://www.tse.jus.br/eleicoes/estatisticas/repositorio-de-dados-eleitorais).

Taking into account the knowledge of DIE experts, we selected nine different pieces of information from the TSE data. Examples are: number of affiliations to a given political party, and total number of affiliated servants in each management unit.

4 DATA PREPARATION

The data are extracted for two classes, called "Corrupt" and "Non Corrupt". On one hand, "Corrupt" management units are those that, throughout their history, have had at least one civil servant expelled specifically for corruption – that is, units with servants registered in CEAF whose legal basis for expulsion is consistent with the definition of corruption stated in Section 1. On the other hand, to build the "Non Corrupt" group we sampled a large set of management units and removed those considered "Corrupt" by this definition, keeping the random sample proportion. The non-corrupt dataset was thus created from a random sample of approximately 4,800 federal management units – roughly 8 times the number of corrupt units.
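As a concrete illustration of this labeling step, the R sketch below builds the two classes. It is only a sketch under assumed names: the tables `units` and `ceaf_corruption` and the column `unit_id` are hypothetical placeholders, not the paper's actual schema.

    # Hedged sketch: label a unit as Corrupt when CEAF records at least one
    # of its servants expelled for corruption, then sample the Non Corrupt group.
    library(dplyr)

    units <- mutate(units,
                    class = ifelse(unit_id %in% ceaf_corruption$unit_id,
                                   "Corrupt", "NonCorrupt"))

    corrupt <- filter(units, class == "Corrupt")
    set.seed(42)                                   # reproducibility (our choice)
    non_corrupt <- units %>%
      filter(class == "NonCorrupt") %>%
      sample_n(8 * nrow(corrupt))                  # roughly the 8:1 ratio reported

    dataset <- bind_rows(corrupt, non_corrupt)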
The data preparation phase includes feature selection and goes through the following steps, described in the next sections:

• Data Cleaning and Feature Engineering: adjusts the dataset;
• Preliminary Analysis: treats zero per-class variance and perfect correlation;
• Data Separation: segregates data into training and testing sets;
• Intermediary Analysis: variance and correlation filtering;
• Feature Selection: uses Adaptive Lasso;
• Discretization: applies MDLP and CACC.

4.1 DATA CLEANING AND FEATURE ENGINEERING

Besides the usual data cleaning activities – such as fixing inconsistencies, converting data, and standardizing data types – we also treated missing values. For categorical variables we created a category "NA" representing the absence of a value. For counting variables, a missing value represents an actual value of zero, so it was replaced accordingly. Other fields with missing values were treated individually; for example, the date of cancellation of a party affiliation, when the affiliation was still active, was replaced by the current date in order to create time-of-affiliation features.

For feature engineering, we first created binarized features for all categorical variables. Then, since some information can be registered more than once for a given management unit – for example, a unit can have several regularity certificates – we had to summarize the features per unit. With only numerical features remaining, some were summarized by creating maximum, minimum, average, and total features. For example, annual income was transformed into maximum, minimum, and mean annual income.

After this step, we had created 2,238 different features.

4.2 PRELIMINARY ANALYSIS

First we removed features whose variance within one of the classes was zero, since with zero class-variance, algorithms may produce coefficient estimates that do not generalize (Hosmer et al., 2013). After calculating the class-variance of each of the 2,238 features, 747 of them were removed – most of them related to binarized categorical variables.

We also preliminarily addressed perfect pairwise correlation, which amounts to redundant information and may bias the estimates. Perfectly correlated features may have been added accidentally or may have arisen during feature engineering. Among the 1,495 remaining variables, 96 – 48 pairs – were perfectly correlated; DIE experts chose which feature to eliminate in each pair.

4.3 DATA SEPARATION

At this point, our complete dataset had 688 corrupt units and 4,792 non-corrupt units, with 1,447 features.

In this step we created two datasets: Training Data (DT) and Testing Data (DTE). The first is used throughout data preparation and modeling, while the second is used only as a final test after choosing the best final model. To keep the original balance, DTE was created from a random sample of 20% of the corrupt plus 20% of the non-corrupt units, and DT kept the remaining 80% of the complete dataset.

4.4 INTERMEDIARY ANALYSIS

As in the Preliminary Analysis, we again analyzed the class-variance, which resulted in removing 62 features with zero variance in one of the classes.

In the intermediary analysis, however, we performed a different correlation analysis, following the well-known hypothesis (Hall, 1999): "A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other."

We first calculated the correlation matrix of the remaining 1,376 features, adding their correlation with the class column indicating corruptibility – 0 for non-corrupt units and 1 for corrupt units. We then flagged pairs of features with absolute correlation equal to or greater than 0.70, a value generally considered high correlation (Taylor, 1990). Next, the matrix was sorted in descending order of the features' correlation with the class, and its rows were traversed from the feature most correlated with the class downward. In each row, we kept the feature with the highest class correlation and removed from the dataset and the matrix the remaining features whose absolute inter-correlation with it was 0.70 or higher.

With this procedure we eliminated 468 features, leaving 910. This approach was used to mitigate the collinearity problem, given that it is impossible to analyze all possible combinations of feature groups in this work. The heuristic of ranking each feature by its correlation with the class – although not fully reflected in a model, due to interactions between features – serves to keep the theoretically most significant features. (It may be useful to analyze correlation with different methods in future work.)
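A minimal sketch of this filtering heuristic follows, assuming a numeric feature matrix `X` and a 0/1 class vector `y` (our names, not the paper's):

    # Rank features by absolute correlation with the class, then greedily
    # drop any remaining feature correlated >= 0.70 with a kept one.
    cm        <- abs(cor(cbind(X, class = y)))
    class_cor <- cm[colnames(X), "class"]
    ranked    <- names(sort(class_cor, decreasing = TRUE))

    kept <- character(0); dropped <- character(0)
    for (f in ranked) {
      if (f %in% dropped) next
      kept    <- c(kept, f)
      too_cor <- names(which(cm[f, colnames(X)] >= 0.70))
      dropped <- union(dropped, setdiff(too_cor, kept))
    }
    X_filtered <- X[, kept]

Because a feature is dropped as soon as an earlier, more class-correlated feature covers it, the kept features are mutually inter-correlated below the 0.70 threshold, matching the traversal described above.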
4.5 FEATURE SELECTION

To perform feature selection, the dataset passes through a regularized regression, specifically the Adaptive Lasso. We start by running Ridge Regression with 10-fold cross-validation on the DT dataset; the coefficient estimates are used to construct a vector of adaptive weights. With this vector introduced as the penalty factor, we then run the Adaptive Lasso with 10-fold cross-validation. Notably, the Adaptive Lasso can force some coefficient estimates to be exactly zero, thereby reducing the number of features.

After feature selection with Adaptive Lasso, 144 features remained. The 10-fold cross-validation yielded an AUC (Area Under the ROC Curve) (Bradley, 1997) of 0.85, which we considered satisfactory.
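The paper does not name the implementation used; below is one plausible sketch of the two-step procedure with the glmnet package, where `x` is the numeric feature matrix and `y` the binary class vector (our names):

    library(glmnet)

    # Step 1: ridge regression (alpha = 0) with 10-fold CV; its coefficients
    # define the adaptive weights (larger penalty on smaller coefficients).
    ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 10)
    w <- 1 / abs(as.matrix(coef(ridge, s = "lambda.min"))[-1, 1])

    # Step 2: weighted L1 fit (alpha = 1) with the weights as penalty factors.
    alasso <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                        nfolds = 10, penalty.factor = w)

    # Features whose coefficients were not shrunk to exactly zero are selected.
    b <- as.matrix(coef(alasso, s = "lambda.min"))
    selected <- rownames(b)[b[, 1] != 0 & rownames(b) != "(Intercept)"]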
4.6 DISCRETIZATION

In recent years, discretization has received increasing research attention (Garcia et al., 2013). For non-monotonic variables, discretization proves essential, since it makes it possible to split an originally non-monotonic variable into several monotonic derived covariates (Tufféry, 2011). Also, some algorithms for Bayesian models require all features to be categorical, and discretization is a way to achieve this.

In recent research (Garcia et al., 2013), two algorithms have shown generally strong performance: MDLP (Minimum Description Length Principle) (Irani, 1993) and CACC (Class-Attribute Contingency Coefficient) (Garcia et al., 2013). We compare these algorithms by later building models on the groups of features discretized with each method.

Accordingly, we generated two different datasets from DT, one per discretization method. The dataset discretized with MDLP contained 23 binary features, while CACC returned 66 – these datasets have fewer features than the original because constant features were automatically removed.

5 MODELING

In the modeling phase we created models for each of the datasets discretized with MDLP and CACC, building Bayesian models with three different algorithms: Naïve Bayes (Lowd and Domingos, 2005), Tree Augmented Naïve Bayes (Zheng and Webb, 2011), and Attribute Weighted Naïve Bayes (Taheri et al., 2014).

This task was done with the R package caret (https://cran.r-project.org/web/packages/caret/index.html). We used 10-fold cross-validation to evaluate AUC and tried several combinations of parameters for each of the three algorithms – from 20 to 60 combinations. For example, for Tree Augmented Naïve Bayes we used three score functions (loglik, bic, aic), each alongside 20 different values of the smoothing parameter (from 0 to 19). After these models were built, caret selected, for each algorithm, the parameter combination with the best AUC.
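The sketch below illustrates this search with caret. The paper names the package and the tuning ranges but not the exact method strings; we assume caret's bnclassify-based models ("tan" and "nbDiscrete"), and `train_x`/`train_y` are our placeholder names (predictors must be factors for these methods, and the class a two-level factor).

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    # Tree Augmented Naive Bayes: 3 scores x 20 smoothing values = 60 combinations
    tan_fit <- train(x = train_x, y = train_y, method = "tan",
                     metric = "ROC", trControl = ctrl,
                     tuneGrid = expand.grid(score = c("loglik", "bic", "aic"),
                                            smooth = 0:19,
                                            stringsAsFactors = FALSE))

    # Naive Bayes on the discretized features: 20 smoothing values
    nb_fit <- train(x = train_x, y = train_y, method = "nbDiscrete",
                    metric = "ROC", trControl = ctrl,
                    tuneGrid = expand.grid(smooth = 0:19))

    tan_fit$bestTune  # parameter combination with the best cross-validated AUC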
5.1 DISCRETIZATION SELECTION

The first step is to choose the most suitable discretization. With this in mind, for each discretized dataset we take the average AUC of the three algorithms, again using 10-fold cross-validation to estimate out-of-sample performance. The mean AUC values are shown in Table 1, alongside the number of features in each dataset.

Table 1: Mean Results of Bayesian Models for each Discretized Dataset

  Discretization   No. of features   AUC
  MDLP                          23   0.82
  CACC                          66   0.83

Although the results for the CACC-discretized dataset were slightly better, it is desirable to minimize the number of features in a model: models with fewer features tend to be more numerically stable and more easily adopted, and they also reduce overfitting and increase interpretability. We therefore chose the features discretized with MDLP, since the respective model achieved results close to CACC's while keeping almost three times fewer features.

5.2 MODEL SELECTION

With the discretized dataset chosen, we now evaluate the Bayesian models built with the three algorithms: TAN (Tree Augmented Naïve Bayes), AW-NB (Attribute Weighted Naïve Bayes), and NB (Naïve Bayes). The AUC outcomes are shown in Table 2.

Table 2: Results of Bayesian Models for MDLP Dataset

  Algorithm   AUC
  TAN         0.8272
  AW-NB       0.8207
  NB          0.8244

Observing the results, we chose the model built with NB (Naïve Bayes) as our final model, since it is simpler and more interpretable while keeping practically the same performance as the other two models.

6 EVALUATION

In the evaluation phase, we start by analyzing the results of the final model on the testing data separated at the beginning of this work. We then analyze the conditional probabilities of the features to extract useful knowledge for fighting corruption.

6.1 TESTING DATA

To ultimately validate our final model, we used the dataset separated for this purpose in the data preparation phase: the testing dataset (DTE). The first step is to adjust DTE to the same 23 final features selected with MDLP discretization. Applying the final model to DTE yielded an AUC of approximately 0.76. We consider this satisfactory: the result is only slightly below that obtained on the training dataset and above 0.70, a common threshold for good models.
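As an illustration, this hold-out check could be computed as below with the pROC package (the paper reports the AUC but not the tool used to compute it; `test_x`/`test_y` are our placeholder names, with `test_x` restricted to the same 23 features):

    library(pROC)

    # Class probabilities from the final caret model on the hold-out set
    probs <- predict(nb_fit, newdata = test_x, type = "prob")[, "Corrupt"]

    roc_dte <- roc(response = test_y, predictor = probs,
                   levels = c("NonCorrupt", "Corrupt"))
    auc(roc_dte)  # the paper reports approximately 0.76 on DTE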
6.2 KNOWLEDGE DISCOVERY

Observing the conditional probabilities of the final model, we extracted the rules it follows to assess the corruptibility of federal management units. This knowledge discovery aims to contribute to anti-corruption activities. Some of the main extracted rules indicating an increased risk of corruption are listed below.

• Accounts judged irregular by TCU;
• Responsibilities related to financial activities;
• Substitution in public functions that control expenses;
• Number of requested civil servants allocated;
• Heading roles in regional agencies;
• Political party affiliations;
• Activities spread over multiple municipalities; and
• Number of public offices occupied by designation (without a selective process).

After discussing the main rules with DIE experts, they made a few comments to rationalize the knowledge discovered by the model:

• Accounts judged irregular by TCU are, by definition, scenarios involving inadequacies or irregularities;
• Responsibilities related to expenses and financial activities are critical, since they involve public resources and possible embezzlement;
• A management unit with several civil servants allocated by request may indicate a weak internal career structure;
• Heading roles in regional management units usually give civil servants a relatively high amount of discretionary decision-making power, displaying a scenario of high propensity to corruption;
• Political party affiliations relate to greater political influence over decisions of public interest in the federal management units;
• Units with activities in many municipalities have to deal with decentralization problems; and
• Public offices filled by designation are occupied through nomination by discretionary authorities, not necessarily on merit.

Analyzing the rules together with the experts' comments, we see that the results are reasonably suitable for scenarios involving federal management units.

7 DEPLOYMENT

In the deployment phase, we created a Web application that allows managers at CGU to query management units and analyze their risk of corruption. With paths of grouped queries, managers can now view management units organized by their agencies. They can also perform ad-hoc queries, using unique identifiers of management units as input, to obtain a risk-of-corruption analysis for an individual unit or for groups of units.

To deploy the predictive model, we implemented the Naïve Bayes calculation with the conditional probabilities of the features selected in our final model. Using the output probabilities given by the model, we then manually discretized the results into risk categories: less than 0.20 as Very Low; 0.20 to below 0.40 as Low; 0.40 to below 0.60 as Medium; 0.60 to below 0.80 as High; and 0.80 or greater as Very High.

The Web application also generates PDF reports containing, for a given management unit, its risk of corruption and the average and maximum risk of corruption of the management units in the same agency. The application shows not only risk results but also several other government data related to each management unit, allowing a general view of each unit.

With the application running, we began presenting this work to all areas of CGU. Currently, several activities involving management units are being prioritized using our risk-of-corruption predictive model together with other information.
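The risk banding described above maps directly onto a single call in R, applied to the model's output probabilities (`probs`, as in the earlier sketch):

    # Map posterior probabilities to the five deployment risk categories
    risk <- cut(probs,
                breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1),
                labels = c("Very Low", "Low", "Medium", "High", "Very High"),
                right = FALSE, include.lowest = TRUE)

With right = FALSE, each interval is closed on the left, so a probability of exactly 0.20 falls into Low, matching the thresholds listed above; include.lowest = TRUE keeps a probability of exactly 1 in Very High.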
8 CONCLUSION

This paper described a data mining project that generated Bayesian models to assess the risk of corruption of federal management units. We analyzed data from several government databases and, with the help of DIE experts, engineered thousands of relevant features. These variables were prepared and pre-processed, removing those with zero class-variance and high inter-correlation.

Feature selection was done with Adaptive Lasso, which selected the 144 most relevant features. We compared two discretization methods, CACC and MDLP, building Bayesian models on each discretized dataset with three algorithms: Naïve Bayes, Tree Augmented Naïve Bayes, and Attribute Weighted Naïve Bayes. To choose the best discretization method, we averaged the 10-fold cross-validation metrics per dataset; MDLP was chosen for its strong results combined with a considerable reduction in the number of selected features – from 144 to 23.

After choosing the MDLP-discretized dataset, we evaluated the AUC of the three modeling algorithms. The results were very close, at approximately 0.82, so we chose the Naïve Bayes model as our final model, since it is simpler and more interpretable. The Testing dataset (DTE) separated during data preparation was then used to confirm the validity of the final model, yielding an AUC of approximately 0.76.

Finally, the rules of the final model were extracted and, with help from DIE experts, turned into knowledge for anti-corruption activities; the generated rules and the experts' comments were outlined to give an overview of the results. The predictive model was also deployed in a Web application, allowing managers from CGU to query and analyze federal management units by their risk of corruption. With the results of our model, CGU is already prioritizing corruption-related activities to help maximize audit efficacy.

This work thus contributed an end-to-end overview of a data mining project applying several state-of-the-art techniques. We reinforced CGU's activities in fighting corruption by building a useful model to assess the risk of corruption of federal management units, and the knowledge discovered is increasing the expertise of DIE analysts. With the Web application developed from this project, we help potentially save millions in public resources; additionally, risk assessment encourages proactive audits, helping managers plan their work. In this way, we generate nationwide impact in fighting corruption.

Acknowledgements

The authors would like to thank the corruption fighting expert Victor Steytler for providing useful insights for the development of this work. Finally, the authors would like to thank CGU for providing the resources necessary for this research, as well as for allowing its publication.

References

R. Balaniuk, P. Bessiere, E. Mazer, and P. Cobbe. Risk based government audit planning using naïve Bayes classifiers. In Advances in Knowledge-Based and Intelligent Information and Engineering Systems, 2012.

Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

Ricardo Carvalho, Rommel Carvalho, Marcelo Ladeira, Fernando Monteiro, and Gilson Mendes. Using political party affiliation data to measure civil servants' risk of corruption. In 2014 Brazilian Conference on Intelligent Systems (BRACIS), pages 166–171. IEEE, 2014.

Rommel Carvalho, Shou Matsumoto, Kathryn B. Laskey, Paulo C. G. Costa, Marcelo Ladeira, and Laécio L. Santos. Probabilistic ontology and knowledge fusion for procurement fraud detection in Brazil. In Uncertainty Reasoning for the Semantic Web II, pages 19–40. Springer, 2013.

S. Garcia, J. Luengo, J. A. Sáez, V. Lopez, and F. Herrera. A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4):734–750, 2013.

Mark A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.

David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant. Applied Logistic Regression, volume 398. John Wiley & Sons, 2013.

Keki B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. 1993.

Yufeng Kou, Chang-Tien Lu, Sirirat Sirwongwattana, and Yo-Ping Huang. Survey of fraud detection techniques. In 2004 IEEE International Conference on Networking, Sensing and Control, volume 2, pages 749–754. IEEE, 2004.

Daniel Lowd and Pedro Domingos. Naive Bayes models for probability estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 529–536. ACM, 2005.

E. W. T. Ngai, Yong Hu, Y. H. Wong, Yijun Chen, and Xin Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559–569, 2011.

Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119, 2010. URL http://arxiv.org/abs/1009.6119.

Carlos Vinícius Silva and Célia Ralha. Utilização de técnicas de mineração de dados como auxílio na detecção de cartéis em licitações [Use of data mining techniques to support the detection of cartels in public bidding]. In WCGE – II Workshop de Computação Aplicada em Governo Eletrônico, 2010.

Sona Taheri, John Yearwood, Musa Mammadov, and Sattar Seifollahi. Attribute weighted naive Bayes classifier using a local optimization. Neural Computing and Applications, 24(5):995–1002, 2014.

Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review. In Data Classification: Algorithms and Applications, page 37, 2014.

Richard Taylor. Interpretation of the correlation coefficient: a basic review. Journal of Diagnostic Medical Sonography, 6(1):35–39, 1990.

Cheng-Jung Tsai, Chien-I Lee, and Wei-Pang Yang. A discretization algorithm based on class-attribute contingency coefficient. Information Sciences, 178(3):714–731, 2008.

Stéphane Tufféry. Data Mining and Statistics for Decision Making. John Wiley & Sons, 2011.

Fei Zheng and Geoffrey I. Webb. Tree augmented naive Bayes. In Encyclopedia of Machine Learning, pages 990–991. Springer, 2011.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.