<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Bayesian Models to Assess Risk of Corruption of Federal Management Units</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ricardo S. Carvalho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rommel N. Carvalho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brazilian Law no. 8.429</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
<institution>Campus Darcy Ribeiro Brasília</institution>
          ,
          <addr-line>DF</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
<institution>Department of Computer Science at the University of Brasília</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <abstract>
<p>This paper presents a data mining project that generated Bayesian models to assess risk of corruption of federal management units. With thousands of extracted features related to corruptibility, the data were processed using techniques like correlation analysis and variance per class. We also compared two different discretization methods: Minimum Description Length Principle (MDLP) and Class-Attribute Contingency Coefficient (CACC). The feature selection process used Adaptive Lasso. To choose our final model we evaluated three different algorithms: Naïve Bayes, Tree Augmented Naïve Bayes, and Attribute Weighted Naïve Bayes. Finally, we analyzed the rules generated by the model in order to support knowledge discovery.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Corruption is a recurrent and primary subject on the Brazilian
government agenda, fundamentally requiring constant and efficient
efforts to combat it. Public corruption can be defined – supported by
Brazilian Law no. 8,429, of June 19921 – as an act of misconduct or
improper use of public office that leads to illicit enrichment, causes
injury to the public treasury, or infringes upon the principles of
public administration.</p>
      <p>The Brazilian Office of the Comptroller General (CGU), an agency
incorporated in the structure of the Presidency, has among its
competences the role of directly and immediately assisting the
President on matters and measures related to preventing and fighting
corruption. Through activities of strategic information production,
the Department of Research and Strategic Information (DIE) is the area
responsible for investigating possible irregularities involving
federal civil servants working in management units.</p>
      <p>Nowadays, there are more than thirty thousand active federal
management units2, all subject to investigation. Due to this large
number of units, most of the time DIE is limited to investigating only
those involved in large federal operations or recurrent complaints,
often restricting its activities to cases triggered externally. Thus,
it is important to prioritize activities based on risks of involvement
in corruption so that DIE can act more effectively and
proactively.</p>
      <p>This work has two main objectives and contributions. The first is to
build a Bayesian model to assess risk of corruption of federal
management units. To this end, we apply state-of-the-art data mining
techniques, along with a practical study of the information related to
corruption. We therefore wish to contribute to CGU’s activities in
fighting corruption by building a useful model for their work
prioritization. Also, the step-by-step account of this data mining
project might be interesting for other practitioners, since it
involves the combination of several different methods. We show how we
applied correlation analysis and two discretization methods to process
features, Adaptive Lasso for feature selection, and we end up
comparing three different algorithms to choose our final Bayesian
model. Hence, this work contributes to practitioners by describing the
application of data mining techniques with a practical objective and a
singular combination of techniques.</p>
      <p>2Management Units dataset: http://www.tesourotransparente.gov.br/ckan/dataset/siafi-relatorio-unidades-gestoras</p>
      <p>The second objective is to achieve knowledge discovery from the
information about corruptibility of federal management units, seeking
to extract new rules in this domain. To this end, the available
information on management units – as well as their direct and indirect
relationships with the federal civil servants working there – is
analyzed with the support of DIE experts in fighting corruption. After
building our final model, we analyzed its derived rules. With this, we
wish to contribute to the enrichment of the experts’ knowledge in
fighting corruption.</p>
      <p>In Section 2, we depict the works most closely related to fighting
corruption and how data mining has been used in them, while in
Section 3 we give an overview of the information selected by DIE
experts that will be used to build our models. Section 4 describes the
steps taken to pre-process data, such as correlation analysis,
discretization, and feature selection. In Section 5 we show how we
used machine learning to build several models, and Section 6 depicts
our evaluation strategies. Section 7 discusses our deployment efforts
related to the products of this work, and we end this paper with a
conclusion in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        In the last decade, a research topic closely related to risk of
corruption is fraud detection, whose main objective is to reveal
trends of suspicious acts. For example, an emerging theme is the use
of data mining to detect financial fraud. A review of the academic
literature on such applications
        <xref ref-type="bibr" rid="ref11">(Ngai et al., 2011)</xref>
        shows their successful use in detecting credit card fraud, money
laundering, and bankruptcy prediction, among others. This review also
identifies common data mining techniques used in fraud detection,
including Artificial Neural Networks, Decision Trees, Logistic
Regression, and Naïve Bayes.
      </p>
      <p>
        In this context, a recent survey on the subject of data
mining-based fraud detection
        <xref ref-type="bibr" rid="ref12">(Phua et al., 2010)</xref>
        displays a summary of published technical articles and a
review on the topic. This survey, as well as other works
        <xref ref-type="bibr" rid="ref9">(Kou et al., 2004)</xref>
        , includes comments on similar
applications. Also, an individual-oriented corruption analysis
        <xref ref-type="bibr" rid="ref3">(Carvalho et al., 2014)</xref>
        was done building a corruption
risk model for affiliated civil servants with algorithms
like Random Forest and Bayesian Networks.
      </p>
      <p>
        Regarding aspects of corruption, research related to
public bidding and contracting processes has also been
carried out, though not as widely as in fraud detection. The
use of clustering and association rules to the problem
of cartels in public bidding processes
        <xref ref-type="bibr" rid="ref12 ref13">(Silva and Ralha,
2010)</xref>
        found results that corroborate the application of
data mining in the prevention of corruption. Another
paper
        <xref ref-type="bibr" rid="ref1">(Balaniuk et al., 2012)</xref>
shows the use of Naïve Bayes
to evaluate the risk of corruption in public procurement.
The authors applied natural logarithm to discretize
attributes and based their assessment on the results of the
conditional probabilities defined by experts.
      </p>
      <p>
        In addition, a recent paper
        <xref ref-type="bibr" rid="ref4">(Carvalho et al., 2013)</xref>
        presents the use of probabilistic ontologies to design and
test a model that performs the fusion of information to
detect possible fraud in bidding processes involving
federal money in Brazil.
      </p>
      <p>
        Discretization has recently received a lot of attention as a
pre-processing technique, mostly because many machine learning
algorithms are known to produce better models when continuous
attributes are discretized
        <xref ref-type="bibr" rid="ref5">(Garcia et al., 2013)</xref>
        . Two algorithms have shown generally strong performance, namely
CACC (Class-Attribute Contingency Coefficient)
        <xref ref-type="bibr" rid="ref17">(Tsai et al.,
2008)</xref>
        and MDLP (Minimum Description Length Principle)
        <xref ref-type="bibr" rid="ref8">(Irani, 1993)</xref>
        . In this work we compare these algorithms by creating models after
feature selection, which allows us to choose the one with the best
results.
      </p>
      <p>
        For feature selection, a recent review
        <xref ref-type="bibr" rid="ref15">(Tang et al., 2014)</xref>
        shows several widely used techniques, such as Adaptive Lasso
        <xref ref-type="bibr" rid="ref20">(Zou, 2006)</xref>
        . The Adaptive Lasso has basically two steps. First, an initial
estimator is obtained, usually using Ridge Regression
        <xref ref-type="bibr" rid="ref20">(Zou, 2006)</xref>
        . Then an optimization problem with a weighted L1 penalty is carried
out. The initial estimator generally puts more penalty weight on the
zero coefficients and less on the nonzero ones, improving upon its
predecessor, the Lasso
        <xref ref-type="bibr" rid="ref20">(Zou, 2006)</xref>
        . Compared to the Lasso, the Adaptive Lasso has the advantage of the
oracle property
        <xref ref-type="bibr" rid="ref20">(Zou, 2006)</xref>
        , performing as well as if the true underlying model were given in
advance. Compared to the SCAD and bridge methods
        <xref ref-type="bibr" rid="ref15">(Tang et al., 2014)</xref>
        , which also have the oracle property, the advantage of the Adaptive
Lasso is its computational efficiency.
      </p>
    </sec>
    <sec id="sec-3">
      <title>DATA UNDERSTANDING</title>
      <p>Seeking to analyze the corruptibility of federal management units,
various databases to which DIE has access have been identified as
useful for this work. For a better understanding of the data, the
available information was divided into four dimensions: Corruption,
Employment, Sanctions, and Political.</p>
      <p>Some of the information treated in this work relates to the federal
civil servants who work in the management units. This information can
give an idea of how much power a certain unit concentrates or how much
influence the civil servants bring to the unit environment. Due to the
limited size of this paper, we present each dimension giving only an
overview of the existing databases and the relevant information
identified by DIE experts regarding possible relationships with
corruptibility.</p>
      <sec id="sec-3-1">
        <title>CORRUPTION DIMENSION</title>
        <p>CGU maintains the Federal Administration Registry of Expelled
(CEAF)3, a database that gathers expulsion penalties (expulsion,
retirement abrogation, and dismissal) of federal civil servants since
2003.</p>
        <p>This database is used to define the management units considered
corrupt, namely the positive class for our machine learning
algorithms. The first paragraph of Section 4 describes how this is
done.</p>
      </sec>
      <sec id="sec-3-2">
        <title>EMPLOYMENT DIMENSION</title>
        <p>The employment dimension covers information about the federal
civil servants who work in each management unit. It may be basic
information such as office time and income, or data that exposes the
importance of the unit the servant is working in – such as the number
of coordination roles or of critical public offices like those that
deal directly with public resources or financial assets.</p>
        <p>Most of the information comes from the Human Resources Integrated
System (SIAPE) of the Brazilian Federal Government4.</p>
        <p>For the employment dimension, the DIE experts in fighting
corruption selected 16 different pieces of information, which can
later be transformed into 16 or more features in the data preparation
phase. Examples are: mean, maximum, and minimum monthly income; the
number of coordination roles that deal with public contracts; and the
number of roles for specific activities, such as head of a regional
agency.</p>
      </sec>
      <sec id="sec-3-3">
        <title>SANCTIONS DIMENSION</title>
        <p>The sanctions dimension covers the information on management units
that were sanctioned due to bad management of public money. We used
sanctions in the Accounts Judged Irregular (CADIRREG) from the Federal
Court of Accounts (TCU)5, which judges the accounts of each management
unit, deciding on its regularity according to Brazilian laws.
Similarly, we used CGU’s certificates of management irregularity6.</p>
        <p>3CEAF – Link: http://www.portaldatransparencia.gov.br/expulsoes/entrada</p>
        <p>4Website for the Human Resources Integrated System (SIAPE) of the
Brazilian Federal Government: http://www.siapenet.gov.br</p>
        <p>The DIE experts in fighting corruption selected four different
pieces of information, which can later be transformed into four or
more features in the data preparation phase. Examples are: the number
of accounts judged irregular by TCU and the number of regularity
certificates from CGU.</p>
      </sec>
      <sec id="sec-3-4">
        <title>POLITICAL DIMENSION</title>
        <p>The political dimension covers data on federal civil servants
related to political activities, namely information on affiliation to
political parties. By obtaining the affiliated servants of each
management unit, we can measure how much each political party
influences the units and whether this relates to corruption. The main
database comes from the Superior Electoral Court (TSE)7. Taking into
account the knowledge of DIE experts, we selected nine different
pieces of information from the data provided by TSE. Examples are: the
number of affiliations to a given political party and the total number
of affiliated servants in each management unit.</p>
        <p>DATA PREPARATION
The data are extracted for two classes, called “Corrupt” and “Non
Corrupt”. On one hand, “Corrupt” management units are those that,
throughout their history, have had at least one civil servant expelled
specifically due to corruption. In other words, they are units that
had corrupt civil servants – those registered in CEAF whose legal
basis for expulsion is consistent with our definition of corruption,
as stated in Section 1. On the other hand, to build the “Non Corrupt”
group, we sampled a large group of management units and removed those
considered “Corrupt” by definition, keeping the random sample
proportion.</p>
        <p>Thus, the non corrupt dataset was created with a random sample of
approximately 4,800 federal management units – an amount approximately
8 times greater than the number of corrupt units.</p>
        <p>5CADIRREG: http://contas.tcu.gov.br/cadirreg/CadirregConsultaNome</p>
        <p>6CGU’s audit reports: http://sistemas.cgu.gov.br/relats/relatorios.php</p>
        <p>7TSE repositories: http://www.tse.jus.br/eleicoes/estatisticas/repositorio-de-dados-eleitorais</p>
        <p>The data preparation phase includes feature selection and goes
through the following steps, which will be described in the next
sections:
• Data Cleaning and Feature Engineering: adjusts the dataset;
• Preliminary Analysis: treats zero variance per class and
correlation;
• Data Separation: segregates data for training and testing;
• Intermediary Analysis: variance and correlation filtering;
• Feature Selection: uses Adaptive Lasso;
• Discretization: applies MDLP and CACC.</p>
        <sec id="sec-3-4-1">
          <title>DATA CLEANING AND FEATURE ENGINEERING</title>
          <p>Besides usual data cleaning activities – such as adjustment of
inconsistencies, data conversion, and standardization of data types –
the treatment of missing values was also conducted. For categorical
variables we created a category “NA” representing the absence of a
value for a given variable. For counting numerical variables, a
missing value represents an actual value of zero, so missing values
were replaced by zero. In addition, other fields with missing values
were treated individually. For example, dates of cancellation of party
affiliation, when the affiliation was still active, were replaced with
the current date in order to create features for time of
affiliation.</p>
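          <p>A minimal sketch of this missing-value treatment, in Python with
pandas (the project itself worked in R; all column names here are
hypothetical):</p>

```python
import pandas as pd

# Toy slice of the servant-level data; column names are hypothetical.
df = pd.DataFrame({
    "party": ["PT", None, "PSDB"],               # categorical
    "n_certificates": [2.0, None, 1.0],          # counting variable
    "affiliation_end": [pd.Timestamp("2010-05-01"), pd.NaT, pd.NaT],
    "affiliation_start": [pd.Timestamp("2005-01-01"),
                          pd.Timestamp("2008-03-01"),
                          pd.Timestamp("2012-07-01")],
})

today = pd.Timestamp("2016-01-01")  # reference "current date"

# Categorical variables: missing becomes an explicit "NA" category.
df["party"] = df["party"].fillna("NA")

# Counting variables: missing means the count is actually zero.
df["n_certificates"] = df["n_certificates"].fillna(0)

# Still-active affiliations: fill the cancellation date with the current
# date, so a "time of affiliation" feature can be derived.
df["affiliation_end"] = df["affiliation_end"].fillna(today)
df["affiliation_days"] = (df["affiliation_end"] - df["affiliation_start"]).dt.days
```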
          <p>On feature engineering, we first created binarized features for
all the categorical variables. Then, since some information can be
registered more than once for a given management unit – for example,
one unit can have several regularity certificates – we had to
summarize the features for each unit. With only numerical features
remaining, some of them were summarized by creating features with the
maximum, minimum, average, and total. For example, annual income was
transformed into maximum annual income, minimum annual income, and
mean annual income. After this step, we had created 2,238 different
features.</p>
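          <p>The binarization and per-unit summarization described above can be
sketched as follows, again with hypothetical servant-level columns:</p>

```python
import pandas as pd

# Toy servant-level records; names are hypothetical.
servants = pd.DataFrame({
    "unit": ["A", "A", "B"],
    "role": ["coordinator", "analyst", "analyst"],   # categorical
    "annual_income": [90_000.0, 60_000.0, 75_000.0],
})

# Binarize categorical variables (one 0/1 column per category).
servants = pd.get_dummies(servants, columns=["role"], dtype=int)

# Summarize repeated records per management unit: max, min, mean, and
# total for numeric features; counts for the binary flags.
agg = servants.groupby("unit").agg(
    max_income=("annual_income", "max"),
    min_income=("annual_income", "min"),
    mean_income=("annual_income", "mean"),
    total_income=("annual_income", "sum"),
    n_coordinators=("role_coordinator", "sum"),
)
```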
        </sec>
        <sec id="sec-3-4-3">
          <title>PRELIMINARY ANALYSIS</title>
          <p>
            At first we removed features whose variance within one of the
classes was equal to zero, since with zero class-variance, algorithms
might produce coefficient estimates that do not generalize
            <xref ref-type="bibr" rid="ref7">(Hosmer et al., 2013)</xref>
            . After calculating the class-variance for each of the 2,238
features, 747 of them were removed – most of these related to
binarized categorical variables.
          </p>
          <p>We also preliminarily addressed perfect pairwise
correlation, which accounts for redundant information and may
give biased estimates. Perfectly correlated features may
have been added accidentally, or may have arisen after
feature engineering.</p>
          <p>Among the 1,495 variables analyzed, 96 – 48 pairs – showed
perfect correlation. DIE experts chose which feature to eliminate in
each pair.</p>
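          <p>A compact illustration of these two filters – zero variance within
a class and perfect pairwise correlation – on toy data (the real
pipeline applied them to the 2,238 features):</p>

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "f_const_in_class": [1.0, 1.0, 1.0, 2.0, 3.0, 4.0],  # constant in class 0
    "f_ok": rng.normal(size=6),
})
X["f_dup"] = 2 * X["f_ok"]          # perfectly correlated with f_ok
y = pd.Series([0, 0, 0, 1, 1, 1])

# Drop features with zero variance within any class.
zero_var = [c for c in X.columns
            if any(X.loc[y == k, c].var() == 0 for k in y.unique())]
X = X.drop(columns=zero_var)

# Flag perfectly correlated pairs (|r| == 1); the experts then pick
# which member of each pair to eliminate.
corr = X.corr().abs()
pairs = [(a, b) for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:] if np.isclose(corr.loc[a, b], 1.0)]
```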
        </sec>
        <sec id="sec-3-4-4">
          <title>DATA SEPARATION</title>
          <p>At this point, our complete dataset had 688 corrupt units and
4,792 non corrupt units, with 1,447 features. In this step we created
two different datasets: Training Data (DT) and Testing Data (DTE). The
first is used throughout data preparation and modeling, while the
second is only used as a final test after choosing the best final
model.</p>
          <p>To keep the original balance, DTE was created using a random
sample of 20% of the corrupt plus 20% of the non corrupt units, and DT
kept the remaining data, corresponding to 80% of the complete
dataset.</p>
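          <p>This class-balanced 80/20 split can be reproduced with a
stratified sampler; the sketch below uses scikit-learn on stand-in
data with the class counts reported above:</p>

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the full dataset: 688 corrupt and 4,792 non corrupt units.
n_corrupt, n_clean = 688, 4792
data = pd.DataFrame({"x": np.arange(n_corrupt + n_clean)})
label = pd.Series([1] * n_corrupt + [0] * n_clean)

# 20% of each class goes to the testing data (DTE); the rest is the
# training data (DT), preserving the original class proportion.
DT, DTE, y_dt, y_dte = train_test_split(
    data, label, test_size=0.20, stratify=label, random_state=42)
```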
        </sec>
        <sec id="sec-3-4-5">
          <title>INTERMEDIARY ANALYSIS</title>
          <p>Similarly to the Preliminary Analysis, we again analyzed
the class-variance. This resulted in removing 62 features
with zero variance in one of the classes.</p>
          <p>
            Nevertheless, in the intermediary analysis we did a
different correlation analysis, following the well known
hypothesis
            <xref ref-type="bibr" rid="ref6">(Hall, 1999)</xref>
            : “A good feature subset is one that
contains features highly correlated with (predictive of)
the class, yet uncorrelated with (not predictive of) each
other”.
          </p>
          <p>
            Initially, we calculated the correlation matrix of the remaining
1,376 features, also adding their correlation with the class column
indicating corruptibility – 0 for non corrupt units and 1 for corrupt
units. Then we filtered pairs of features with correlation greater
than or equal to 0.70 (in absolute value) – a value generally
considered to indicate high correlation
            <xref ref-type="bibr" rid="ref16">(Taylor, 1990)</xref>
            . After that, the resulting matrix was sorted in descending order
by the correlation of the features with the class.
          </p>
          <p>Thereafter, the rows of the matrix were traversed starting from
the features with the largest correlation with the class. In each row,
we kept the feature with the highest correlation with the class and
removed the remaining features – from the dataset and the matrix –
that had an inter-correlation higher than 0.70 (absolute value). With
this algorithm we eliminated 468 features with absolute correlation
greater than or equal to 0.70, leaving 910 features.</p>
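          <p>The traversal described above amounts to a greedy correlation
filter; a sketch (function name hypothetical), assuming a feature
matrix X and a 0/1 class vector y:</p>

```python
import numpy as np
import pandas as pd

def correlation_filter(X: pd.DataFrame, y: pd.Series, threshold: float = 0.70):
    """Greedy filter: visit features in descending order of |corr with the
    class|; keep each visited feature and drop every remaining feature whose
    absolute inter-correlation with it is >= threshold."""
    class_corr = X.apply(lambda col: col.corr(y)).abs()
    inter = X.corr().abs()
    kept, dropped = [], set()
    for feat in class_corr.sort_values(ascending=False).index:
        if feat in dropped:
            continue
        kept.append(feat)
        for other in inter.columns:
            if (other != feat and other not in kept
                    and inter.loc[feat, other] >= threshold):
                dropped.add(other)
    return kept

# Hypothetical toy data: f1 predicts the class, f2 is nearly a copy of f1,
# f3 is independent noise.
rng = np.random.default_rng(1)
y = pd.Series([0] * 20 + [1] * 20, dtype=float)
f1 = y + rng.normal(scale=0.1, size=40)
X = pd.DataFrame({"f1": f1,
                  "f2": f1 + rng.normal(scale=0.05, size=40),
                  "f3": rng.normal(size=40)})
kept = correlation_filter(X, y)  # one of f1/f2 survives, plus f3
```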
          <p>Such an approach was used to try to avoid the collinearity
problem, mainly because it is impossible to analyze all the possible
combinations of feature groups involved in this work. Thus, the
heuristic of each feature’s correlation with the class – although not
fully reflected in a model due to interactions between the features –
serves as a technique to try to keep the theoretically most
significant features8.</p>
          <p>FEATURE SELECTION
To perform feature selection, each dataset passes through a
regularized regression, specifically Adaptive Lasso. We start by
performing Ridge Regression with 10-fold cross-validation on the DT
dataset. The coefficient estimates are used to construct a vector of
adaptive weights. With this vector introduced as the penalty factor,
we run Adaptive Lasso with 10-fold cross-validation. It is worth
noting that Adaptive Lasso can force some of the coefficient estimates
to be exactly zero, thereby reducing the number of features.</p>
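          <p>One way to sketch the two-step Adaptive Lasso (a Ridge initial
estimator, then a weighted L1 problem) is via column rescaling, shown
here with scikit-learn on toy data. This is an illustration of the
technique, not the authors’ R code:</p>

```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Toy data standing in for DT: features 0-2 drive the 0/1 class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] - X[:, 2] + 0.3 * rng.normal(size=200) > 0).astype(float)

# Step 1: initial estimator via cross-validated Ridge Regression.
ridge = RidgeCV(alphas=np.logspace(-3, 3, 20), cv=10).fit(X, y)
w = 1.0 / (np.abs(ridge.coef_) + 1e-8)   # adaptive weights: large on ~zero coefs

# Step 2: weighted L1 problem. Minimizing ||y - Xb||^2 + lambda*sum(w_j|b_j|)
# is equivalent to a plain Lasso on the rescaled columns X_j / w_j,
# recovering b_j = b'_j / w_j afterwards.
lasso = LassoCV(cv=10, random_state=0).fit(X / w, y)
coef = lasso.coef_ / w
selected = np.flatnonzero(coef != 0)     # features with nonzero estimates
```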
          <p>
            After feature selection with Adaptive Lasso, 144 features
remained. The 10-fold cross-validation resulted in an AUC (Area Under
the ROC Curve)
            <xref ref-type="bibr" rid="ref2">(Bradley, 1997)</xref>
            of 0.85, considered satisfactory.
          </p>
          <p>
            DISCRETIZATION
In recent years, discretization has received increasing research
attention
            <xref ref-type="bibr" rid="ref5">(Garcia et al., 2013)</xref>
            . In the case of non-monotonic variables, the use of
discretization techniques proves to be essential, since it makes it
possible to separate an original non-monotonic variable into various
monotonic derived covariates
            <xref ref-type="bibr" rid="ref18">(Tufféry, 2011)</xref>
            . Also, when thinking about Bayesian models, some algorithms need
all the features to be categorical, and discretization is a way of
achieving this.
          </p>
          <p>
            In recent research
            <xref ref-type="bibr" rid="ref5">(Garcia et al., 2013)</xref>
            , two algorithms have shown generally strong performance, namely
MDLP (Minimum Description Length Principle)
            <xref ref-type="bibr" rid="ref8">(Irani,
1993)</xref>
            and CACC (Class-Attribute Contingency Coefficient)
            <xref ref-type="bibr" rid="ref5">(Garcia et al., 2013)</xref>
            . We compare these algorithms by later creating models for the
groups of features discretized with each method.
          </p>
          <p>Accordingly, we generated two different datasets from DT, one for
each discretization method. The dataset discretized with the MDLP
algorithm returned 23 binary features, while CACC returned 66 – these
datasets have fewer features than the original because constant
features were automatically removed.</p>
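          <p>For reference, MDLP recursively picks the entropy-minimizing cut
point of a feature and accepts it only when the information gain
passes Fayyad and Irani’s MDL test. A compact sketch (not the
implementation used in this project):</p>

```python
import numpy as np
from collections import Counter

def entropy(labels):
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mdlp_cut_points(x, y):
    """Fayyad & Irani's MDLP: recursively choose the entropy-minimizing cut,
    accepting it only if the information gain beats the MDL criterion."""
    order = np.argsort(x)
    x, y = np.asarray(x)[order], np.asarray(y)[order]
    cuts = []

    def recurse(lo, hi):
        xs, ys = x[lo:hi], y[lo:hi]
        n = hi - lo
        if n < 2:
            return
        base = entropy(ys)
        best = None
        for i in range(1, n):
            if xs[i] == xs[i - 1]:
                continue  # no boundary between equal values
            e = (i / n) * entropy(ys[:i]) + ((n - i) / n) * entropy(ys[i:])
            if best is None or e < best[1]:
                best = (i, e)
        if best is None:
            return
        i, e = best
        gain = base - e
        k, k1, k2 = len(set(ys)), len(set(ys[:i])), len(set(ys[i:]))
        delta = np.log2(3**k - 2) - (k * base
                                     - k1 * entropy(ys[:i])
                                     - k2 * entropy(ys[i:]))
        if gain > (np.log2(n - 1) + delta) / n:   # MDL acceptance test
            cuts.append((xs[i - 1] + xs[i]) / 2)
            recurse(lo, lo + i)
            recurse(lo + i, hi)

    recurse(0, len(x))
    return sorted(cuts)

# Hypothetical feature: low values are class 0, high values are class 1.
x = np.array([1, 2, 3, 4, 5, 6, 10, 11, 12, 13, 14, 15], dtype=float)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
cuts = mdlp_cut_points(x, y)  # a single accepted cut between 6 and 10
```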
          <p>
            MODELING
In the modeling phase we started by creating models for each of the
datasets discretized with MDLP and CACC. For this, we created Bayesian
models using three different algorithms: Naïve Bayes
            <xref ref-type="bibr" rid="ref10">(Lowd and Domingos,
2005)</xref>
            , Tree Augmented Naïve Bayes
            <xref ref-type="bibr" rid="ref18 ref19">(Zheng and Webb,
2011)</xref>
            , and Attribute Weighted Naïve Bayes
            <xref ref-type="bibr" rid="ref14">(Taheri et al.,
2014)</xref>
            .
          </p>
          <p>This task was done using the R package caret9. We used 10-fold
cross-validation to evaluate the AUC and tried several different
combinations of parameters for each of the three algorithms – from 20
to 60 combinations. For example, for Tree Augmented Naïve Bayes we
used three score functions (loglik, bic, aic), each alongside 20
different values for smoothing (from 0 to 19). After these models were
built, caret selected, for each algorithm, the combination of
parameters that resulted in the best AUC value.</p>
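          <p>The parameter search with cross-validated AUC can be mirrored in
Python with scikit-learn (the project used caret in R; the data and
the smoothing grid below are illustrative):</p>

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Toy binary-feature data standing in for the MDLP-discretized DT dataset.
rng = np.random.default_rng(0)
y = np.array([0] * 400 + [1] * 100)
X = rng.binomial(1, np.where(y[:, None] == 1, 0.7, 0.3), size=(500, 23))

# Try several smoothing values and keep the one with the best mean
# 10-fold AUC, mirroring caret's per-algorithm parameter search.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {a: cross_val_score(BernoulliNB(alpha=a), X, y,
                              scoring="roc_auc", cv=cv).mean()
           for a in (0.1, 1.0, 10.0)}
best_alpha = max(results, key=results.get)
```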
          <p>DISCRETIZATION SELECTION
The first step is to choose the most suitable discretization. With
this in mind, for each discretized dataset we take the average AUC of
the three algorithms used, again using 10-fold cross-validation to
estimate the out-of-sample results. The mean AUC outcomes are depicted
in Table 1, alongside the number of features in each dataset. Although
the results for the dataset with CACC discretization were slightly
better, it is desirable to minimize the number of features in a model:
models with fewer features tend to be more numerically stable and to
be adopted more easily, and they can also avoid overfitting and be
more interpretable.</p>
          <p>8It may be useful to use different methods to analyze
correlation in future work.</p>
          <p>9R Package caret: https://cran.r-project.org/web/packages/caret/index.html</p>
          <p>Therefore, we chose to select the features discretized with MDLP,
since the respective model achieved results close to CACC while
keeping almost three times fewer features.</p>
          <p>MODEL SELECTION
With the discretized dataset chosen, we now evaluate the Bayesian
models built with the three algorithms: TAN (Tree Augmented Naïve
Bayes), AW-NB (Attribute Weighted Naïve Bayes), and NB (Naïve Bayes).
The AUC outcomes are shown in Table 2. Observing the results, we chose
the Bayesian model created with NB (Naïve Bayes) as our final model,
since it is more interpretable and simpler, while keeping practically
the same results as the other two models.</p>
          <p>EVALUATION
In the evaluation phase, we start by analyzing the results of the
final model on the testing data separated at the beginning of this
work. We then analyze the conditional probabilities of the features to
extract useful knowledge for fighting corruption.</p>
          <p>TESTING DATA
To ultimately validate our final model, we used the dataset separated
in the data preparation phase for this purpose: the testing dataset
(DTE). The first step is to adjust DTE to have the same 23 final
features selected from the MDLP discretization.</p>
          <p>Applying the final model to DTE, we obtained an AUC of
approximately 0.76. We consider this result satisfactory: it is only
slightly below the result obtained on the training dataset and above
0.70, commonly considered a threshold for good models.</p>
          <p>KNOWLEDGE DISCOVERY
Observing the conditional probabilities of the final model, we
extracted the rules it follows to define corruptibility for federal
management units. This knowledge discovery aims to contribute to the
activities of fighting corruption. Some of the main extracted rules
that indicate an increased risk of corruption are shown below.</p>
          <p>• Accounts judged irregular by TCU;
• Responsibilities related to financial activities;
• Substitution of public functions for controlling expenses;
• Number of requested civil servants allocated;
• Heading roles in regional agencies;
• Political party affiliations;
• Activities spread over multiple municipalities; and
• Number of public offices occupied by designation (without a
selective process).</p>
          <p>After we discussed the main rules with them, the DIE experts made
a few comments to rationalize the knowledge discovered by the
model.</p>
          <p>• Accounts judged irregular by TCU are, by definition, scenarios
that involve inadequacies or irregularities;
• Responsibilities related to expenses and financial activities are
critical, since they involve public resources and possible
embezzlement;
• A management unit with several civil servants allocated by request
might indicate a scenario of a weak internal career path;
• The heading roles related to regional management units usually have
civil servants holding a relatively high amount of decision-making
power with greater discretion, displaying a scenario of high
propensity to corruption;
• Political party affiliations are related to greater political
influence on decisions of public interest in the federal management
units;
• Units with activities in many municipalities have to deal with
decentralization problems; and
• Public offices filled by designation are occupied due to nomination
by discretionary authorities, not necessarily related to merit.</p>
          <p>Therefore, by analyzing the rules together with the experts’
comments, we see that the results are reasonably suitable for
scenarios involving federal management units.</p>
          <p>DEPLOYMENT
In the deployment phase, we created a Web application that allows
managers at CGU to query management units and analyze their risk of
corruption. With paths of grouped queries, managers can now view
management units organized by their agencies. They are also able to
perform ad-hoc queries, using the unique identifiers of management
units as input, to obtain a risk of corruption analysis for an
individual unit or for groups of them.</p>
          <p>To deploy the predictive model, we simply implemented the Naïve
Bayes calculation with the conditional probabilities of the features
selected in our final model. Using the output probabilities given by
the model, we then discretized the results to show only risk
categories, specifically: less than 0.20 as Very Low; greater than or
equal to 0.20 but less than 0.40 as Low; greater than or equal to 0.40
but less than 0.60 as Medium; greater than or equal to 0.60 but less
than 0.80 as High; and greater than or equal to 0.80 as Very
High.</p>
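          <p>These thresholds map directly to a small function (name
hypothetical):</p>

```python
def risk_category(p: float) -> str:
    """Map the model's output probability to the risk bands shown in the
    Web application."""
    if p < 0.20:
        return "Very Low"
    if p < 0.40:
        return "Low"
    if p < 0.60:
        return "Medium"
    if p < 0.80:
        return "High"
    return "Very High"
```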
          <p>The Web application also generates PDF reports containing, for a
given management unit, its risk of corruption and the average and
maximum risk of corruption of the management units in the same agency.
The application shows not only risk results but also several other
government data related to each management unit, allowing a general
view of each unit.</p>
          <p>With the application running, we started to present this work to
all areas of CGU. Currently, several activities involving management
units are being prioritized using our risk of corruption predictive
model together with other information.</p>
          <p>CONCLUSION
This paper described a data mining project that generated
Bayesian models to assess risk of corruption of federal
management units. We analyzed data from several
government databases and, with the help of DIE experts,
developed thousands of important features. These variables
were prepared and pre-processed, removing those with zero
class-variance or high inter-correlation. Feature selection
was done using Adaptive Lasso, which selected the 144 most
relevant features. We compared two different discretization
methods: CACC and MDLP. Bayesian models were built for
datasets discretized with both methods using the following
algorithms: Naïve Bayes, Tree Augmented Naïve Bayes, and
Attribute Weighted Naïve Bayes. To choose the best
discretization method, we evaluated our results by averaging
the 10-fold cross-validation metrics per dataset. MDLP was
chosen because of its strong results combined with a
considerable reduction in the number of features selected,
from 144 to 23.</p>
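          <p>The pre-processing filters mentioned above can be sketched as follows. This is only an illustration under our own reading of the criteria: we interpret "zero class-variance" as a feature whose mean does not vary across classes, and the 0.95 correlation cutoff is an assumed value; the authors' exact rules may differ.</p>

```python
import numpy as np
import pandas as pd

def drop_uninformative(X: pd.DataFrame, y: pd.Series,
                       corr_threshold: float = 0.95) -> pd.DataFrame:
    """Drop features whose class means show no variance, then drop one
    feature from each highly inter-correlated pair."""
    # Zero class-variance (our reading): the feature's average is identical
    # in every class, so it does not help discriminate between classes.
    class_means = X.groupby(y).mean()
    flat = class_means.columns[class_means.var() == 0]
    X = X.drop(columns=flat)

    # High inter-correlation: keep the first feature of each redundant pair.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return X.drop(columns=redundant)
```

Here `b` is constant, `d` has equal class means, and `c` duplicates `a`, so only `a` survives: `drop_uninformative(pd.DataFrame({"a": [1, 1, 2, 2], "b": [3, 3, 3, 3], "c": [1, 1, 2, 2], "d": [0, 1, 0, 1]}), pd.Series([0, 0, 1, 1]))`.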
          <p>After choosing the dataset discretized with MDLP, we
evaluated the AUC of the three algorithms used in modeling.
The results were very close, approximately 0.82 for all
three. We therefore chose the model created with Naïve
Bayes as our final model, since it is the simplest and most
interpretable.</p>
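          <p>For reference, an AUC value like those above can be computed with a small rank-based estimator, equivalent to the Mann-Whitney statistic described by Bradley (1997): the probability that a randomly chosen positive example outscores a randomly chosen negative one. The labels and scores in the example are illustrative, not the project's data.</p>

```python
def auc(labels, scores):
    """AUC via its rank interpretation: the fraction of positive/negative
    pairs in which the positive example gets the higher score
    (ties count as half a win)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` yields 0.75: three of the four positive/negative pairs are ranked correctly.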
          <p>The dataset labeled Testing (DTE) separated on data
preparation was then used to confirm the validity of the
final model. DTE showed AUC of approximately 0.76.
Finally, the rules of the final model were extracted. With
help from DIE experts, we derived knowledge for
corruption fight activities. Rules generated and experts’
comments were outlined to give an overview of the results.
The predictive model from this project was also deployed
in a Web application, allowing managers from CGU to
query and analyze federal management units regarding
their risk of corruption. With the results of our model,
CGU is already prioritizing corruption related activities
to help maximize audits efficacy.</p>
          <p>Therefore, this work contributed an end-to-end data
mining project, applying several state-of-the-art techniques.
We reinforced CGU's activities in fighting corruption by
building a useful model to assess risk of corruption of
federal management units. The knowledge discovered is also
increasing the expertise of DIE analysts. With the Web
application developed in this project, we help potentially
save millions in public resources. Additionally, with risk
assessment we encourage proactive audits, helping managers
plan their work. In doing so, we generate nationwide impact
in the fight against corruption.</p>
          <p>The authors would like to thank the corruption-fighting
expert Victor Steytler for providing useful insights for
the development of this work. Finally, the authors would
like to thank CGU for providing the resources necessary
for this research, as well as for allowing its
publication.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
  </back>
</article>