Bayesian Models to Assess Risk of Corruption of Federal Management Units

Ricardo S. Carvalho(1) and Rommel N. Carvalho(1,2)
(1) Department of Research and Strategic Information, Brazilian Office of the Comptroller General, SAS, Quadra 01, Bloco A, Edificio Darcy Ribeiro, Brasília, DF, Brazil
(2) Department of Computer Science, University of Brasília, Campus Darcy Ribeiro, Brasília, DF, Brazil
{ricardo.carvalho,rommel.carvalho}@cgu.gov.br

Abstract

This paper presents a data mining project that generated Bayesian models to assess the risk of corruption of federal management units. With thousands of extracted features related to corruptibility, the data were processed using techniques such as correlation analysis and per-class variance. We also compared two discretization methods: Minimum Description Length Principle (MDLP) and Class-Attribute Contingency Coefficient (CACC). The feature selection process used Adaptive Lasso. To choose our final model we evaluated three different algorithms: Naïve Bayes, Tree Augmented Naïve Bayes, and Attribute Weighted Naïve Bayes. Finally, we analyzed the rules generated by the model in order to support knowledge discovery.

1 INTRODUCTION

Corruption is a recurrent and central subject on the Brazilian government agenda, one that demands ostensive and efficient combat. Public corruption can be defined – following Brazilian Law no. 8,429, of June 1992 (http://www.planalto.gov.br/ccivil_03/leis/l8429.htm) – as misconduct or improper use of public office that leads to illicit enrichment, causes injury to the public treasury, or infringes the principles of public administration.

The Brazilian Office of the Comptroller General (CGU), an agency within the Presidency structure, has among its competences the role of assisting the President directly and immediately on matters and measures related to preventing and fighting corruption. Through its strategic information production activities, the Department of Research and Strategic Information (DIE) is the area responsible for investigating possible irregularities involving federal civil servants working in management units.

Nowadays, there are more than thirty thousand active federal management units (dataset: http://www.tesourotransparente.gov.br/ckan/dataset/siafi-relatorio-unidades-gestoras), all subject to investigation. Due to this large number of units, DIE is mostly limited to investigating those involved in large federal operations or recurrent complaints, often restricting its activities to cases triggered externally. It is therefore important to prioritize activities based on the risk of involvement in corruption, so that DIE can act more effectively and proactively.

This work has two main objectives and contributions. The first is to build a Bayesian model to assess the risk of corruption of federal management units. To this end, we apply state-of-the-art data mining techniques, along with a practical study of the information related to corruption. We thereby aim to contribute to CGU's anti-corruption activities by building a model that is useful for prioritizing its work. The step-by-step account of this data mining project may also interest other practitioners, since it combines several different methods: we show how we applied correlation analysis and two discretization methods to process features, Adaptive Lasso for feature selection, and a comparison of three different algorithms to choose our final Bayesian model.
Hence, this work contributes to practitioners by describing the application of data mining techniques with a practical objective and a singular combination of techniques.

The second objective is knowledge discovery over the information about the corruptibility of federal management units, seeking to extract new rules in this domain. To this end, the available information on management units – as well as their direct and indirect relationships with the federal civil servants working there – is analyzed with the support of DIE's experts in fighting corruption. After building our final model, we analyzed its derived rules. We thus aim to contribute to the enrichment of the experts' knowledge in fighting corruption.

In Section 2, we discuss the works most closely related to fighting corruption and how data mining has been used there, while in Section 3 we give an overview of the information selected by DIE experts to build our models. Section 4 describes the steps taken to pre-process the data, such as correlation analysis, discretization, and feature selection. Section 5 shows how we used machine learning to build several models, and Section 6 describes our evaluation strategies. Section 7 discusses our deployment efforts, and we conclude in Section 8.

2 RELATED WORK

Observing the research areas of the last decade, a topic closely related to risk of corruption is fraud detection.
The main objective of fraud detection is to reveal trends of suspicious acts. For example, an emerging theme is the use of data mining to detect financial fraud. A review of the academic literature on this application (Ngai et al., 2011) shows its successful use in detecting credit card fraud, money laundering, and bankruptcy, among others. The review also identifies the data mining techniques commonly used in fraud detection, including Artificial Neural Networks, Decision Trees, Logistic Regression, and Naïve Bayes.

In this context, a recent survey on data mining-based fraud detection (Phua et al., 2010) summarizes and reviews the published technical articles. This survey, as well as other works (Kou et al., 2004), comments on similar applications. Also, an individual-oriented corruption analysis (Carvalho et al., 2014) built a corruption risk model for politically affiliated civil servants using algorithms such as Random Forest and Bayesian networks.

Regarding corruption itself, research on public bidding and contracting processes has also been carried out, though not as widely as fraud detection. The use of clustering and association rules on the problem of cartels in public bidding processes (Silva and Ralha, 2010) found results that corroborate the application of data mining to the prevention of corruption. Another paper (Balaniuk et al., 2012) shows the use of Naïve Bayes to evaluate the risk of corruption in public procurement; the authors applied the natural logarithm to discretize attributes and based their assessment on conditional probabilities defined by experts. In addition, a recent paper (Carvalho et al., 2013) presents the use of probabilistic ontologies to design and test a model that fuses information in order to detect possible fraud in bidding processes involving federal money in Brazil.

With respect to discretization, it has recently received much attention as a pre-processing technique, mostly because many machine learning algorithms are known to produce better models when continuous attributes are discretized (Garcia et al., 2013). Two algorithms have shown generally strong performance, namely CACC (Class-Attribute Contingency Coefficient) (Tsai et al., 2008) and MDLP (Minimum Description Length Principle) (Irani, 1993). In this work we compare these algorithms after feature selection by building models on each discretized dataset, allowing us to choose the best results.

For feature selection, a recent review (Tang et al., 2014) covers several widely used techniques, such as Adaptive Lasso (Zou, 2006). The Adaptive Lasso has basically two steps. First, an initial estimator is obtained, usually via Ridge Regression (Zou, 2006). Then an optimization problem with a weighted L1 penalty is solved. The initial estimator generally puts more penalty weight on the zero coefficients and less on the nonzero ones, improving upon its predecessor, the Lasso (Zou, 2006). Compared to the Lasso, the Adaptive Lasso enjoys the oracle property (Zou, 2006), performing as well as if the true underlying model were given in advance. Compared to the SCAD and bridge methods (Tang et al., 2014), which also have the oracle property, the advantage of the Adaptive Lasso is its computational efficiency.

3 DATA UNDERSTANDING

Seeking to analyze the corruptibility of federal management units, various databases to which DIE has access were identified as useful for this work. For a better understanding of the data, the available information was divided into four dimensions: Corruption, Employment, Sanctions, and Political.

Some of the information treated in this work concerns the federal civil servants working in the management units. This information can give an idea of how much power a given unit concentrates or how much influence its civil servants bring to the unit environment.

Due to the limited size of this paper, for each dimension we give only an overview of the existing databases and of the relevant information identified by DIE experts as possibly related to corruptibility.

3.1 CORRUPTION DIMENSION

CGU maintains the Federal Administration Registry of Expelled (CEAF, http://www.portaldatransparencia.gov.br/expulsoes/entrada), a database gathering the expulsion penalties (expulsion, retirement abrogation, and dismissal) applied to federal civil servants since 2003.

This database is used to define which management units are corrupt, i.e., the positive class for our machine learning algorithms. The first paragraph of Section 4 describes how this is done.
3.2 EMPLOYMENT DIMENSION

The employment dimension covers information about the federal civil servants who work in each management unit. It includes basic information, such as time in office and income, as well as data exposing the importance of the unit a servant works in – such as the number of coordination roles or of critical public offices, like those dealing directly with public resources or financial assets.

Most of this information comes from the Human Resources Integrated System (SIAPE) of the Brazilian Federal Government (http://www.siapenet.gov.br).

For this dimension, DIE's experts in fighting corruption selected 16 different pieces of information, which can later be transformed into 16 or more features in the data preparation phase. Examples are: mean, maximum, and minimum monthly income; number of coordination roles dealing with public contracts; and number of roles for specific activities, such as head of a regional agency.

3.3 SANCTIONS DIMENSION

The sanctions dimension covers information about management units sanctioned for bad management of public money. We used the sanctions recorded in the Accounts Judged Irregular database (CADIRREG, http://contas.tcu.gov.br/cadirreg/CadirregConsultaNome) of the Federal Court of Accounts (TCU), which judges the accounts of each management unit and decides on their regularity under Brazilian law. Similarly, we used CGU's certificates of management irregularity (http://sistemas.cgu.gov.br/relats/relatorios.php).

From these sources, DIE's experts selected four different pieces of information, which can later be transformed into four or more features in the data preparation phase. Examples are: number of accounts judged irregular by TCU, and number of regularity certificates from CGU.

3.4 POLITICAL DIMENSION

The political dimension covers data on federal civil servants' political activities, namely their affiliation to political parties. By identifying the affiliated servants of each management unit, we can measure how much each political party influences the units and whether this relates to corruption. The main database comes from the Superior Electoral Court (TSE, http://www.tse.jus.br/eleicoes/estatisticas/repositorio-de-dados-eleitorais).

Taking into account the knowledge of DIE experts, we selected nine different pieces of information from the TSE data. Examples are: number of affiliations to a given political party, and total number of affiliated servants in each management unit.

4 DATA PREPARATION

The data are extracted for two classes, called "Corrupt" and "Non Corrupt". On one hand, "Corrupt" management units are those that, throughout their history, have had at least one civil servant expelled specifically for corruption – that is, units with servants registered in CEAF whose legal basis for expulsion is consistent with the definition of corruption stated in Section 1. On the other hand, to build the "Non Corrupt" group we sampled a large set of management units and removed those considered "Corrupt" by this definition, keeping the random sample proportion. The non-corrupt dataset was thus created from a random sample of approximately 4,800 federal management units – roughly 8 times the number of corrupt units.
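As a concrete illustration of this labeling step, the R sketch below builds the two classes. It is only a sketch under assumed names: the tables `units` and `ceaf_corruption` and the column `unit_id` are hypothetical placeholders, not the paper's actual schema.

    # Hedged sketch: label a unit as Corrupt when CEAF records at least one
    # of its servants expelled for corruption, then sample the Non Corrupt group.
    library(dplyr)

    units <- mutate(units,
                    class = ifelse(unit_id %in% ceaf_corruption$unit_id,
                                   "Corrupt", "NonCorrupt"))

    corrupt <- filter(units, class == "Corrupt")
    set.seed(42)                                   # reproducibility (our choice)
    non_corrupt <- units %>%
      filter(class == "NonCorrupt") %>%
      sample_n(8 * nrow(corrupt))                  # roughly the 8:1 ratio reported

    dataset <- bind_rows(corrupt, non_corrupt)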
The data preparation phase includes feature selection and goes through the following steps, described in the next sections:

• Data Cleaning and Feature Engineering: adjusts the dataset;
• Preliminary Analysis: treats zero per-class variance and perfect correlation;
• Data Separation: segregates data into training and testing sets;
• Intermediary Analysis: variance and correlation filtering;
• Feature Selection: uses Adaptive Lasso;
• Discretization: applies MDLP and CACC.

4.1 DATA CLEANING AND FEATURE ENGINEERING

Besides the usual data cleaning activities – such as fixing inconsistencies, converting data, and standardizing data types – we also treated missing values. For categorical variables we created a category "NA" representing the absence of a value. For counting variables, a missing value represents an actual value of zero, so it was replaced accordingly. Other fields with missing values were treated individually; for example, the date of cancellation of a party affiliation, when the affiliation was still active, was replaced by the current date in order to create time-of-affiliation features.

For feature engineering, we first created binarized features for all categorical variables. Then, since some information can be registered more than once for a given management unit – for example, a unit can have several regularity certificates – we had to summarize the features per unit. With only numerical features remaining, some were summarized by creating maximum, minimum, average, and total features. For example, annual income was transformed into maximum, minimum, and mean annual income.

After this step, we had created 2,238 different features.

4.2 PRELIMINARY ANALYSIS

First we removed features whose variance within one of the classes was zero, since with zero class-variance, algorithms may produce coefficient estimates that do not generalize (Hosmer et al., 2013). After calculating the class-variance of each of the 2,238 features, 747 of them were removed – most of them related to binarized categorical variables.

We also preliminarily addressed perfect pairwise correlation, which amounts to redundant information and may bias the estimates. Perfectly correlated features may have been added accidentally or may have arisen during feature engineering. Among the 1,495 remaining variables, 96 – 48 pairs – were perfectly correlated; DIE experts chose which feature to eliminate in each pair.

4.3 DATA SEPARATION

At this point, our complete dataset had 688 corrupt units and 4,792 non-corrupt units, with 1,447 features.

In this step we created two datasets: Training Data (DT) and Testing Data (DTE). The first is used throughout data preparation and modeling, while the second is used only as a final test after choosing the best final model. To keep the original balance, DTE was created from a random sample of 20% of the corrupt plus 20% of the non-corrupt units, and DT kept the remaining 80% of the complete dataset.

4.4 INTERMEDIARY ANALYSIS

As in the Preliminary Analysis, we again analyzed the class-variance, which resulted in removing 62 features with zero variance in one of the classes.

In the intermediary analysis, however, we performed a different correlation analysis, following the well-known hypothesis (Hall, 1999): "A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other."

We first calculated the correlation matrix of the remaining 1,376 features, adding their correlation with the class column indicating corruptibility – 0 for non-corrupt units and 1 for corrupt units. We then flagged pairs of features with absolute correlation equal to or greater than 0.70, a value generally considered high correlation (Taylor, 1990). Next, the matrix was sorted in descending order of the features' correlation with the class, and its rows were traversed from the feature most correlated with the class downward. In each row, we kept the feature with the highest class correlation and removed from the dataset and the matrix the remaining features whose absolute inter-correlation with it was 0.70 or higher.

With this procedure we eliminated 468 features, leaving 910. This approach was used to mitigate the collinearity problem, given that it is impossible to analyze all possible combinations of feature groups in this work. The heuristic of ranking each feature by its correlation with the class – although not fully reflected in a model, due to interactions between features – serves to keep the theoretically most significant features. (It may be useful to analyze correlation with different methods in future work.)
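A minimal sketch of this filtering heuristic follows, assuming a numeric feature matrix `X` and a 0/1 class vector `y` (our names, not the paper's):

    # Rank features by absolute correlation with the class, then greedily
    # drop any remaining feature correlated >= 0.70 with a kept one.
    cm        <- abs(cor(cbind(X, class = y)))
    class_cor <- cm[colnames(X), "class"]
    ranked    <- names(sort(class_cor, decreasing = TRUE))

    kept <- character(0); dropped <- character(0)
    for (f in ranked) {
      if (f %in% dropped) next
      kept    <- c(kept, f)
      too_cor <- names(which(cm[f, colnames(X)] >= 0.70))
      dropped <- union(dropped, setdiff(too_cor, kept))
    }
    X_filtered <- X[, kept]

Because a feature is dropped as soon as an earlier, more class-correlated feature covers it, the kept features are mutually inter-correlated below the 0.70 threshold, matching the traversal described above.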
4.5 FEATURE SELECTION

To perform feature selection, the dataset passes through a regularized regression, specifically the Adaptive Lasso. We start by running Ridge Regression with 10-fold cross-validation on the DT dataset; the coefficient estimates are used to construct a vector of adaptive weights. With this vector introduced as the penalty factor, we then run the Adaptive Lasso with 10-fold cross-validation. Notably, the Adaptive Lasso can force some coefficient estimates to be exactly zero, thereby reducing the number of features.

After feature selection with Adaptive Lasso, 144 features remained. The 10-fold cross-validation yielded an AUC (Area Under the ROC Curve) (Bradley, 1997) of 0.85, which we considered satisfactory.
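The paper does not name the implementation used; below is one plausible sketch of the two-step procedure with the glmnet package, where `x` is the numeric feature matrix and `y` the binary class vector (our names):

    library(glmnet)

    # Step 1: ridge regression (alpha = 0) with 10-fold CV; its coefficients
    # define the adaptive weights (larger penalty on smaller coefficients).
    ridge <- cv.glmnet(x, y, family = "binomial", alpha = 0, nfolds = 10)
    w <- 1 / abs(as.matrix(coef(ridge, s = "lambda.min"))[-1, 1])

    # Step 2: weighted L1 fit (alpha = 1) with the weights as penalty factors.
    alasso <- cv.glmnet(x, y, family = "binomial", alpha = 1,
                        nfolds = 10, penalty.factor = w)

    # Features whose coefficients were not shrunk to exactly zero are selected.
    b <- as.matrix(coef(alasso, s = "lambda.min"))
    selected <- rownames(b)[b[, 1] != 0 & rownames(b) != "(Intercept)"]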
4.6 DISCRETIZATION

In recent years, discretization has received increasing research attention (Garcia et al., 2013). For non-monotonic variables, discretization proves essential, since it makes it possible to split an originally non-monotonic variable into several monotonic derived covariates (Tufféry, 2011). Also, some algorithms for Bayesian models require all features to be categorical, and discretization is a way to achieve this.

In recent research (Garcia et al., 2013), two algorithms have shown generally strong performance: MDLP (Minimum Description Length Principle) (Irani, 1993) and CACC (Class-Attribute Contingency Coefficient) (Garcia et al., 2013). We compare these algorithms by later building models on the groups of features discretized with each method.

Accordingly, we generated two different datasets from DT, one per discretization method. The dataset discretized with MDLP contained 23 binary features, while CACC returned 66 – these datasets have fewer features than the original because constant features were automatically removed.

5 MODELING

In the modeling phase we created models for each of the datasets discretized with MDLP and CACC, building Bayesian models with three different algorithms: Naïve Bayes (Lowd and Domingos, 2005), Tree Augmented Naïve Bayes (Zheng and Webb, 2011), and Attribute Weighted Naïve Bayes (Taheri et al., 2014).

This task was done with the R package caret (https://cran.r-project.org/web/packages/caret/index.html). We used 10-fold cross-validation to evaluate AUC and tried several combinations of parameters for each of the three algorithms – from 20 to 60 combinations. For example, for Tree Augmented Naïve Bayes we used three score functions (loglik, bic, aic), each alongside 20 different values of the smoothing parameter (from 0 to 19). After these models were built, caret selected, for each algorithm, the parameter combination with the best AUC.
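The sketch below illustrates this search with caret. The paper names the package and the tuning ranges but not the exact method strings; we assume caret's bnclassify-based models ("tan" and "nbDiscrete"), and `train_x`/`train_y` are our placeholder names (predictors must be factors for these methods, and the class a two-level factor).

    library(caret)

    ctrl <- trainControl(method = "cv", number = 10,
                         classProbs = TRUE, summaryFunction = twoClassSummary)

    # Tree Augmented Naive Bayes: 3 scores x 20 smoothing values = 60 combinations
    tan_fit <- train(x = train_x, y = train_y, method = "tan",
                     metric = "ROC", trControl = ctrl,
                     tuneGrid = expand.grid(score = c("loglik", "bic", "aic"),
                                            smooth = 0:19,
                                            stringsAsFactors = FALSE))

    # Naive Bayes on the discretized features: 20 smoothing values
    nb_fit <- train(x = train_x, y = train_y, method = "nbDiscrete",
                    metric = "ROC", trControl = ctrl,
                    tuneGrid = expand.grid(smooth = 0:19))

    tan_fit$bestTune  # parameter combination with the best cross-validated AUC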
5.1 DISCRETIZATION SELECTION

The first step is to choose the most suitable discretization. With this in mind, for each discretized dataset we take the average AUC of the three algorithms, again using 10-fold cross-validation to estimate out-of-sample performance. The mean AUC values are shown in Table 1, alongside the number of features in each dataset.

Table 1: Mean Results of Bayesian Models for each Discretized Dataset

  Discretization   No. of features   AUC
  MDLP                          23   0.82
  CACC                          66   0.83

Although the results for the CACC-discretized dataset were slightly better, it is desirable to minimize the number of features in a model: models with fewer features tend to be more numerically stable and more easily adopted, and they also reduce overfitting and increase interpretability. We therefore chose the features discretized with MDLP, since the respective model achieved results close to CACC's while keeping almost three times fewer features.

5.2 MODEL SELECTION

With the discretized dataset chosen, we now evaluate the Bayesian models built with the three algorithms: TAN (Tree Augmented Naïve Bayes), AW-NB (Attribute Weighted Naïve Bayes), and NB (Naïve Bayes). The AUC outcomes are shown in Table 2.

Table 2: Results of Bayesian Models for MDLP Dataset

  Algorithm   AUC
  TAN         0.8272
  AW-NB       0.8207
  NB          0.8244

Observing the results, we chose the model built with NB (Naïve Bayes) as our final model, since it is simpler and more interpretable while keeping practically the same performance as the other two models.

6 EVALUATION

In the evaluation phase, we start by analyzing the results of the final model on the testing data separated at the beginning of this work. We then analyze the conditional probabilities of the features to extract useful knowledge for fighting corruption.

6.1 TESTING DATA

To ultimately validate our final model, we used the dataset separated for this purpose in the data preparation phase: the testing dataset (DTE). The first step is to adjust DTE to the same 23 final features selected with MDLP discretization. Applying the final model to DTE yielded an AUC of approximately 0.76. We consider this satisfactory: the result is only slightly below that obtained on the training dataset and above 0.70, a common threshold for good models.
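As an illustration, this hold-out check could be computed as below with the pROC package (the paper reports the AUC but not the tool used to compute it; `test_x`/`test_y` are our placeholder names, with `test_x` restricted to the same 23 features):

    library(pROC)

    # Class probabilities from the final caret model on the hold-out set
    probs <- predict(nb_fit, newdata = test_x, type = "prob")[, "Corrupt"]

    roc_dte <- roc(response = test_y, predictor = probs,
                   levels = c("NonCorrupt", "Corrupt"))
    auc(roc_dte)  # the paper reports approximately 0.76 on DTE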
6.2 KNOWLEDGE DISCOVERY

Observing the conditional probabilities of the final model, we extracted the rules it follows to assess the corruptibility of federal management units. This knowledge discovery aims to contribute to anti-corruption activities. Some of the main extracted rules indicating an increased risk of corruption are listed below.

• Accounts judged irregular by TCU;
• Responsibilities related to financial activities;
• Substitution in public functions that control expenses;
• Number of requested civil servants allocated;
• Heading roles in regional agencies;
• Political party affiliations;
• Activities spread over multiple municipalities; and
• Number of public offices occupied by designation (without a selective process).

After discussing the main rules with DIE experts, they made a few comments to rationalize the knowledge discovered by the model:

• Accounts judged irregular by TCU are, by definition, scenarios involving inadequacies or irregularities;
• Responsibilities related to expenses and financial activities are critical, since they involve public resources and possible embezzlement;
• A management unit with several civil servants allocated by request may indicate a weak internal career structure;
• Heading roles in regional management units usually give civil servants a relatively high amount of discretionary decision-making power, displaying a scenario of high propensity to corruption;
• Political party affiliations relate to greater political influence over decisions of public interest in the federal management units;
• Units with activities in many municipalities have to deal with decentralization problems; and
• Public offices filled by designation are occupied through nomination by discretionary authorities, not necessarily on merit.

Analyzing the rules together with the experts' comments, we see that the results are reasonably suitable for scenarios involving federal management units.

7 DEPLOYMENT

In the deployment phase, we created a Web application that allows managers at CGU to query management units and analyze their risk of corruption. With paths of grouped queries, managers can now view management units organized by their agencies. They can also perform ad-hoc queries, using unique identifiers of management units as input, to obtain a risk-of-corruption analysis for an individual unit or for groups of units.

To deploy the predictive model, we implemented the Naïve Bayes calculation with the conditional probabilities of the features selected in our final model. Using the output probabilities given by the model, we then manually discretized the results into risk categories: less than 0.20 as Very Low; 0.20 to below 0.40 as Low; 0.40 to below 0.60 as Medium; 0.60 to below 0.80 as High; and 0.80 or greater as Very High.

The Web application also generates PDF reports containing, for a given management unit, its risk of corruption and the average and maximum risk of corruption of the management units in the same agency. The application shows not only risk results but also several other government data related to each management unit, allowing a general view of each unit.

With the application running, we began presenting this work to all areas of CGU. Currently, several activities involving management units are being prioritized using our risk-of-corruption predictive model together with other information.
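The risk banding described above maps directly onto a single call in R, applied to the model's output probabilities (`probs`, as in the earlier sketch):

    # Map posterior probabilities to the five deployment risk categories
    risk <- cut(probs,
                breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1),
                labels = c("Very Low", "Low", "Medium", "High", "Very High"),
                right = FALSE, include.lowest = TRUE)

With right = FALSE, each interval is closed on the left, so a probability of exactly 0.20 falls into Low, matching the thresholds listed above; include.lowest = TRUE keeps a probability of exactly 1 in Very High.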
8 CONCLUSION

This paper described a data mining project that generated Bayesian models to assess the risk of corruption of federal management units. We analyzed data from several government databases and, with the help of DIE experts, engineered thousands of relevant features. These variables were prepared and pre-processed, removing those with zero class-variance and high inter-correlation.

Feature selection was done with Adaptive Lasso, which selected the 144 most relevant features. We compared two discretization methods, CACC and MDLP, building Bayesian models on each discretized dataset with three algorithms: Naïve Bayes, Tree Augmented Naïve Bayes, and Attribute Weighted Naïve Bayes. To choose the best discretization method, we averaged the 10-fold cross-validation metrics per dataset; MDLP was chosen for its strong results combined with a considerable reduction in the number of selected features – from 144 to 23.

After choosing the MDLP-discretized dataset, we evaluated the AUC of the three modeling algorithms. The results were very close, at approximately 0.82, so we chose the Naïve Bayes model as our final model, since it is simpler and more interpretable. The Testing dataset (DTE) separated during data preparation was then used to confirm the validity of the final model, yielding an AUC of approximately 0.76.

Finally, the rules of the final model were extracted and, with help from DIE experts, turned into knowledge for anti-corruption activities; the generated rules and the experts' comments were outlined to give an overview of the results. The predictive model was also deployed in a Web application, allowing managers from CGU to query and analyze federal management units by their risk of corruption. With the results of our model, CGU is already prioritizing corruption-related activities to help maximize audit efficacy.

This work thus contributed an end-to-end overview of a data mining project applying several state-of-the-art techniques. We reinforced CGU's activities in fighting corruption by building a useful model to assess the risk of corruption of federal management units, and the knowledge discovered is increasing the expertise of DIE analysts. With the Web application developed from this project, we help potentially save millions in public resources; additionally, risk assessment encourages proactive audits, helping managers plan their work. In this way, we generate nationwide impact in fighting corruption.

Acknowledgements

The authors would like to thank the corruption fighting expert Victor Steytler for providing useful insights for the development of this work. Finally, the authors would like to thank CGU for providing the resources necessary for this research, as well as for allowing its publication.

References

R. Balaniuk, P. Bessiere, E. Mazer, and P. Cobbe. Risk based government audit planning using naïve Bayes classifiers. In Advances in Knowledge-Based and Intelligent Information and Engineering Systems, 2012.

Andrew P. Bradley. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7):1145–1159, 1997.

Ricardo Carvalho, Rommel Carvalho, Marcelo Ladeira, Fernando Monteiro, and Gilson Mendes. Using political party affiliation data to measure civil servants' risk of corruption. In 2014 Brazilian Conference on Intelligent Systems (BRACIS), pages 166–171. IEEE, 2014.

Rommel Carvalho, Shou Matsumoto, Kathryn B. Laskey, Paulo C. G. Costa, Marcelo Ladeira, and Laécio L. Santos. Probabilistic ontology and knowledge fusion for procurement fraud detection in Brazil. In Uncertainty Reasoning for the Semantic Web II, pages 19–40. Springer, 2013.

S. Garcia, J. Luengo, J. A. Sáez, V. Lopez, and F. Herrera. A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4):734–750, 2013.

Mark A. Hall. Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999.

David W. Hosmer, Stanley Lemeshow, and Rodney X. Sturdivant. Applied Logistic Regression, volume 398. John Wiley & Sons, 2013.

Keki B. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. 1993.

Yufeng Kou, Chang-Tien Lu, Sirirat Sirwongwattana, and Yo-Ping Huang. Survey of fraud detection techniques. In 2004 IEEE International Conference on Networking, Sensing and Control, volume 2, pages 749–754. IEEE, 2004.

Daniel Lowd and Pedro Domingos. Naive Bayes models for probability estimation. In Proceedings of the 22nd International Conference on Machine Learning, pages 529–536. ACM, 2005.

E. W. T. Ngai, Yong Hu, Y. H. Wong, Yijun Chen, and Xin Sun. The application of data mining techniques in financial fraud detection: A classification framework and an academic review of literature. Decision Support Systems, 50(3):559–569, 2011.

Clifton Phua, Vincent Lee, Kate Smith, and Ross Gayler. A comprehensive survey of data mining-based fraud detection research. arXiv preprint arXiv:1009.6119, 2010. URL http://arxiv.org/abs/1009.6119.

Carlos Vinícius Silva and Célia Ralha. Utilização de técnicas de mineração de dados como auxílio na detecção de cartéis em licitações [Use of data mining techniques to support the detection of cartels in public bidding]. In WCGE – II Workshop de Computação Aplicada em Governo Eletrônico, 2010.

Sona Taheri, John Yearwood, Musa Mammadov, and Sattar Seifollahi. Attribute weighted naive Bayes classifier using a local optimization. Neural Computing and Applications, 24(5):995–1002, 2014.

Jiliang Tang, Salem Alelyani, and Huan Liu. Feature selection for classification: A review. In Data Classification: Algorithms and Applications, page 37, 2014.

Richard Taylor. Interpretation of the correlation coefficient: a basic review. Journal of Diagnostic Medical Sonography, 6(1):35–39, 1990.

Cheng-Jung Tsai, Chien-I Lee, and Wei-Pang Yang. A discretization algorithm based on class-attribute contingency coefficient. Information Sciences, 178(3):714–731, 2008.

Stéphane Tufféry. Data Mining and Statistics for Decision Making. John Wiley & Sons, 2011.

Fei Zheng and Geoffrey I. Webb. Tree augmented naive Bayes. In Encyclopedia of Machine Learning, pages 990–991. Springer, 2011.

Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.