<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rommel</forename><forename type="middle">N</forename><surname>Carvalho</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Research and Strategic Information (DIE) Brazil&apos;s Office of the Comptroller General (CGU) SAS</orgName>
								<address>
									<addrLine>Quadra 01, Bloco A, Edifício Darcy Ribeiro Brasília</addrLine>
									<region>DF</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Leonardo</forename><forename type="middle">J</forename><surname>Sales</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Research and Strategic Information (DIE) Brazil&apos;s Office of the Comptroller General (CGU) SAS</orgName>
								<address>
									<addrLine>Quadra 01, Bloco A, Edifício Darcy Ribeiro Brasília</addrLine>
									<region>DF</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Henrique</forename><forename type="middle">A</forename><surname>Da Rocha</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Research and Strategic Information (DIE) Brazil&apos;s Office of the Comptroller General (CGU) SAS</orgName>
								<address>
									<addrLine>Quadra 01, Bloco A, Edifício Darcy Ribeiro Brasília</addrLine>
									<region>DF</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Gilson</forename><forename type="middle">L</forename><surname>Mendes</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Department of Research and Strategic Information (DIE) Brazil&apos;s Office of the Comptroller General (CGU) SAS</orgName>
								<address>
									<addrLine>Quadra 01, Bloco A, Edifício Darcy Ribeiro Brasília</addrLine>
									<region>DF</region>
									<country key="BR">Brazil</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">03C5E3E911DC401521A2579BFE0C7513</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T17:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>To cope with society's demand for transparency and corruption prevention, the Brazilian Office of the Comptroller General (CGU) has carried out a number of actions, including awareness campaigns aimed at the private sector, campaigns to educate the public, research initiatives, and regular inspections and audits of municipalities and states. Although CGU has collected information from various sources (the Revenue Agency, the Federal Police, and others), going through all the data in order to find suspicious transactions has proven to be challenging. In this paper, we present a Data Mining study applied to real data on government purchases, aimed at finding transactions that might become irregular before they are considered as such, in order to act proactively. Moreover, we compare the performance of various Bayesian Network (BN) learning algorithms with different parameters in order to fine-tune the learned models and improve their performance. The best result was obtained using the Tree Augmented Network (TAN) algorithm and oversampling the minority class in order to balance the data set. Using 10-fold cross-validation, the model correctly classified all split purchases, obtained a ROC area of .999, and reached an accuracy of 99.197%.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The Brazilian Office of the Comptroller General (CGU) is the central body of the internal control system of the Brazilian federal executive branch. Among its responsibilities is the task of inspecting and auditing Brazilian Government projects and programs with respect to their legality, results, efficacy, and efficiency. [* Department of Computer Science (CIC), University of Brasília (UnB), Brasília, DF, Brazil] A primary responsibility of CGU is to prevent and detect government corruption. To carry out this mission, CGU must gather information from a variety of sources and combine it to evaluate whether further action, such as an investigation, is required. One of the most difficult challenges is the information explosion. Auditors must fuse vast quantities of information from a variety of sources in a way that highlights its relevance to decision makers and helps them focus their efforts on the most critical cases. This is no trivial duty. The Growth Acceleration Program (PAC) alone has a budget greater than 250 billion dollars, with more than one thousand projects in the state of São Paulo alone<ref type="foot" target="#foot_0">1</ref>. All of these have to be audited and inspected by CGU, despite its having only three thousand employees. Therefore, CGU must optimize its processes in order to carry out its mission.</p><p>In Brazil, all contracts with the private sector must be in accordance with Law No. 8,666/93, also known as the national Procurement Law. According to <ref type="bibr" target="#b14">[15]</ref>, procurement is the administrative procedure by which the Public Administration selects the most advantageous proposal for a contract in its interest. From this definition, it follows that the public interest must always be the objective of the procedure. 
In terms of purchasing with public money, this means that the winner of the procurement process must be the best supplier not only in terms of the price of the good or service supplied, but also in terms of the other objectives of the procurement process.</p><p>Corruption can happen in many ways in public procurement <ref type="bibr" target="#b15">[16]</ref>. A public agent may favor a specific supplier that she happens to know. She may receive, from a bidder, financial compensation for awarding a contract to that firm. Bidders may collude to fix the results of the procurement. The whole process is susceptible to many forms of corruption, from both within and outside the public administration.</p><p>The government purchases large quantities of goods and services. It is also the sole purchaser of some goods, such as hydraulic turbines for large dams. The government spends large amounts of money in the market and is a guaranteed payer. Hence, many firms have a strong interest in negotiating with the public administration, and many suppliers are tempted to cheat in the procurement process in order to be awarded a lucrative government contract.</p><p>The Brazilian Procurement Law has as one of its main objectives the curbing of corruption in public purchasing and contracting. Concern over corruption is evident in Brazil's daily press. There are frequent accusations of public administrators who did not abide by the procurement rules and are accused of favoring a certain supplier or, worse, receiving a payment in the process.</p><p>When writing the law, legislators included many articles that established penalties for firms and/or public agents caught in corrupt activities. There are two types of penalties stated in Law No. 8,666/93 dealing with this subject. 
They are administrative actions and penal actions.</p><p>Since enforcing the law is difficult <ref type="bibr" target="#b15">[16]</ref>, legislators will have to find another way to prevent corruption in public procurement. The question is one of preventing corrupt practices rather than merely punishing those that have already occurred.</p><p>Therefore, the main objective of this project is to apply Data Mining methods to create models, more specifically to learn Bayesian Network models, that will aid experts in identifying procurement fraud and, if possible, flag suspicious transactions as early as possible in order to prevent the fraud from happening.</p><p>The structure of this paper is as follows: Section 2 presents the CRoss Industry Standard Process for Data Mining (CRISP-DM) methodology used in this work. Section 3 presents the data used and its underlying semantics. Section 4 presents the models learned as well as their evaluation. Section 5 presents ideas on how the learned model for identifying split purchases will be deployed. Finally, Section 6 presents the conclusion and future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methodology</head><p>The CRoss Industry Standard Process for Data Mining (CRISP-DM) project developed an industry- and tool-neutral data mining process model. Starting from the embryonic knowledge discovery processes used in early data mining projects and responding directly to user requirements, this project defined and validated a data mining process that is applicable in diverse industry sectors. This methodology makes large data mining projects faster, cheaper, more reliable, and more manageable. Even small-scale data mining investigations benefit from using CRISP-DM.</p><p>The process model provided by the CRISP-DM consortium can be summarized through the life cycle of the data mining process presented in Figure <ref type="figure" target="#fig_0">1</ref>, reproduced from <ref type="bibr" target="#b4">[5]</ref>. The outer circle symbolizes the cyclical nature of the data mining process. Data mining does not end when a solution is achieved; the lessons learned during one process can be useful in subsequent processes. The life cycle of a data mining project consists of six phases. The sequence of the phases is not strict, and moving back and forth between phases is often required. The arrows indicate the most important and frequent dependencies between phases. The phases are <ref type="bibr" target="#b18">[19]</ref>:</p><p>Business Understanding: This initial phase focuses on understanding the project objectives and requirements from a business perspective, then converting this knowledge into a data mining problem definition and a preliminary project plan designed to achieve the objectives. 
Data Understanding: The data understanding phase starts with an initial data collection and proceeds with activities aimed at getting familiar with the data, identifying data quality problems, discovering first insights into the data, or detecting interesting subsets to form hypotheses about hidden information.</p><p>There is a close link between Business Understanding and Data Understanding. The formulation of the data mining problem and the project plan require at least some understanding of the available data.</p><p>Data Preparation: The data preparation phase covers all activities needed to construct the final data set (the data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection as well as transformation and cleaning of data for the modeling tools.</p><p>Modeling: In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Since some techniques have specific requirements on the form of the data, stepping back to the data preparation phase is often needed.</p><p>Evaluation: Before proceeding to deployment, it is important to thoroughly evaluate the model and review the steps executed to construct it, to be certain it properly achieves the business objectives. A key objective is to determine whether there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.</p><p>Deployment: Creation of the model is generally not the end of the project. Usually, the knowledge gained will need to be organized and presented in a way that the customer can use. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the user, not the data analyst, who carries out the deployment steps. In any case, it is important to understand up front what actions will need to be carried out in order to actually make use of the created models.</p><p>The Business Understanding phase was already covered in the introduction. The following sections cover the Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment phases, although the Data Understanding phase is not covered in detail since the data and its structure are confidential.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data Understanding and Preparation</head><p>The data set used in this work relates to procurements and contracts of IT services in the Brazilian Federal Government from 2005 to 2010. This data set is actually merged data from different databases (DB) available in different agencies in Brazil.</p><p>CGU has been working closely with subject matter experts (SMEs) in order to identify certain fraud typologies, such as:</p><p>1. Identify whether owners of different companies are actually partners. In public procurement we expect competition, but if the only two enterprises participating in a procurement have owners who are partners, then there is obviously no competition at all, so this should be prevented/identified.</p><p>2. In Brazil, if a contract is small enough (8 thousand reais, or roughly 4 thousand dollars), then there is no need to go through the whole procurement process. However, sometimes a big contract is broken down into many small ones. This is called a split purchase<ref type="foot" target="#foot_1">2</ref>; it is not allowed by law and should be prevented/identified. E.g., identify many contracts with the same objective (buying computers, for instance) that sum to more than the threshold in a week or a month.</p><p>There are actually more than 20 typologies like these two described by the SMEs that we would like to identify/predict using the data set we have. However, this project focuses on the second one described above.</p><p>First, the data set was loaded in Excel, and we removed some characters (commas, double quotes, and single quotes) because they caused problems when loading the file in Weka (see <ref type="bibr" target="#b12">[13]</ref> for details about Weka). Then we replaced values like NA, -9, -8, and 0 with "?" to represent a missing value in Weka.</p><p>The initial data set had 42 attributes and 70,365 transactions. 
Using Excel, we brought the number of attributes down to 26. Most of the removed attributes were identifiers for descriptions that were also available as separate attributes; we decided to keep the descriptions because they are easier to understand. For instance, we kept the name of the state instead of its ID.</p><p>The next step was to save the data as a CSV file and load it in Weka. Then we changed the year to nominal and removed rows with missing values in the final price attribute, since we would need that value later on to identify the class.</p><p>Before running any algorithm, we started looking at the data to understand it better. The first thing we noticed was the range of the values: there were values from cents to hundreds of trillions of dollars. Besides that, there were cases where the proposed unit price was 4 thousand whereas the final actual price was a dollar. Those are typical cases of data that should not be trusted and should be considered incorrect.</p><p>The number of cases with a really high value is small, so they should be carefully analyzed before being thrown away, to make sure they are not legitimate outliers but actually inconsistent data. For that, we contacted the SMEs at CGU and asked them which ones should be ignored.</p><p>With their consent, we removed the transactions that were considered noise. The transactions that could be outliers are being analyzed by the experts in more detail.</p><p>Since we are interested in transactions that sum to more than 8,000 in a given period of time (see typology 2 above), we used R 3 to compute the transactions that involved the same institutions in the same month and year and added up to more than 8,000, and flagged them as irregular transactions 4 (i.e., split purchases).</p></div>
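The labeling rule just described (transactions by the same institution and contractor whose monthly total exceeds 8,000 reais are flagged as split purchases) can be sketched in a few lines. This is an illustrative reimplementation, not the actual R code used, and the field names are hypothetical since the real schema is confidential:

```python
from collections import defaultdict

# Threshold (in reais) below which the law allows direct contracting.
THRESHOLD = 8000

def flag_split_purchases(transactions):
    """Flag transactions whose per-institution monthly total exceeds THRESHOLD.

    `transactions` is a list of dicts with keys: institution, contractor,
    year, month, value. Adds a boolean 'irregular' key to each dict,
    the proxy label for split purchases, and returns the list.
    """
    totals = defaultdict(float)
    for t in transactions:
        key = (t["institution"], t["contractor"], t["year"], t["month"])
        totals[key] += t["value"]
    for t in transactions:
        key = (t["institution"], t["contractor"], t["year"], t["month"])
        t["irregular"] = totals[key] > THRESHOLD
    return transactions
```

Note that every transaction in an offending group is flagged, not just the one that pushes the total over the threshold, mirroring the idea that all parts of a split purchase are irregular.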
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Modeling and Evaluation</head><p>The main objective of this work is to classify, without looking at the sums, which transactions should be considered suspicious. The reason is to avoid waiting for the problem to occur (having several transactions that add up to more than 8,000 reais) and to identify the problem as soon as possible.</p><p>Before being able to classify, we had to discretize the value of the transactions, since we decided to work with learning algorithms that do not support continuous values. We used equal frequency to generate 11 bins. The number of bins could have been different; the only thing we kept in mind was that we did not want it to be binary, nor to have so many bins that only a few transactions would fall in each one.</p><p>Since Bayesian Networks (BNs) have been successfully used in classification problems (e.g., see <ref type="bibr">[4,7,10-12,17,18,</ref> 3 R is a free software environment for statistical computing and graphics. For more information see http://www.r-project.org/. 4 Usually we would also consider the type of service/product contracted/bought. However, since we limited the scope of our analysis to computer purchases only, all 10 different types of service/product are too broad and really similar (they all basically say that these are computer-related purchases, without any significant difference between them). Therefore, we can safely ignore this field when defining split purchases, since it does not add any useful information. To avoid comparing purchases of a different nature, we use the contractor field, since a company that provides software development does not usually sell computer peripherals. Unfortunately, we do not have any other structured field that we could use to improve this proxy classification. Of course, some purchases identified as split purchases may, in fact, be false positives. 
However, our current alert system would also identify those purchases as split purchases. What we are trying to do is to identify these "split purchases" as soon as possible, instead of waiting for the whole month cycle. 20]), we decided to experiment with different BN learning algorithms to classify the transactions as regular or not, in order to identify split purchases as soon as possible, without waiting for them to be confirmed as such (i.e., before several purchases sum to more than 8 thousand reais). We ran all classifiers with 10-fold cross-validation.</p><p>At first we compared the results of Naïve Bayes and Bayesian Network models. Since the data is unbalanced, with 50,911 regular cases and 2,626 irregular cases, we decided to resample the data.</p><p>Table <ref type="table" target="#tab_2">1</ref> presents the performance of the Naïve Bayes and Bayesian Network classifiers using 10-fold cross-validation. We learned models without resampling the data set, oversampling the minority class, undersampling the majority class, and both oversampling the minority class and undersampling the majority class. The metrics used to compare the models are the regular false positive (FP) rate, the irregular FP rate, accuracy, and the Receiver Operating Characteristic (ROC) area.</p><p>The standard algorithm used in Weka for BNs is K2. K2 uses a hill climbing algorithm restricted by an order on the variables. For more information on K2 see <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. The parameters used for learning the BN model were all the default values in Weka except for the maximum number of parents allowed, which we changed from 1 to 2. The estimator parameter used, which is independent of the algorithm used to learn the model, was the SimpleEstimator. The SimpleEstimator is used for estimating the conditional probability tables of a BN once the structure has been learned. 
It estimates probabilities directly from data. The only configurable parameter for this estimator is alpha, which is used for estimating the probability tables and can be interpreted as the initial count on each value, as explained in <ref type="bibr" target="#b2">[3]</ref>. The default value for alpha is .5.</p><p>When oversampling, we used a sample size of 200% to avoid removing known cases of the majority class, which amounts to oversampling the minority class. The goal was to increase the number of irregular cases to match the number of regular ones.</p><p>We can see that we obtained much better results with the Bayesian Network when oversampling the minority class. Moreover, the false positive rate for the regular cases, which is our main concern since it represents the number of irregular transactions that were classified as regular, dropped significantly after oversampling (from .363 to .140 for Naïve Bayes and from .452 to .005 for the Bayesian Network). As a result, the FP rate for the irregular cases increased, while the accuracy decreased for NB and slightly increased for BN. Nevertheless, since we are more interested in correctly classifying the irregular cases, the result with oversampling is better.</p><p>When undersampling, we decided to use a sample size of (2,626/50,911) × 2 ≈ 10%. Finally, we tried doing both oversampling and undersampling by keeping the sample size at 100%. As can be seen from Table <ref type="table" target="#tab_2">1</ref>, oversampling the minority class gave us the best results both in accuracy and in ROC area. Besides that, we had a significant decrease in the regular FP rate, which is our main concern. The accuracy without resampling is higher than most of the other results with resampling because of the increase in the irregular FP rate: since regular is the majority class, the number of correctly classified cases is much larger. 
However, as explained before, the main objective is to obtain a low regular FP rate and not necessarily high accuracy, which could merely mean that we are getting the majority class right.</p><p>As BN gave the best result, we decided to try different search algorithms. The algorithms used were K2 (again using the default parameters), Hill Climber<ref type="foot" target="#foot_2">5</ref>, Tabu Search<ref type="foot" target="#foot_3">6</ref>, and Tree Augmented Network<ref type="foot" target="#foot_4">7</ref> (TAN).</p><p>The options for the Hill Climber algorithm in Weka are the same as for K2, except that it does not have the randomOrder parameter. In addition, it has useArcReversal, which, when set to true (the default is false), enables the arc reversal operation in the search. As with K2, we used all default parameters, except for the maximum number of parents and the score metric (we used MDL instead of Bayes).</p><p>The Tabu Search algorithm in Weka has the same options as Hill Climber. In addition, it has runs, which sets the number of steps to be performed (default is 10), and tabuList, which sets the length of the tabu list (default is 5). As with both K2 and Hill Climber, we used all default parameters, except for the maximum number of parents and the score metric (we used MDL instead of Bayes).</p><p>The TAN algorithm in Weka only has the markovBlanketClassifier and scoreType options, with the same values as K2. We used the default parameters. Note that the TAN algorithm does not allow configuration of the maximum number of parents, since it is always 2.</p><p>Table <ref type="table" target="#tab_3">2</ref> shows a summary of the performance of the BN classifiers using different search algorithms with oversampled data. It is worth noting that we were only able to use the default Bayes metric with the K2 and TAN algorithms; we had to use the MDL metric in order to avoid out-of-memory errors with the Hill Climbing and Tabu Search algorithms. 
This difference might be the reason why K2 and TAN outperform both Hill Climbing and Tabu Search, since we tried both K2 and TAN using the MDL metric and did obtain worse results. As can be seen, the TAN search algorithm had the best performance, with a ROC area of .999.</p><p>Since oversampling produced very good results, we decided to also try the Synthetic Minority Oversampling TEchnique (SMOTE) <ref type="bibr" target="#b5">[6]</ref> to oversample the minority class. We used 1800% to get roughly the same number of transactions in both classes. After we oversampled the data, we discretized it as before (11 bins with equal frequency). However, even though SMOTE oversampling gave better results for most metrics, the standard oversampling method outperformed SMOTE on the regular FP rate, which is the one we are most interested in. Therefore, we do not present the details of those results here.</p><p>Since the TAN algorithm presented the best results, we decided to try different values for the alpha parameter in order to improve its performance. Table <ref type="table" target="#tab_4">3</ref> presents the performance for the alpha values .7, .5, .3, .1, and .05. On the one hand, when we increased the default value of .5 to .7, the performance slightly worsened. On the other hand, as we decreased the default value we kept getting slightly better results down to .1; at .05, the performance started worsening again.</p><p>Table <ref type="table" target="#tab_5">4</ref> presents the confusion matrix for the best model we obtained, which was learned using TAN with the value .1 for the alpha parameter. As can be seen, every single split purchase (irregular) was correctly classified. 
Although the regular FP rate in Table <ref type="table" target="#tab_4">3</ref> suggests that alpha values of .5, .3, and .05 also classified all irregular instances correctly, the model with .5 alpha misclassified 7 instances (its regular FP rate is shown as 0 due to rounding, since the value is actually .00013). Besides that, even though the irregular FP rate is the same for the models with .1 and .05 alpha, the model with .05 alpha misclassified 5 more instances than the model with .1 alpha.</p><p>In order to better understand the relationships between the variables, we generated some association rules using the Apriori algorithm <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b13">14]</ref>. In the first run we used .1 as the lower bound on minimum support, 1 as the upper bound on minimum support, lift as the metric type, 1.1 as the minimum metric value, and 50 as the number of rules. For more information on each of these parameters see http://weka.sourceforge.net/doc.dev/weka/associations/Apriori.html.</p><p>The results were related to variables that were not of interest, and even though the rules had high confidence (1) and high lift (1.82), they did not describe anything we did not already know. So we decided to remove those variables from the data set and run the algorithm again to see the new rules it would generate.</p><p>We analyzed all 50 new rules generated, and they still did not provide any insightful information. Besides that, these new rules were not as "strong" as the previous ones, in the sense that the best confidence was .62 and the best lift was 1.11.</p><p>The reason for not presenting the rules is twofold. The first is that, as previously explained, they did not provide any insightful information. The second is that they are confidential and we cannot discuss their contents. 
All that can be said is that they were not useful.</p><p>Finally, we analyzed the BN that had the best evaluation results (TAN with .1 alpha on the standard oversampled data set) to try to better understand the relations between the variables and comprehend why it achieves such good results. This could provide insightful information that we were not able to extract from the association rules.</p><p>By analyzing the structure of the BN, we were able to identify some interesting relations associated with the government office responsible for the procurement and also with the winner of the procurements. However, due to the confidential nature of the problem, we are not allowed to discuss these relations further. Nevertheless, we note that better understanding these relations requires a more detailed study of the conditional probability tables involved, which was not done in this work due to time constraints.</p></div>
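Since the regular FP rate drives most of the decisions in this section, it is worth making its definition concrete. The sketch below computes per-class FP rates (in the Weka sense: the fraction of instances of the other class labeled as this class) and accuracy from the four cells of a binary confusion matrix; the function and argument names are ours, for illustration only:

```python
def evaluate(rr, ri, ir, ii):
    """Per-class FP rates and accuracy from a 2x2 confusion matrix.

    rr: regular classified as regular    ri: regular classified as irregular
    ir: irregular classified as regular  ii: irregular classified as irregular
    """
    # FP rate for class "regular": irregular cases wrongly labeled regular.
    # This is the critical metric here: each such case is a missed split purchase.
    regular_fp_rate = ir / (ir + ii)
    # FP rate for class "irregular": regular cases wrongly flagged as suspicious.
    irregular_fp_rate = ri / (ri + rr)
    accuracy = (rr + ii) / (rr + ri + ir + ii)
    return regular_fp_rate, irregular_fp_rate, accuracy
```

For example, a model that catches every irregular case but wrongly flags 10 of 100 regular ones has a regular FP rate of 0 and an irregular FP rate of .1, which is exactly the trade-off preferred throughout this section regardless of raw accuracy.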
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Deployment</head><p>As explained before, the main goal is to identify suspicious transactions as soon as possible to avoid the occurrence of split purchases.</p><p>In the data set we worked on, split purchases happen when a procurement that should be carried out just once, with a value higher than 8,000, is broken down into smaller ones to avoid the normal procurement procedure, allowing the office to hire an enterprise directly.</p><p>So, the idea is to detect one or more of these procurements as soon as they become suspicious (using the classifier), even before they sum to more than 8,000, and act proactively. This will allow a faster response and prevent irregularities in the first place. It means that instead of waiting a whole month to find an irregular transaction, we will find suspicious ones (not necessarily irregular ones) and warn/educate the people involved that they should be careful, since breaking down big procurements into small ones is not allowed.</p><p>Hopefully, this will make them think twice before continuing with these kinds of transactions, if they are really considering doing irregular transactions on purpose. At the same time, it will educate those who are not aware that such transactions are prohibited by law.</p><p>Another thing to keep in mind when deploying this classifier is that, as we warn/teach the different offices in Brazil about this type of fraud, they are expected to decrease the number of irregular transactions associated with it. Therefore, it is important to keep track of these numbers in order to evaluate whether this expected behavior is actually happening and at what rate.</p><p>Besides that, as people learn and change their behavior, it is also expected that our classifier will need modifications to cope with these changes. 
Therefore, every month or so a new classifier should be trained and evaluated, to learn the new behavior and analyze its performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This project shows how Data Mining, more specifically BN learning, can be used to identify and prevent split purchases in Brazil using real data provided by Brazil's Office of the Comptroller General (CGU).</p><p>At first, the idea was to allow the identification of several different types of irregularities. However, as we started working with the data and creating the different models, it became clear that this requires great effort, and after discussing with stakeholders at CGU, we decided to prioritize depth over breadth. In other words, it was decided that it was best to choose one type of fraud and do a thorough job of developing a model that is good enough to actually deploy and analyze the results before looking at other types of fraud.</p><p>Fortunately, the results paid off, in the sense that we were able to generate a model that correctly classified all split purchases, with a really high ROC area (.999). The next step is to deploy this classifier in CGU's intelligence system and analyze its impact.</p><p>As future work, it is necessary to better understand why the model is capable of such good results. By analyzing the structure of the BNs generated, it is possible to understand relations between variables that were previously unknown. A brief analysis showed some interesting relations, but a more detailed analysis is needed.</p><p>Even though we used resampling to deal with the rare event of split purchases and cross-validation to avoid overfitting, we may not have completely ruled out the possibility that the model is slightly overfitted. Therefore, before deploying the model, we will gather new data from recent purchases in order to test its performance on new, unseen data. 
Due to time constraints, we have not been able to gather the new data and perform this test yet.</p><p>Besides that, there are a lot of different types of frauds that were not analyzed. So, a starting point for future work is to do the same kind of analysis done in this project with different types of fraud.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Phases of the CRISP-DM Process Model</figDesc><graphic coords="2,327.60,166.82,234.00,212.50" type="bitmap" /></figure>
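The two safeguards mentioned in the conclusion, oversampling the rare irregular class and k-fold cross-validation, can be sketched in a few lines of Python. This is an illustrative stand-in for the WEKA filters actually used in the paper; `oversample` and `kfold_indices` are hypothetical helpers, not CGU code.

```python
import random

def oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority-class rows until both classes
    have the same number of instances (a simple stand-in for the
    resampling used in the paper)."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    majority = [r for r, y in zip(rows, labels) if y != minority_label]
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    new_rows = majority + minority + extra
    new_labels = ([y for y in labels if y != minority_label]
                  + [minority_label] * (len(minority) + len(extra)))
    return new_rows, new_labels

def kfold_indices(n, k, seed=0):
    """Shuffled k-fold split: yields (train_idx, test_idx) pairs so each
    instance is used for testing exactly once (cross-validation)."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Toy imbalanced data set: 5 irregular purchases among 100
rows = [[i] for i in range(100)]
labels = ["irregular" if i < 5 else "regular" for i in range(100)]
bal_rows, bal_labels = oversample(rows, labels, "irregular")
# After balancing, both classes have 95 instances
```

Note that oversampling by duplication must be applied inside each training fold (never to the test fold), otherwise duplicated minority instances leak into the evaluation and inflate the scores, which is exactly the overfitting risk the conclusion warns about.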
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 1 :</head><label>1</label><figDesc>Evaluation of Naïve Bayes (NB) and Bayesian Network (BN) classifiers without resampling, with oversampling, with undersampling, and with both over and undersampling</figDesc><table><row><cell>Metric</cell><cell>NB</cell><cell>BN</cell><cell cols="6">NB -Over BN -Over NB -Under BN -Under NB -Both BN -Both</cell></row><row><cell>Regular FP Rate</cell><cell>.363</cell><cell>.452</cell><cell>.140</cell><cell>.005</cell><cell>.169</cell><cell>.068</cell><cell>.146</cell><cell>.012</cell></row><row><cell>Irregular FP Rate</cell><cell>.028</cell><cell>.003</cell><cell>.048</cell><cell>.036</cell><cell>.082</cell><cell>.054</cell><cell>.054</cell><cell>.045</cell></row><row><cell>Accuracy</cell><cell cols="2">95.6% 97.5%</cell><cell>90.6%</cell><cell>97.9%</cell><cell>87.4%</cell><cell>90.6%</cell><cell>90.0%</cell><cell>97.1%</cell></row><row><cell>ROC Area</cell><cell>.931</cell><cell>.973</cell><cell>.975</cell><cell>.997</cell><cell>.945</cell><cell>.968</cell><cell>.971</cell><cell>.996</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2 :</head><label>2</label><figDesc>BN classifiers performance with oversampling and different number of parents</figDesc><table><row><cell>Metric</cell><cell cols="2">BN -K2</cell><cell cols="2">BN -HC</cell><cell cols="2">BN -Tabu</cell><cell>BN -TAN</cell></row><row><cell>Parents</cell><cell>2</cell><cell>3</cell><cell>2</cell><cell>3</cell><cell>2</cell><cell>3</cell><cell>-</cell></row><row><cell>Regular FP Rate</cell><cell>.005</cell><cell>.018</cell><cell>.098</cell><cell>.066</cell><cell>.093</cell><cell>.093</cell><cell>0</cell></row><row><cell>Irregular FP Rate</cell><cell>.036</cell><cell>.038</cell><cell>.082</cell><cell>.067</cell><cell>.039</cell><cell>.039</cell><cell>.02</cell></row><row><cell>Accuracy</cell><cell cols="6">97.94% 97.23% 91.01% 93.35% 93.40% 93.40%</cell><cell>98.98%</cell></row><row><cell>ROC Area</cell><cell>.997</cell><cell>.996</cell><cell>.976</cell><cell>.985</cell><cell>.988</cell><cell>.988</cell><cell>.999</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3 :</head><label>3</label><figDesc>TAN performance with different values for the alpha parameter Metric TAN -.7 alpha TAN -.5 alpha TAN -.3 alpha TAN -.1 alpha TAN -.05 alpha</figDesc><table><row><cell>Regular FP Rate</cell><cell>.001</cell><cell>0</cell><cell>0</cell><cell>0</cell><cell>0</cell></row><row><cell>Irregular FP Rate</cell><cell>.023</cell><cell>.020</cell><cell>.018</cell><cell>.016</cell><cell>.016</cell></row><row><cell>Accuracy</cell><cell>98.834%</cell><cell>98.984%</cell><cell>99.108%</cell><cell>99.197%</cell><cell>99.192%</cell></row><row><cell>ROC Area</cell><cell>.999</cell><cell>.999</cell><cell>.999</cell><cell>.999</cell><cell>.999</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4 :</head><label>4</label><figDesc>Confusion matrix of the best model, which was learned using TAN with .1 alpha</figDesc><table><row><cell></cell><cell cols="2">Predicted Regular Predicted Irregular</cell></row><row><cell>Is Regular</cell><cell>52,670</cell><cell>860</cell></row><row><cell>Is Irregular</cell><cell>0</cell><cell>53,544</cell></row></table></figure>
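The metrics reported for the best model can be recomputed directly from the confusion matrix in Table 4. A minimal sketch follows; the `confusion_metrics` helper is hypothetical, not from the paper.

```python
def confusion_metrics(matrix, labels=("regular", "irregular")):
    """Per-class false-positive rates and overall accuracy from a 2x2
    confusion matrix given as {actual_class: {predicted_class: count}}."""
    reg, irr = labels
    total = sum(matrix[a][p] for a in labels for p in labels)
    correct = sum(matrix[c][c] for c in labels)
    # FP rate for a class: fraction of the *other* class predicted as it
    fp_rate = {
        reg: matrix[irr][reg] / sum(matrix[irr].values()),
        irr: matrix[reg][irr] / sum(matrix[reg].values()),
    }
    return fp_rate, correct / total

# Counts from Table 4 (TAN classifier with .1 alpha)
table4 = {
    "regular": {"regular": 52670, "irregular": 860},
    "irregular": {"regular": 0, "irregular": 53544},
}
fp, acc = confusion_metrics(table4)
# fp["regular"] = 0, fp["irregular"] ~ .016, acc ~ 99.197%,
# matching the TAN -.1 alpha column of Table 3
```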
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.pac.gov.br/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">For more information about split purchases and case examples see http://guide.iacrc.org/potential-scheme-split-purchases/.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">Hill Climber uses a hill-climbing algorithm that adds, deletes, and reverses arcs. Unlike K2, the search is not restricted by an ordering on the variables. For more details see http://weka.sourceforge.net/doc.dev/weka/classifiers/bayes/net/search/global/HillClimber.html.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_3">The Tabu Search algorithm uses tabu search to find a well-scoring BN structure. For more details see<ref type="bibr" target="#b1">[2]</ref>.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_4">The TAN algorithm determines the maximum-weight spanning tree and returns a NB network augmented with that tree. For more details see<ref type="bibr" target="#b9">[10]</ref>.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgements</head><p>The authors would like to thank Prof. Daniel Barbará from George Mason University for providing useful insights for the development of this work. Finally, the authors would like to thank CGU for providing the resources necessary to work in this research, as well as for allowing its publication.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Fast algorithms for mining association rules in large databases</title>
		<author>
			<persName><forename type="first">R</forename><surname>Agrawal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Srikant</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">20th International Conference on Very Large Data Bases</title>
				<meeting><address><addrLine>Los Altos, CA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann</publisher>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="478" to="499" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">Bayesian Belief Networks: from Construction to Inference</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">R</forename><surname>Bouckaert</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1995">1995</date>
			<pubPlace>Utrecht, Netherlands</pubPlace>
		</imprint>
		<respStmt>
			<orgName>University of Utrecht</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">PhD thesis</note>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m" type="main">Bayesian network classifiers in Weka for version 3-5-7</title>
		<author>
			<persName><forename type="first">Remco</forename><forename type="middle">R</forename><surname>Bouckaert</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2008-05">May 2008</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Exploring early glaucoma and the visual field test: Classification and clustering using bayesian networks</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ceccon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">F</forename><surname>Garway-Heath</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Crabb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Tucker</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Journal of Biomedical and Health Informatics</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="1008" to="1014" />
			<date type="published" when="2014-05">May 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">CRISP-DM 1.0 step-by-step data mining guide</title>
		<author>
			<persName><forename type="first">Pete</forename><surname>Chapman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julian</forename><surname>Clinton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Randy</forename><surname>Kerber</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Khabaza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thomas</forename><surname>Reinartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Colin</forename><surname>Shearer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rudiger</forename><surname>Wirth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The CRISP-DM consortium</title>
				<imprint>
			<date type="published" when="2000-08">August 2000</date>
		</imprint>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">SMOTE: Synthetic minority over-sampling technique</title>
		<author>
			<persName><forename type="first">Nitesh</forename><forename type="middle">V</forename><surname>Chawla</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Kevin</forename><forename type="middle">W</forename><surname>Bowyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lawrence</forename><forename type="middle">O</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Philip</forename><surname>Kegelmeyer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="321" to="357" />
			<date type="published" when="2002">2002</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Comparing bayesian network classifiers</title>
		<author>
			<persName><forename type="first">Jie</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Russell</forename><surname>Greiner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI&apos;99</title>
				<meeting>the Fifteenth Conference on Uncertainty in Artificial Intelligence, UAI&apos;99<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="101" to="108" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">A bayesian method for constructing bayesian belief networks from databases</title>
		<author>
			<persName><forename type="first">Gregory</forename><forename type="middle">F</forename><surname>Cooper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edward</forename><surname>Herskovits</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, UAI&apos;91</title>
				<meeting>the Seventh Conference on Uncertainty in Artificial Intelligence, UAI&apos;91<address><addrLine>San Francisco, CA, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Morgan Kaufmann Publishers Inc</publisher>
			<date type="published" when="1991">1991</date>
			<biblScope unit="page" from="86" to="94" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A bayesian method for the induction of probabilistic networks from data</title>
		<author>
			<persName><forename type="first">Gregory</forename><forename type="middle">F</forename><surname>Cooper</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Edward</forename><surname>Herskovits</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
				<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="page" from="309" to="347" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Bayesian network classifiers</title>
		<author>
			<persName><forename type="first">Nir</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><surname>Geiger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moises</forename><surname>Goldszmidt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Learning</title>
				<imprint>
			<date type="published" when="1997-11">November 1997</date>
			<biblScope unit="volume">29</biblScope>
			<biblScope unit="page" from="131" to="163" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Building classifiers using bayesian networks</title>
		<author>
			<persName><forename type="first">Nir</forename><surname>Friedman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Moises</forename><surname>Goldszmidt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the national conference on artificial intelligence</title>
				<meeting>the national conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="1996">1996</date>
			<biblScope unit="page" from="1277" to="1284" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Bayesian network classifiers</title>
		<author>
			<persName><forename type="first">Moises</forename><surname>Goldszmidt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">James</forename><forename type="middle">J</forename><surname>Cochran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Louis</forename><forename type="middle">A</forename><surname>Cox</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Pinar</forename><surname>Keskinocak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeffrey</forename><forename type="middle">P</forename><surname>Kharoufeh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Cole</forename><surname>Smith</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Wiley Encyclopedia of Operations Research and Management Science</title>
				<imprint>
			<publisher>John Wiley &amp; Sons, Inc</publisher>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">The WEKA data mining software: An update</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Hall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eibe</forename><surname>Frank</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Geoffrey</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Bernhard</forename><surname>Pfahringer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Peter</forename><surname>Reutemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Ian</forename><forename type="middle">H</forename><surname>Witten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">SIGKDD Explor. Newsl</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="10" to="18" />
			<date type="published" when="2009-11">November 2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Integrating classification and association rule mining</title>
		<author>
			<persName><forename type="first">Bing</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Wynne</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yiming</forename><surname>Ma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Fourth International Conference on Knowledge Discovery and Data Mining</title>
				<imprint>
			<publisher>AAAI Press</publisher>
			<date type="published" when="1998">1998</date>
			<biblScope unit="page" from="80" to="86" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Licitação e Contrato Administrativo</title>
		<author>
			<persName><forename type="first">Hely</forename><surname>Meirelles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="s">Editora Revista dos Tribunais</title>
		<imprint>
			<biblScope unit="page" from="300" to="386" />
			<date type="published" when="1987">1987</date>
		</imprint>
	</monogr>
	<note>2nd edition, updated in accordance with the decreto-lei</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">A critical study of the brazilian procurement law</title>
		<author>
			<persName><forename type="first">Andre</forename><surname>Mueller</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1998">1998</date>
			<publisher>IBI -The Institute of Brazilian Business &amp; Public Management Issues</publisher>
			<pubPlace>Washington</pubPlace>
		</imprint>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">A bayesian approach to filtering junk e-mail</title>
		<author>
			<persName><forename type="first">Mehran</forename><surname>Sahami</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Susan</forename><surname>Dumais</surname></persName>
		</author>
		<author>
			<persName><forename type="first">David</forename><surname>Heckerman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eric</forename><surname>Horvitz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Learning for Text Categorization: Papers from the 1998 workshop</title>
				<imprint>
			<date type="published" when="1998">1998</date>
			<biblScope unit="volume">62</biblScope>
			<biblScope unit="page" from="98" to="105" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The defect identification of LED chips based on bayesian classifier</title>
		<author>
			<persName><forename type="first">Wei</forename><surname>Shi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yao</forename><surname>Wu Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Liang</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jian Guo</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Shao</forename><surname>Qing Ren</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Mechanics and Materials</title>
		<imprint>
			<biblScope unit="page" from="1564" to="1568" />
			<date type="published" when="2013-07">July 2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">CRISP-DM: Towards a standard process model for data mining</title>
		<author>
			<persName><forename type="first">Rüdiger</forename><surname>Wirth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining</title>
				<meeting>the Fourth International Conference on the Practical Application of Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2000">2000</date>
			<biblScope unit="page" from="29" to="39" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Influenza detection from emergency department reports using natural language processing and bayesian network classifiers</title>
		<author>
			<persName><forename type="first">Ye</forename><surname>Ye</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Fuchiang</forename><surname>Tsui</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Michael</forename><surname>Wagner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jeremy</forename><forename type="middle">U</forename><surname>Espino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Qi</forename><surname>Li</surname></persName>
		</author>
		<idno>amiajnl-2013-001934</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of the American Medical Informatics Association</title>
		<imprint>
			<date type="published" when="2014-01">January 2014</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
