=Paper=
{{Paper
|id=Vol-1218/bmaw2014_paper_7
|storemode=property
|title=Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil
|pdfUrl=https://ceur-ws.org/Vol-1218/bmaw2014_paper_7.pdf
|volume=Vol-1218
|dblpUrl=https://dblp.org/rec/conf/uai/CarvalhoSRM14
}}
==Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil==
Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil
Rommel N. Carvalho⇤, Leonardo J. Sales, Henrique A. da Rocha, and Gilson L. Mendes
Department of Research and Strategic Information (DIE)
Brazil’s Office of the Comptroller General (CGU)
SAS, Quadra 01, Bloco A, Edifı́cio Darcy Ribeiro
Brası́lia, DF, Brazil
Abstract A primary responsibility of CGU is to prevent and detect
government corruption. To carry out this mission, CGU
To cope with society’s demand for transparency must gather information from a variety of sources and com-
and corruption prevention, the Brazilian Office of bine it to evaluate whether further action, such as an inves-
the Comptroller General (CGU) has carried out tigation, is required. One of the most difficult challenges is
a number of actions, including: awareness cam- the information explosion. Auditors must fuse vast quanti-
paigns aimed at the private sector; campaigns ties of information from a variety of sources in a way that
to educate the public; research initiatives; and highlights its relevance to decision makers and helps them
regular inspections and audits of municipalities focus their efforts on the most critical cases. This is no triv-
and states. Although CGU has collected infor- ial duty. The Growing Acceleration Program (PAC) alone
mation from various different sources – Rev- has a budget greater than 250 billion dollars with more than
enue Agency, Federal Police, and others –, going one thousand projects only in the state of Sao Paulo 1 . All
through all the data in order to find suspicious of these have to be audited and inspected by CGU – and, in
transactions has proven to be really challenging. spite of having only three thousand employees. Therefore,
In this paper, we present a Data Mining study ap- CGU must optimize its processes in order to carry out its
plied on real data – government purchases – for mission.
finding transactions that might become irregular In Brazil, all contracts with the private sector must be in
before they are considered as such in order to act accordance with the Law N 8,666/93, also known as the
proactively. Moreover, we compare the perfor- national Procurement Law. According to [15] procurement
mance of various Bayesian Network (BN) learn- is the administrative procedure by which the Public Ad-
ing algorithms with different parameters in order ministration selects the most advantageous proposal for a
to fine tune the learned models and improve their contract in its interest. From the former definition, the con-
performance. The best result was obtained us- clusion is that the public interest must always be the objec-
ing the Tree Augmented Network (TAN) algo- tive of the procedure. In terms of purchasing with the use
rithm and oversampling the minority class in or- of public money, this means that not only must the winner
der to balance the data set. Using a 10-fold cross- of the procurement process be the best supplier in terms of
validation, the model correctly classified all split the price of the good or service supplied, but also in terms
purchases, it obtained a ROC area of .999, and its of other objectives of the procurement process.
accuracy was 99.197%.
Corruption can happen in many ways in Public Procure-
ments [16]. The public agent may favor a specific supplier
1 Introduction that she happens to know. She may receive, from the bid-
der, a financial compensation for awarding a contract to that
The Brazilian Office of the Comptroller General (CGU) is firm. Bidders may collude as to set the results of the pro-
the Brazilian central body of the internal control system of curement. The whole process is susceptible to many forms
the federal executive branch. It has, among its responsi- of corruption, from within and outside the public adminis-
bilities, the task of inspecting and auditing the Brazilian tration.
Government projects and programs with respect to their le- The government purchases large quantities of goods and
gality, results, efficacy and efficiency. services. It is also the sole purchaser for some goods, such
⇤
Department of Computer Science (CIC), University of
1
Brası́lia (UnB), Brası́lia, DF, Brazil http://www.pac.gov.br/
70
as hydraulic turbines for large damns. The government The process model provided by the consortium CRISP-DM
spends large quantities of money in the market and is a can be summarized through the life cycle of the data mining
guaranteed payer. Hence, many firms have a strong inter- process presented in Figure 1, reproduced from [5]. The
est in negotiating with the public administration. There is a outer circle symbolizes the cyclical nature of the process of
temptation for many suppliers to cheat in the procurement data mining. Data mining does not end when a solution is
process to find means of being awarded a lucrative govern- achieved. In fact, the lessons learned during a process can
ment contract. be useful in subsequent processes.
The Brazilian Procurement Law has as one of its main ob-
jectives the curbing of corruption in public purchasing and
contracting. Concern over corruption is evident in Brazil’s
daily press. There are frequent accusations of public ad-
ministrators who did not abide by the procurement rules,
and are accused of favoring a certain supplier or worse, re-
ceiving a payment in the process.
When writing the law, legislators included many articles
that established penalties for firms or/and public legisla-
tors caught in corruption activities. There are two types of
penalties stated in Law N 8,666/93 dealing with this sub-
ject. They are administrative actions and penal actions.
Since enforcing the law is difficult [16], legislators will
have to find another manner to prevent corruption in public
procurement. The question is one of preventing corruption
practices, against one of punishing the ones that have al-
ready happened.
Therefore, the main objective of this project is to apply
Data Mining methods to create models, more specifically Figure 1: Phases of the CRISP-DM Process Model
to learn Bayesian Network models, that will aid the experts
in identifying procurement frauds and, if possible, identify The life cycle of a data mining project consists of six
suspicious transactions as soon as possible in order to pre- phases. The sequence of the phases is not strict. Moving
vent the fraud from happening. back and forth between different phases is often required.
The arrows indicate the most important and frequent de-
The structure of this paper is as follows: Section 2 presents pendencies between phases. The phases are [19]:
the CRoss Industry Standard Process for Data Mining
(CRISP-DM) methodology used in this work. Section 3 Business Understanding: This initial phase fo-
presents the data used and its underlying semantics. Sec- cuses on understanding the project objec-
tion 4 presets the models learned as well as their evaluation. tives and requirements from a business per-
Section 5 presents ideas on how the learned model for iden- spective, and then converting this knowl-
tifying split purchases will be deployed. Finally, Section 6 edge into a data mining problem definition,
presents the conclusion and future work. and a preliminary project plan designed to
achieve the objectives.
Data Understanding: The data understanding
2 Methodology phase starts with an initial data collection
and proceeds with activities in order to get
The CRoss Industry Standard Process for Data Min- familiar with the data, to identify data qual-
ing (CRISP-DM) project developed an industry and tool- ity problems, to discover first insights into
neutral data mining process model. Starting from the the data, or to detect interesting subsets
embryonic knowledge discovery processes used in early to form hypotheses for hidden information.
data mining projects and responding directly to user re- There is a close link between Business Un-
quirements, this project defined and validated a data min- derstanding and Data Understanding. The
ing process that is applicable in diverse industry sectors. formulation of the data mining problem and
This methodology makes large data mining projects faster, the project plan require at least some under-
cheaper, more reliable, and more manageable. Even small- standing of the available data.
scale data mining investigations benefit from using CRISP- Data Preparation: The data preparation phase
DM. covers all activities to construct the final
71
data set (data that will be fed into the mod- merged data from different databases (DB) available in dif-
eling tool(s)) from the initial raw data. Data ferent agencies in Brazil.
preparation tasks are likely to be performed
CGU has been working closely with subject matter experts
multiple times, and not in any prescribed
(SMEs) in order to identify certain fraud topologies, such
order. Tasks include table, record, and at-
as:
tribute selection, data cleaning, construction
of new attributes, and transformation of data
for modeling tools. 1. Identify if owners from different companies are ac-
tually partners. In the public procurement we expect
Modeling: In this phase, various modeling tech-
competition, but if the only two enterprises participat-
niques are selected and applied, and their
ing in the procurement have owners that are partners,
parameters are calibrated to optimal values.
then it is obvious that there is no competition at all, so
Typically, there are several techniques for
this should be prevented/identified.
the same data mining problem type. Some
techniques require specific data formats. 2. In Brazil if the contract is small enough (8 thousand
Evaluation: At this stage in the project you have reais, or roughly around 4 thousand dollars), then
built one or more models that appear to have there is no need to actually go through the whole pro-
high quality, from a data analysis perspec- curement process. However, sometimes, they break
tive. Before proceeding to final deployment down a big contract in a lot of small ones. This is
of the model, it is important to more thor- called split purchase2 and it is not allowable by law
oughly evaluate the model, and review the and should be prevented/identified. E.g., identify a
steps executed to construct the model, to be lot of contracts that sums up to more than the thresh-
certain it properly achieves the business ob- old with the same objective (buying computers, for in-
jectives. A key objective is to determine if stance) in a week or a month.
there is some important business issue that
has not been sufficiently considered. At the There are actually more than 20 typologies like these
end of this phase, a decision on the use of two described by the SMEs that we would like to iden-
the data mining results should be reached. tify/predict using the data set we have. However, this
Deployment: Creation of the model is gener- project focuses on the second one described above.
ally not the end of the project. Usually, the First the data set was loaded in Excel, then we re-
knowledge gained will need to be organized moved some characters (comma, double quotes, and single
and presented in a way that the customer quotes), because they were being a problem when loading
can use it. Depending on the requirements, the file in Weka (see [13] for details about Weka). Then we
the deployment phase can be as simple as replaced values like NA, -9, -8, and 0 by “?” to represent a
generating a report or as complex as imple- missing value in Weka.
menting a repeatable data mining process.
In many cases it will be the user, not the data The initial data set had 42 attributes and 70,365 transac-
analyst, who will carry out the deployment tions. From Excel we were able to bring that number down
steps. In any case, it is important to under- to 26. Most of the attributes removed were identification of
stand up front what actions will need to be descriptions that were also available as a different attribute.
carried out in order to actually make use of We decided to keep the description because it is easier to
the created models. understand what it means. For instance, we kept the name
of the state instead of its ID.
The Business Understanding phase was already covered in The next step was to save the data as a CSV file and load
the introduction section. The Data Understanding phase it in Weka. Then we changed the year to nominal and re-
will not be covered in detail since the data and its struc- moved rows with missing values in the final price attribute,
ture is confidential. The following sections will cover the since we will need a value later on to identify the class.
Data Understanding, Data Preparation, Modeling, Evalua-
Before running any algorithm we started looking at the data
tion, and Deployment phases.
trying to understand it better. The first thing that we no-
ticed was the range of the values. There were values from
3 Data Understanding and Preparation cents to hundreds of trillions of dollars. Besides that, there
were cases where the proposed unit price was 4 thousand
The data set used in this work is related to procurements 2
For more information about split purchases and
and contracts of IT services in the Brazilian Federal Gov- case examples see http://guide.iacrc.org/
ernment from 2005 to 2010. This data set is actually potential-scheme-split-purchases/.
72
whereas the final actual price was a dollar. Those are typ- 20]), we decided to experiment with different BN learning
ical cases of data that should not be trusted or considered algorithms in order to classify the transactions as regular or
incorrect. not, in order to identify split purchases as soon as possible,
without having to wait for them to actually be confirmed as
The number of cases with a really high value is small
such (before having several purchases summing up to more
so they should be carefully analyzed before being thrown
than 8 thousand reais). We ran all classifiers with 10-fold
away, just to make sure they are not outliers, but actually in-
cross-validation.
consistent data. For that, we got in contact with the SMEs
at CGU and asked them which ones should be ignored. At first we compared the results of Naı̈ve Bayes versus
With their consent we removed those transactions that were Bayesian Network models. Since the data is unbalanced
considered noise. The transactions that could be outliers with 50,911 regular cases and 2,626 irregular cases, we de-
are being analyzed by the experts in more detail. cided to resample the data.
Since we are interested in transactions that sum to 8,000 in Table 1 presents the performance of the Naı̈ve Bayes
a given period of time (see typology 2 above), we computed and Bayesian Network classifiers using 10-fold cross-
using R3 the transactions that involved the same institutions validation. We learned models without resampling the data
on the same month and year that added up to more than set, oversampling the minority class, undersampling the
8,000, then we flagged them as irregular transactions4 (i.e., majority class, and both oversampling the minority class
split purchases). and undersampling the majority class. The metrics used
to compare the models are regular false positive (FP) rate,
4 Modeling and Evaluation irregular FP rate, accuracy, and Receiver Operating Char-
acteristic (ROC) area.
The main objective of this work is to try to classify, without The standard algorithm used in Weka for BN is K2. K2
looking at the sums, which transactions should be consid- uses a hill climbing algorithm restricted by an order on
ered suspicious. The reason for that is to avoid waiting for the variables. For more information on K2 see [8, 9].
the problem to occur (having several transactions that add The parameters used for learning the BN model were all
to more than 8,000 reais) and to identify the problem as the default values in Weka expect for the maximum num-
soon as possible. ber of parents allowed, which we changed from 1 to 2.
The estimator parameter used, which is independent
Before being able to classify we had to discretize the value
of which algorithm we use to learn the model, was the
of the transactions, since we decide to work with learning
SimpleEstimator. SimpleEstimator is used for
algorithms that do not support continuous values. We used
estimating the conditional probability tables of a BN once
equal frequency to generate 11 different bins. The num-
the structure has been learned. It estimates probabilities di-
ber of bins could be different. The only thing we kept in
rectly from data. The only configurable parameter for this
mind was that we did not want it to be binary, nor to have
estimator is alpha. Alpha is used for estimating the prob-
too many bins to the point that we would have just a few
ability tables and can be interpreted as the initial count on
transactions in each bin.
each value. The default value used for alpha is .5. As
Since Bayesian Networks (BNs) have been successfully explained in [3], the options for the K2 algorithm in Weka
used in classification problems (e.g., see [4,7,10–12,17,18, are:
3
R is a free software environment for statistical comput-
ing and graphics. For more information see http://www. initAsNaiveBayes – When set to true (de-
r-project.org/. fault), the initial network used for structure
4
Usually we also consider the type of service/product con- learning is a Naı̈ve Bayes Network, that is,
tracted/bought. However, since we limited the scope of our anal- a network with an arrow from the classifier
ysis to only computer purchases, all 10 different types of ser-
node to each other node. When set to false,
vice/product are too broad and really similar (they all basically
say that these are computer related purchases without any signifi- an empty network is used as initial network
cant difference between each other). Therefore, we can safely ig- structure.
nore this field when defining split purchases, since they do not add markovBlanketClassifier – When set
any more useful information. To avoid comparing purchases of to true (default is false), after a network
different nature, we use the contractor field, since a company that
provides software development does not usually sell computer pe- structure is learned a Markov Blanket cor-
ripherals. Unfortunately, we do not have any other structured field rection is applied to the network structure.
that we could use to improve this proxy classification. Of course This ensures that all nodes in the network
we might have cases that purchases identified as split purchases are part of the Markov blanket of the classi-
are, in fact, false positives. However, our current alert system fier node.
would also identify those purchases as split purchases. What we
are trying to do is to identify those ”split purchases” as soon as maxNrOfParents – Set the maximum num-
possible, instead of waiting for the whole month cycle. ber of parents a node in the Bayes net can
73
have. When initialized as Naı̈ve Bayes, set- as explained before, the main objective is to obtain low reg-
ting this parameter to 1 results in a Naı̈ve ular FP rate and not necessarily high accuracy, which could
Bayes classifier. When set to 2, a Tree Aug- only mean that we are getting just the majority class right.
mented Bayes Network (TAN) is learned,
As BN gave the best result, we decided to try different
and when set to greater than 2, a Bayes
search algorithms. The algorithms used were K2 (again
Net Augmented Bayes Network (BAN) is
using the default parameters), Hill Climber5 , Tabu Search6 ,
learned. By setting it to a value much larger
and Tree Augmented Network7 (TAN).
than the number of nodes in the network
(the default of 100000 pretty much guaran- The options for the Hill Climber algorithm in Weka
tees this), no restriction on the number of are the same as K2, except that it does not have
parents is enforced. the parameter randomOrder. Besides, we have the
randomOrder – When set to true, the order of useArcReversal, which when set to true (default is
the nodes in the network is random. Default false), the arc reversal operation is used in the search. As
random order is false and the order of the in K2, we have used all default parameters, except for the
nodes in the data set is used. In any case, maximum number of parents and the score metric (we used
when the network was initialized as Naı̈ve MDL instead of Bayes).
Bayes Network, the class variable is first in The Tabu Search algorithm in Weka has the same options
the ordering though. as Hill Climber. Besides that, it also has runs, which sets
scoreType – The score type determines the the number of steps to be performed (default is 10), and
measure used to judge the quality of a tabuList, which sets the length of the tabu list (default is
network structure. It can be one of 5). As in both K2 and Hill Climber, we have used all default
Bayes (default), BDeu, Minimum Descrip- parameters, except for the maximum number of parents and
tion Length (MDL), Akaike Information the score metric (we used MDL instead of Bayes).
Criterion (AIC), and Entropy.
The TAN algorithm in Weka only has the
When oversampling, we used the sample size of 200% to markovBlanketClassifier and scoreType
avoid removing known cases of the majority class, which is options with the same values as K2. We have used the
basically the oversampling of the minority class. The goal default parameters. Note that the TAN algorithm does
was to increase the irregular cases to match the number of not allow the configuration in the maximum number of
regular ones. parents, since it is always 2.
We can see that we got much better results with Bayesian Table 2 shows a summary of the BN classifiers perfor-
Network, when oversampling the minority class. More- mance using different search algorithms with oversampled
over, the false positive rate for the regular cases, which is data. It is worth noting that we were only able to use the
our main concern since it represents the number of irregular default Bayes metric with the K2 and TAN algorithms. We
transactions that were classified as regular, has dropped sig- had to use MDL metric in order to avoid out of memory
nificantly after oversampling (from .363 to .140 for Naı̈ve error with the Hill Climbing and Tabu Search algorithms.
Bayes and from .452 to .005 for Bayesian Network). As This difference might be the reason why K2 and TAN out-
a result, the FP rate for the irregular cases increased while performs both Hill Climbing and Tabu Search, since we
the accuracy decreased for NB and slightly increased for have tried both K2 and TAN using MDL metric and we did
BN. Nevertheless, since we are more interested in correctly obtain worse results. As it can be seen, TAN search algo-
classifying the irregular cases, the result with oversampling rithm had the best performance with a ROC area of .999.
is better. Since oversampling resulted in very good results, we de-
When undersampling, we decided to use the sample size cided to also try Synthetic Minority Oversampling TEch-
of (2, 626/50, 911) ⇤ 2 ⇡ 10%. Finally, we tried doing nique (SMOTE) [6] to oversample the minority class. We
both oversampling and undersampling by keeping the sam- used 1800% to get roughly the same number of transac-
ple size at 100%.
5
Hill Climber uses a hill climbing algorithm adding, deleting
As it can be seen from Table 1, oversampling the minority and reversing arcs. The search is not restricted by an order on
class gave us the best results both in accuracy and in ROC the variables (unlike K2). For more details see http://weka.
Area. Besides that, we did have a significant decrease in sourceforge.net/doc.dev/weka/classifiers/
regular FP rate, which is our main concern. The accuracy bayes/net/search/global/HillClimber.html.
6
Tabu Search algorithm uses tabu search for finding a well
without resampling is higher than most of the other results
scoring BN structure. For more details see [2].
with resampling due to the fact that we had an increase in ir- 7
TAN algorithm determines the maximum weight spanning
regular FP rate and since this is the majority class the num- tree and returns a NB network augmented with a tree. For more
ber of correctly classified cases is much larger. However, details see [10].
74
Table 1: Evaluation of Naı̈ve Bayes (NB) and Bayesian Network (BN) classifiers without resampling, with oversampling,
with undersampling, and with both over and undersampling
Metric NB BN NB - Over BN - Over NB - Under BN - Under NB - Both BN - Both
Regular FP Rate .363 .452 .140 .005 .169 .068 .146 .012
Irregular FP Rate .028 .003 .048 .036 .082 .054 .054 .045
Accuracy 95.6% 97.5% 90.6% 97.9% 87.4% 90.6% 90.0% 97.1%
ROC Area .931 .973 .975 .997 .945 .968 .971 .996
Table 2: BN classifiers performance with oversampling and different number of parents
Metric BN - K2 BN - HC BN - Tabu BN - TAN
Parents 2 3 2 3 2 3 -
Regular FP Rate .005 .018 .098 .066 .093 .093 0
Irregular FP Rate .036 .038 .082 .067 .039 .039 .02
Accuracy 97.94% 97.23% 91.01% 93.35% 93.40% 93.40% 98.98%
ROC Area .997 .996 .976 .985 .988 .988 .999
tions in both classes. After we oversampled the data, we using Apriori algorithm [1, 14]. In the first run we used .1
discretized it as we did previously (11 bins with equal fre- as lower bound min support, 1 as upper bound min sup-
quency). However, even though the SMOTE oversampling port, lift as the metric type, 1.1 as min metric, and 50 as
gave better results for most metrics, the standard oversam- number of rules. For more information on each of these
pling method outperformed the SMOTE method on the reg- parameters see http://weka.sourceforge.net/
ular FP rate, which is the one that we are most interested doc.dev/weka/associations/Apriori.html.
in. Therefore, we do not present the details of the results
The results were related to variables that were not of in-
here.
terest and even though they had a high confidence (1) and
Since TAN algorithm presented the best results, we decided high lift (1.82) they were not describing anything that we
to try different values for the alpha parameter in order to did not already know. So we decided to remove those vari-
improve its performance. Table 3 presents performance for ables from the data set and run the algorithm again to see
the alpha values .7, .5, .3, .1, and .05. On the one hand, the new rules it would generate.
when we increased the default value of .5 to .7, the per-
We analyzed all 50 new rules generated and they still did
formance slightly worsened. On the other hand, when we
not provide any insightful information. Besides that, these
decreased the default value we kept getting slightly better
new rules were not as “strong” as the previous ones in the
results up to .05, when the performance started worsening
sense that the best confidence was .62 and the best lift was
again.
1.11.
Table 4 presents the confusion matrix for the best model we
The reason for not presenting the rules is two-fold. The first
obtained, which was learned using TAN with the value .1
is because, as previously explained, they did not provide
for the alpha parameter. As it can be seen, every single
any insightful information. The second is because they are
split purchase (irregular) was correctly classified. Although
confidential and we cannot discuss the contents of the rules.
the regular FP rate in Table 3 suggests that alpha values
All that can be said is that they were not useful.
of .5, .3, and .05 also classified all irregular instances cor-
rectly, the model with .5 alpha missclassified 7 instances Finally, we analyzed the BN that had the best evaluation
(the regular FP rate is shown as 0 due to rounding, since results (BN – TAN with .1 alpha – standard oversampled
the value is actually .00013). Besides that, even though the data set) to try to better understand the relations between
irregular FP rate is the same for the models with .1 and the variables in order to comprehend why it is getting such
.05 alpha, the model with the .05 alpha missclassified 5 in- good results. This could provide insightful information that
stances more than the model with .1 alpha. we were not able to extract from association rules.
In order to try to better understand the relationship be- By analyzing the structure of the BN we were able to iden-
tween the variables, we generated some association rules tify some interesting relations associated to the government
75
Table 3: TAN performance with different values for the alpha parameter
Metric TAN - .7 alpha TAN - .5 alpha TAN - .3 alpha TAN - .1 alpha TAN - .05 alpha
Regular FP Rate .001 0 0 0 0
Irregular FP Rate .023 .020 .018 .016 .016
Accuracy 98.834% 98.984% 99.108% 99.197% 99.192%
ROC Area .999 .999 .999 .999 .999
Table 4: Confusion matrix of the best model, which was learned using TAN with .1 alpha
Predicted Regular Predicted Irregular
Is Regular 52,670 860
Is Irregular 0 53,544
office responsible for the procurement and also to the win- fier is that as we warn/teach the different offices in Brazil
ner of the procurements. However, due to the confiden- about this type of fraud, it is expected that they will de-
tiality nature of the problem we are not allowed to discuss crease the number of irregular transactions associated to
these relations further. Nevertheless, we can comment on this type of fraud. Therefore, it is important to keep track
the fact that to better understand these relations a more de- of these numbers in order to evaluate if this expected be-
tailed study is necessary by looking at the conditional prob- havior is actually happening and at what rate.
ability tables on these relations, which was not done in this
Besides that, as people learn and change their behavior, it is
work due to time constraints.
also expected that our classifier will also need some modifi-
cations to cope with these changes. Therefore, it should be
5 Deployment expected that every month or so a new classifier should be
trained and evaluated to learn the new behavior and analyze
its performance.
As it was explained before, the main goal is to identify sus-
picious transactions as soon as possible to avoid the occur-
rence of split purchases. 6 Conclusion
Split purchases in the data set we worked on happen when
a procurement that should be done just once with a value This project shows how Data Mining, BN learning more
higher than 8,000 is broken down in smaller ones to avoid specifically, can be used to identify and prevent split pur-
the normal procurement procedure, allowing the office to chases in Brazil using real data provided by the Brazil’s
hire an enterprise directly. Office of the Comptroller General (CGU).
So, the idea is that we want to detect one or more of these At first, the idea was to allow the identification of a few
procurements, but that still does not sum up to more than different types of irregularities. However, as we started
8,000, as soon as they become suspicious (using the classi- working with the data and creating the different models, it
fier) and act proactively. This will allow a faster response became clear that this is a great effort and after discussing
and prevent irregularities in the first place. This means that with stakeholders at CGU, we decided to prioritize depth
instead of waiting a whole month to find an irregular trans- instead of breadth. In other words, it was decided that it
action, we will be finding suspicious ones (not necessarily was best to choose one type of fraud and do a thoroughly
irregular ones) and warn/educate/teach the people involved job to develop a model that is good enough to actually de-
that they should be careful, since breaking down big pro- ploy it and analyze the results before looking at other types
curements in small ones is not allowed. of frauds.
Hopefully, this will make them think twice before continu- Fortunately, the results paid off in the sense that we were
ing with these kinds of transactions, if they are really think- able to generate a model that correctly classified all split
ing about doing irregular transactions on purpose. And at purchases and a really high ROC area (.999). The next step
the same time this will educate those that are not aware they is to deploy this classifier in CGUs intelligence system and
cannot do these kinds of transactions by law. analyze its impact.
Another thing to keep in mind when deploying this classi- As future work, it is necessary to better understand why
76
the model is capable of getting such good results. By an- minority over-sampling technique. Journal of Artifi-
alyzing the structure of the BNs generated, it is possible cial Intelligence Research, 16:321357, 2002.
to understand relations between variables that were previ-
[7] Jie Cheng and Russell Greiner. Comparing bayesian
ously unknown. A brief analysis showed that some inter-
network classifiers. In Proceedings of the Fifteenth
esting relations were found, but a more detailed analysis is
Conference on Uncertainty in Artificial Intelligence,
needed.
UAI’99, page 101108, San Francisco, CA, USA,
Even though we have used resampling for dealing with the 1999. Morgan Kaufmann Publishers Inc.
rare event of split purchases and cross-validation to avoid
[8] Gregory F. Cooper and Edward Herskovits. A
overfitting, we might have not completely ruled out the pos-
bayesian method for constructing bayesian belief net-
sibility that this model might be slightly overfitted. There-
works from databases. In Proceedings of the Seventh
fore, before deploying the model, we will gather new data
Conference on Uncertainty in Artificial Intelligence,
from recent purchases in order to test the performance of
UAI’91, page 8694, San Francisco, CA, USA, 1991.
our model with new unseen data. Due to time constraints,
Morgan Kaufmann Publishers Inc.
we have not been able to gather the new data and perform
this test yet. [9] GregoryF. Cooper and Edward Herskovits. A
Besides that, there are a lot of different types of frauds that bayesian method for the induction of probabilistic
were not analyzed. So, a starting point for future work is networks from data. Machine Learning, 9(4):309–
to do the same kind of analysis done in this project with 347, 1992.
different types of fraud. [10] Nir Friedman, Dan Geiger, and Moises Goldszmidt.
Bayesian network classifiers. Machine Learning,
Acknowledgements 29(2-3):131–163, November 1997.
The authors would like to thank Prof. Daniel Barbará from [11] Nir Friedman and Moises Goldszmidt. Building clas-
George Mason University for providing useful insights for sifiers using bayesian networks. In Proceedings of
the development of this work. Finally, the authors would the national conference on artificial intelligence, page
like to thank CGU for providing the resources necessary to 12771284, 1996.
work in this research, as well as for allowing its publication. [12] Moises Goldszmidt, James J. Cochran, Louis A. Cox,
Pinar Keskinocak, Jeffrey P. Kharoufeh, and J. Cole
References Smith. Bayesian network classifiers. In Wiley En-
cyclopedia of Operations Research and Management
[1] R. Agrawal and R. Srikant. Fast algorithms for min- Science. John Wiley & Sons, Inc., 2010.
ing association rules in large databases. In 20th Inter-
[13] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
national Conference on Very Large Data Bases, pages
Pfahringer, Peter Reutemann, and Ian H. Witten. The
478–499. Morgan Kaufmann, Los Altos, CA, 1994.
WEKA data mining software: An update. SIGKDD
[2] R. R. Bouckaert. Bayesian Belief Networks: from Explor. Newsl., 11(1):1018, November 2009.
Construction to Inference. PhD thesis, University of [14] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating
Utrecht, Utrecht, Netherlands, 1995. classification and association rule mining. In Fourth
[3] Remco R. Bouckaert. Bayesian network classiers in International Conference on Knowledge Discovery
weka for version 3-5-7. Technical report, May 2008. and Data Mining, pages 80–86. AAAI Press, 1998.
[15] Hely Meirelles. Licitacao e Contrato Administra-
[4] S. Ceccon, D.F. Garway-Heath, D.P. Crabb, and tivo. Editora Revista dos Tribunais, Sao Paulo, Brazil,
A. Tucker. Exploring early glaucoma and the visual 7a. ed., atualizada pelo decreto-lei 2,300/86. edition,
field test: Classification and clustering using bayesian 1987.
networks. IEEE Journal of Biomedical and Health In-
formatics, 18(3):1008–1014, May 2014. [16] Andre Mueller. A critical study of the brazilian pro-
curement law. Technical report, IBI - The Institute
[5] Pete Chapman, Julian Clinton, Randy Kerber, of Brazilian Business & Public Management Issues,
Thomas Khabaza, Thomas Reinartz, Colin Shearer, Washington, 1998.
and Rudiger Wirth. CRISP-DM 1.0 step-by-step data
mining guide. Technical report, The CRISP-DM con- [17] Mehran Sahami, Susan Dumais, David Heckerman,
sortium, August 2000. and Eric Horvitz. A bayesian approach to filtering
junk e-mail. In Learning for Text Categorization:
[6] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Papers from the 1998 workshop, volume 62, page
Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic 98105, 1998.
77
[18] Wei Shi, Yao Wu Pei, Liang Sun, Jian Guo Wang,
and Shao Qing Ren. The defect identification of LED
chips based on bayesian classifier. Applied Mechanics
and Materials, 333-335:1564–1568, July 2013.
[19] Rdiger Wirth. CRISP-DM: Towards a standard pro-
cess model for data mining. In Proceedings of the
Fourth International Conference on the Practical Ap-
plication of Knowledge Discovery and Data Mining,
page 2939, 2000.
[20] Ye Ye, Fuchiang (Rich) Tsui, Michael Wagner,
Jeremy U. Espino, and Qi Li. Influenza detection
from emergency department reports using natural lan-
guage processing and bayesian network classifiers.
Journal of the American Medical Informatics Asso-
ciation, pages amiajnl–2013–001934, January 2014.
78