=Paper=
{{Paper
|id=Vol-1218/bmaw2014_paper_7
|storemode=property
|title=Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil
|pdfUrl=https://ceur-ws.org/Vol-1218/bmaw2014_paper_7.pdf
|volume=Vol-1218
|dblpUrl=https://dblp.org/rec/conf/uai/CarvalhoSRM14
}}
==Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil==
<pdf width="1500px">https://ceur-ws.org/Vol-1218/bmaw2014_paper_7.pdf</pdf>
<pre>
    Using Bayesian Networks to Identify and Prevent Split Purchases in Brazil


               Rommel N. Carvalho⇤, Leonardo J. Sales, Henrique A. da Rocha, and Gilson L. Mendes
                             Department of Research and Strategic Information (DIE)
                                Brazil’s Office of the Comptroller General (CGU)
                                SAS, Quadra 01, Bloco A, Edifı́cio Darcy Ribeiro
                                                Brası́lia, DF, Brazil


                         Abstract                                A primary responsibility of CGU is to prevent and detect
                                                                 government corruption. To carry out this mission, CGU
     To cope with society’s demand for transparency              must gather information from a variety of sources and com-
     and corruption prevention, the Brazilian Office of          bine it to evaluate whether further action, such as an inves-
     the Comptroller General (CGU) has carried out               tigation, is required. One of the most difficult challenges is
     a number of actions, including: awareness cam-              the information explosion. Auditors must fuse vast quanti-
     paigns aimed at the private sector; campaigns               ties of information from a variety of sources in a way that
     to educate the public; research initiatives; and            highlights its relevance to decision makers and helps them
     regular inspections and audits of municipalities            focus their efforts on the most critical cases. This is no triv-
     and states. Although CGU has collected infor-               ial duty. The Growing Acceleration Program (PAC) alone
     mation from various different sources – Rev-                has a budget greater than 250 billion dollars with more than
     enue Agency, Federal Police, and others –, going            one thousand projects only in the state of Sao Paulo 1 . All
     through all the data in order to find suspicious            of these have to be audited and inspected by CGU – and, in
     transactions has proven to be really challenging.           spite of having only three thousand employees. Therefore,
     In this paper, we present a Data Mining study ap-           CGU must optimize its processes in order to carry out its
     plied on real data – government purchases – for             mission.
     finding transactions that might become irregular            In Brazil, all contracts with the private sector must be in
     before they are considered as such in order to act          accordance with the Law N 8,666/93, also known as the
     proactively. Moreover, we compare the perfor-               national Procurement Law. According to [15] procurement
     mance of various Bayesian Network (BN) learn-               is the administrative procedure by which the Public Ad-
     ing algorithms with different parameters in order           ministration selects the most advantageous proposal for a
     to fine tune the learned models and improve their           contract in its interest. From the former definition, the con-
     performance. The best result was obtained us-               clusion is that the public interest must always be the objec-
     ing the Tree Augmented Network (TAN) algo-                  tive of the procedure. In terms of purchasing with the use
     rithm and oversampling the minority class in or-            of public money, this means that not only must the winner
     der to balance the data set. Using a 10-fold cross-         of the procurement process be the best supplier in terms of
     validation, the model correctly classified all split        the price of the good or service supplied, but also in terms
     purchases, it obtained a ROC area of .999, and its          of other objectives of the procurement process.
     accuracy was 99.197%.
                                                                 Corruption can happen in many ways in Public Procure-
                                                                 ments [16]. The public agent may favor a specific supplier
1   Introduction                                                 that she happens to know. She may receive, from the bid-
                                                                 der, a financial compensation for awarding a contract to that
The Brazilian Office of the Comptroller General (CGU) is         firm. Bidders may collude as to set the results of the pro-
the Brazilian central body of the internal control system of     curement. The whole process is susceptible to many forms
the federal executive branch. It has, among its responsi-        of corruption, from within and outside the public adminis-
bilities, the task of inspecting and auditing the Brazilian      tration.
Government projects and programs with respect to their le-       The government purchases large quantities of goods and
gality, results, efficacy and efficiency.                        services. It is also the sole purchaser for some goods, such
    ⇤
    Department of Computer Science (CIC), University of
                                                                    1
Brası́lia (UnB), Brası́lia, DF, Brazil                                  http://www.pac.gov.br/

                                                            70
as hydraulic turbines for large damns. The government            The process model provided by the consortium CRISP-DM
spends large quantities of money in the market and is a          can be summarized through the life cycle of the data mining
guaranteed payer. Hence, many firms have a strong inter-         process presented in Figure 1, reproduced from [5]. The
est in negotiating with the public administration. There is a    outer circle symbolizes the cyclical nature of the process of
temptation for many suppliers to cheat in the procurement        data mining. Data mining does not end when a solution is
process to find means of being awarded a lucrative govern-       achieved. In fact, the lessons learned during a process can
ment contract.                                                   be useful in subsequent processes.
The Brazilian Procurement Law has as one of its main ob-
jectives the curbing of corruption in public purchasing and
contracting. Concern over corruption is evident in Brazil’s
daily press. There are frequent accusations of public ad-
ministrators who did not abide by the procurement rules,
and are accused of favoring a certain supplier or worse, re-
ceiving a payment in the process.
When writing the law, legislators included many articles
that established penalties for firms or/and public legisla-
tors caught in corruption activities. There are two types of
penalties stated in Law N 8,666/93 dealing with this sub-
ject. They are administrative actions and penal actions.
Since enforcing the law is difficult [16], legislators will
have to find another manner to prevent corruption in public
procurement. The question is one of preventing corruption
practices, against one of punishing the ones that have al-
ready happened.
Therefore, the main objective of this project is to apply
Data Mining methods to create models, more specifically              Figure 1: Phases of the CRISP-DM Process Model
to learn Bayesian Network models, that will aid the experts
in identifying procurement frauds and, if possible, identify     The life cycle of a data mining project consists of six
suspicious transactions as soon as possible in order to pre-     phases. The sequence of the phases is not strict. Moving
vent the fraud from happening.                                   back and forth between different phases is often required.
                                                                 The arrows indicate the most important and frequent de-
The structure of this paper is as follows: Section 2 presents    pendencies between phases. The phases are [19]:
the CRoss Industry Standard Process for Data Mining
(CRISP-DM) methodology used in this work. Section 3                   Business Understanding: This initial phase fo-
presents the data used and its underlying semantics. Sec-                 cuses on understanding the project objec-
tion 4 presets the models learned as well as their evaluation.            tives and requirements from a business per-
Section 5 presents ideas on how the learned model for iden-               spective, and then converting this knowl-
tifying split purchases will be deployed. Finally, Section 6              edge into a data mining problem definition,
presents the conclusion and future work.                                  and a preliminary project plan designed to
                                                                          achieve the objectives.
                                                                      Data Understanding: The data understanding
2   Methodology                                                           phase starts with an initial data collection
                                                                          and proceeds with activities in order to get
The CRoss Industry Standard Process for Data Min-                         familiar with the data, to identify data qual-
ing (CRISP-DM) project developed an industry and tool-                    ity problems, to discover first insights into
neutral data mining process model. Starting from the                      the data, or to detect interesting subsets
embryonic knowledge discovery processes used in early                     to form hypotheses for hidden information.
data mining projects and responding directly to user re-                  There is a close link between Business Un-
quirements, this project defined and validated a data min-                derstanding and Data Understanding. The
ing process that is applicable in diverse industry sectors.               formulation of the data mining problem and
This methodology makes large data mining projects faster,                 the project plan require at least some under-
cheaper, more reliable, and more manageable. Even small-                  standing of the available data.
scale data mining investigations benefit from using CRISP-            Data Preparation: The data preparation phase
DM.                                                                       covers all activities to construct the final

                                                           71
          data set (data that will be fed into the mod-         merged data from different databases (DB) available in dif-
          eling tool(s)) from the initial raw data. Data        ferent agencies in Brazil.
          preparation tasks are likely to be performed
                                                                CGU has been working closely with subject matter experts
          multiple times, and not in any prescribed
                                                                (SMEs) in order to identify certain fraud topologies, such
          order. Tasks include table, record, and at-
                                                                as:
          tribute selection, data cleaning, construction
          of new attributes, and transformation of data
          for modeling tools.                                     1. Identify if owners from different companies are ac-
                                                                     tually partners. In the public procurement we expect
     Modeling: In this phase, various modeling tech-
                                                                     competition, but if the only two enterprises participat-
        niques are selected and applied, and their
                                                                     ing in the procurement have owners that are partners,
        parameters are calibrated to optimal values.
                                                                     then it is obvious that there is no competition at all, so
        Typically, there are several techniques for
                                                                     this should be prevented/identified.
        the same data mining problem type. Some
        techniques require specific data formats.                 2. In Brazil if the contract is small enough (8 thousand
     Evaluation: At this stage in the project you have               reais, or roughly around 4 thousand dollars), then
         built one or more models that appear to have                there is no need to actually go through the whole pro-
         high quality, from a data analysis perspec-                 curement process. However, sometimes, they break
         tive. Before proceeding to final deployment                 down a big contract in a lot of small ones. This is
         of the model, it is important to more thor-                 called split purchase2 and it is not allowable by law
         oughly evaluate the model, and review the                   and should be prevented/identified. E.g., identify a
         steps executed to construct the model, to be                lot of contracts that sums up to more than the thresh-
         certain it properly achieves the business ob-               old with the same objective (buying computers, for in-
         jectives. A key objective is to determine if                stance) in a week or a month.
         there is some important business issue that
         has not been sufficiently considered. At the           There are actually more than 20 typologies like these
         end of this phase, a decision on the use of            two described by the SMEs that we would like to iden-
         the data mining results should be reached.             tify/predict using the data set we have. However, this
     Deployment: Creation of the model is gener-                project focuses on the second one described above.
         ally not the end of the project. Usually, the          First the data set was loaded in Excel, then we re-
         knowledge gained will need to be organized             moved some characters (comma, double quotes, and single
         and presented in a way that the customer               quotes), because they were being a problem when loading
         can use it. Depending on the requirements,             the file in Weka (see [13] for details about Weka). Then we
         the deployment phase can be as simple as               replaced values like NA, -9, -8, and 0 by “?” to represent a
         generating a report or as complex as imple-            missing value in Weka.
         menting a repeatable data mining process.
         In many cases it will be the user, not the data        The initial data set had 42 attributes and 70,365 transac-
         analyst, who will carry out the deployment             tions. From Excel we were able to bring that number down
         steps. In any case, it is important to under-          to 26. Most of the attributes removed were identification of
         stand up front what actions will need to be            descriptions that were also available as a different attribute.
         carried out in order to actually make use of           We decided to keep the description because it is easier to
         the created models.                                    understand what it means. For instance, we kept the name
                                                                of the state instead of its ID.
The Business Understanding phase was already covered in         The next step was to save the data as a CSV file and load
the introduction section. The Data Understanding phase          it in Weka. Then we changed the year to nominal and re-
will not be covered in detail since the data and its struc-     moved rows with missing values in the final price attribute,
ture is confidential. The following sections will cover the     since we will need a value later on to identify the class.
Data Understanding, Data Preparation, Modeling, Evalua-
                                                                Before running any algorithm we started looking at the data
tion, and Deployment phases.
                                                                trying to understand it better. The first thing that we no-
                                                                ticed was the range of the values. There were values from
3   Data Understanding and Preparation                          cents to hundreds of trillions of dollars. Besides that, there
                                                                were cases where the proposed unit price was 4 thousand
The data set used in this work is related to procurements          2
                                                                     For more information about split purchases and
and contracts of IT services in the Brazilian Federal Gov-      case    examples  see    http://guide.iacrc.org/
ernment from 2005 to 2010. This data set is actually            potential-scheme-split-purchases/.

                                                           72
whereas the final actual price was a dollar. Those are typ-            20]), we decided to experiment with different BN learning
ical cases of data that should not be trusted or considered            algorithms in order to classify the transactions as regular or
incorrect.                                                             not, in order to identify split purchases as soon as possible,
                                                                       without having to wait for them to actually be confirmed as
The number of cases with a really high value is small
                                                                       such (before having several purchases summing up to more
so they should be carefully analyzed before being thrown
                                                                       than 8 thousand reais). We ran all classifiers with 10-fold
away, just to make sure they are not outliers, but actually in-
                                                                       cross-validation.
consistent data. For that, we got in contact with the SMEs
at CGU and asked them which ones should be ignored.                    At first we compared the results of Naı̈ve Bayes versus
With their consent we removed those transactions that were             Bayesian Network models. Since the data is unbalanced
considered noise. The transactions that could be outliers              with 50,911 regular cases and 2,626 irregular cases, we de-
are being analyzed by the experts in more detail.                      cided to resample the data.
Since we are interested in transactions that sum to 8,000 in           Table 1 presents the performance of the Naı̈ve Bayes
a given period of time (see typology 2 above), we computed             and Bayesian Network classifiers using 10-fold cross-
using R3 the transactions that involved the same institutions          validation. We learned models without resampling the data
on the same month and year that added up to more than                  set, oversampling the minority class, undersampling the
8,000, then we flagged them as irregular transactions4 (i.e.,          majority class, and both oversampling the minority class
split purchases).                                                      and undersampling the majority class. The metrics used
                                                                       to compare the models are regular false positive (FP) rate,
4       Modeling and Evaluation                                        irregular FP rate, accuracy, and Receiver Operating Char-
                                                                       acteristic (ROC) area.
The main objective of this work is to try to classify, without         The standard algorithm used in Weka for BN is K2. K2
looking at the sums, which transactions should be consid-              uses a hill climbing algorithm restricted by an order on
ered suspicious. The reason for that is to avoid waiting for           the variables. For more information on K2 see [8, 9].
the problem to occur (having several transactions that add             The parameters used for learning the BN model were all
to more than 8,000 reais) and to identify the problem as               the default values in Weka expect for the maximum num-
soon as possible.                                                      ber of parents allowed, which we changed from 1 to 2.
                                                                       The estimator parameter used, which is independent
Before being able to classify we had to discretize the value
                                                                       of which algorithm we use to learn the model, was the
of the transactions, since we decide to work with learning
                                                                       SimpleEstimator. SimpleEstimator is used for
algorithms that do not support continuous values. We used
                                                                       estimating the conditional probability tables of a BN once
equal frequency to generate 11 different bins. The num-
                                                                       the structure has been learned. It estimates probabilities di-
ber of bins could be different. The only thing we kept in
                                                                       rectly from data. The only configurable parameter for this
mind was that we did not want it to be binary, nor to have
                                                                       estimator is alpha. Alpha is used for estimating the prob-
too many bins to the point that we would have just a few
                                                                       ability tables and can be interpreted as the initial count on
transactions in each bin.
                                                                       each value. The default value used for alpha is .5. As
Since Bayesian Networks (BNs) have been successfully                   explained in [3], the options for the K2 algorithm in Weka
used in classification problems (e.g., see [4,7,10–12,17,18,           are:
    3
      R is a free software environment for statistical comput-
ing and graphics. For more information see http://www.                      initAsNaiveBayes – When set to true (de-
r-project.org/.                                                                fault), the initial network used for structure
    4
      Usually we also consider the type of service/product con-                learning is a Naı̈ve Bayes Network, that is,
tracted/bought. However, since we limited the scope of our anal-               a network with an arrow from the classifier
ysis to only computer purchases, all 10 different types of ser-
                                                                               node to each other node. When set to false,
vice/product are too broad and really similar (they all basically
say that these are computer related purchases without any signifi-             an empty network is used as initial network
cant difference between each other). Therefore, we can safely ig-              structure.
nore this field when defining split purchases, since they do not add        markovBlanketClassifier – When set
any more useful information. To avoid comparing purchases of                   to true (default is false), after a network
different nature, we use the contractor field, since a company that
provides software development does not usually sell computer pe-               structure is learned a Markov Blanket cor-
ripherals. Unfortunately, we do not have any other structured field            rection is applied to the network structure.
that we could use to improve this proxy classification. Of course              This ensures that all nodes in the network
we might have cases that purchases identified as split purchases               are part of the Markov blanket of the classi-
are, in fact, false positives. However, our current alert system               fier node.
would also identify those purchases as split purchases. What we
are trying to do is to identify those ”split purchases” as soon as          maxNrOfParents – Set the maximum num-
possible, instead of waiting for the whole month cycle.                        ber of parents a node in the Bayes net can

                                                                 73
        have. When initialized as Naı̈ve Bayes, set-             as explained before, the main objective is to obtain low reg-
        ting this parameter to 1 results in a Naı̈ve             ular FP rate and not necessarily high accuracy, which could
        Bayes classifier. When set to 2, a Tree Aug-             only mean that we are getting just the majority class right.
        mented Bayes Network (TAN) is learned,
                                                                 As BN gave the best result, we decided to try different
        and when set to greater than 2, a Bayes
                                                                 search algorithms. The algorithms used were K2 (again
        Net Augmented Bayes Network (BAN) is
                                                                 using the default parameters), Hill Climber5 , Tabu Search6 ,
        learned. By setting it to a value much larger
                                                                 and Tree Augmented Network7 (TAN).
        than the number of nodes in the network
        (the default of 100000 pretty much guaran-               The options for the Hill Climber algorithm in Weka
        tees this), no restriction on the number of              are the same as K2, except that it does not have
        parents is enforced.                                     the parameter randomOrder. Besides, we have the
     randomOrder – When set to true, the order of                useArcReversal, which when set to true (default is
        the nodes in the network is random. Default              false), the arc reversal operation is used in the search. As
        random order is false and the order of the               in K2, we have used all default parameters, except for the
        nodes in the data set is used. In any case,              maximum number of parents and the score metric (we used
        when the network was initialized as Naı̈ve               MDL instead of Bayes).
        Bayes Network, the class variable is first in            The Tabu Search algorithm in Weka has the same options
        the ordering though.                                     as Hill Climber. Besides that, it also has runs, which sets
     scoreType – The score type determines the                   the number of steps to be performed (default is 10), and
        measure used to judge the quality of a                   tabuList, which sets the length of the tabu list (default is
        network structure.       It can be one of                5). As in both K2 and Hill Climber, we have used all default
        Bayes (default), BDeu, Minimum Descrip-                  parameters, except for the maximum number of parents and
        tion Length (MDL), Akaike Information                    the score metric (we used MDL instead of Bayes).
        Criterion (AIC), and Entropy.
                                                                 The TAN algorithm in Weka only has the
When oversampling, we used the sample size of 200% to            markovBlanketClassifier and scoreType
avoid removing known cases of the majority class, which is       options with the same values as K2. We have used the
basically the oversampling of the minority class. The goal       default parameters. Note that the TAN algorithm does
was to increase the irregular cases to match the number of       not allow the configuration in the maximum number of
regular ones.                                                    parents, since it is always 2.

We can see that we got much better results with Bayesian         Table 2 shows a summary of the BN classifiers perfor-
Network, when oversampling the minority class. More-             mance using different search algorithms with oversampled
over, the false positive rate for the regular cases, which is    data. It is worth noting that we were only able to use the
our main concern since it represents the number of irregular     default Bayes metric with the K2 and TAN algorithms. We
transactions that were classified as regular, has dropped sig-   had to use MDL metric in order to avoid out of memory
nificantly after oversampling (from .363 to .140 for Naı̈ve      error with the Hill Climbing and Tabu Search algorithms.
Bayes and from .452 to .005 for Bayesian Network). As            This difference might be the reason why K2 and TAN out-
a result, the FP rate for the irregular cases increased while    performs both Hill Climbing and Tabu Search, since we
the accuracy decreased for NB and slightly increased for         have tried both K2 and TAN using MDL metric and we did
BN. Nevertheless, since we are more interested in correctly      obtain worse results. As it can be seen, TAN search algo-
classifying the irregular cases, the result with oversampling    rithm had the best performance with a ROC area of .999.
is better.                                                       Since oversampling resulted in very good results, we de-
When undersampling, we decided to use the sample size            cided to also try Synthetic Minority Oversampling TEch-
of (2, 626/50, 911) ⇤ 2 ⇡ 10%. Finally, we tried doing           nique (SMOTE) [6] to oversample the minority class. We
both oversampling and undersampling by keeping the sam-          used 1800% to get roughly the same number of transac-
ple size at 100%.
                                                                    5
                                                                       Hill Climber uses a hill climbing algorithm adding, deleting
As it can be seen from Table 1, oversampling the minority        and reversing arcs. The search is not restricted by an order on
class gave us the best results both in accuracy and in ROC       the variables (unlike K2). For more details see http://weka.
Area. Besides that, we did have a significant decrease in        sourceforge.net/doc.dev/weka/classifiers/
regular FP rate, which is our main concern. The accuracy         bayes/net/search/global/HillClimber.html.
                                                                     6
                                                                       Tabu Search algorithm uses tabu search for finding a well
without resampling is higher than most of the other results
                                                                 scoring BN structure. For more details see [2].
with resampling due to the fact that we had an increase in ir-       7
                                                                       TAN algorithm determines the maximum weight spanning
regular FP rate and since this is the majority class the num-    tree and returns a NB network augmented with a tree. For more
ber of correctly classified cases is much larger. However,       details see [10].

                                                           74
Table 1: Evaluation of Naı̈ve Bayes (NB) and Bayesian Network (BN) classifiers without resampling, with oversampling,
with undersampling, and with both over and undersampling

       Metric            NB        BN      NB - Over      BN - Over      NB - Under      BN - Under       NB - Both     BN - Both

 Regular FP Rate        .363       .452        .140          .005            .169             .068           .146          .012
 Irregular FP Rate      .028       .003        .048          .036            .082             .054           .054          .045
     Accuracy          95.6%      97.5%       90.6%         97.9%           87.4%            90.6%          90.0%         97.1%
     ROC Area           .931       .973        .975          .997            .945             .968           .971          .996


                  Table 2: BN classifiers performance with oversampling and different number of parents
                   Metric                 BN - K2                 BN - HC            BN - Tabu          BN - TAN

                   Parents             2          3           2          3           2          3            -
              Regular FP Rate        .005       .018        .098       .066        .093       .093           0
              Irregular FP Rate      .036       .038        .082       .067        .039       .039          .02
                  Accuracy          97.94%     97.23%      91.01%     93.35%      93.40%     93.40%      98.98%
                  ROC Area           .997       .996        .976       .985        .988       .988         .999


tions in both classes. After we oversampled the data, we            using Apriori algorithm [1, 14]. In the first run we used .1
discretized it as we did previously (11 bins with equal fre-        as lower bound min support, 1 as upper bound min sup-
quency). However, even though the SMOTE oversampling                port, lift as the metric type, 1.1 as min metric, and 50 as
gave better results for most metrics, the standard oversam-         number of rules. For more information on each of these
pling method outperformed the SMOTE method on the reg-              parameters see http://weka.sourceforge.net/
ular FP rate, which is the one that we are most interested          doc.dev/weka/associations/Apriori.html.
in. Therefore, we do not present the details of the results
                                                                    The results were related to variables that were not of in-
here.
                                                                    terest and even though they had a high confidence (1) and
Since TAN algorithm presented the best results, we decided          high lift (1.82) they were not describing anything that we
to try different values for the alpha parameter in order to         did not already know. So we decided to remove those vari-
improve its performance. Table 3 presents performance for           ables from the data set and run the algorithm again to see
the alpha values .7, .5, .3, .1, and .05. On the one hand,          the new rules it would generate.
when we increased the default value of .5 to .7, the per-
                                                                    We analyzed all 50 new rules generated and they still did
formance slightly worsened. On the other hand, when we
                                                                    not provide any insightful information. Besides that, these
decreased the default value we kept getting slightly better
                                                                    new rules were not as “strong” as the previous ones in the
results up to .05, when the performance started worsening
                                                                    sense that the best confidence was .62 and the best lift was
again.
                                                                    1.11.
Table 4 presents the confusion matrix for the best model we
                                                                    The reason for not presenting the rules is two-fold. The first
obtained, which was learned using TAN with the value .1
                                                                    is because, as previously explained, they did not provide
for the alpha parameter. As it can be seen, every single
                                                                    any insightful information. The second is because they are
split purchase (irregular) was correctly classified. Although
                                                                    confidential and we cannot discuss the contents of the rules.
the regular FP rate in Table 3 suggests that alpha values
                                                                    All that can be said is that they were not useful.
of .5, .3, and .05 also classified all irregular instances cor-
rectly, the model with .5 alpha missclassified 7 instances          Finally, we analyzed the BN that had the best evaluation
(the regular FP rate is shown as 0 due to rounding, since           results (BN – TAN with .1 alpha – standard oversampled
the value is actually .00013). Besides that, even though the        data set) to try to better understand the relations between
irregular FP rate is the same for the models with .1 and            the variables in order to comprehend why it is getting such
.05 alpha, the model with the .05 alpha missclassified 5 in-        good results. This could provide insightful information that
stances more than the model with .1 alpha.                          we were not able to extract from association rules.
In order to try to better understand the relationship be-           By analyzing the structure of the BN we were able to iden-
tween the variables, we generated some association rules            tify some interesting relations associated to the government


                                                            75
                         Table 3: TAN performance with different values for the alpha parameter
          Metric           TAN - .7 alpha        TAN - .5 alpha   TAN - .3 alpha     TAN - .1 alpha     TAN - .05 alpha

     Regular FP Rate            .001                   0                   0                  0                 0
     Irregular FP Rate          .023                 .020                .018               .016              .016
         Accuracy             98.834%              98.984%             99.108%            99.197%           99.192%
         ROC Area               .999                 .999                .999               .999              .999


                Table 4: Confusion matrix of the best model, which was learned using TAN with .1 alpha
                                                   Predicted Regular     Predicted Irregular

                                  Is Regular            52,670                    860
                                  Is Irregular            0                      53,544


office responsible for the procurement and also to the win-       fier is that as we warn/teach the different offices in Brazil
ner of the procurements. However, due to the confiden-            about this type of fraud, it is expected that they will de-
tiality nature of the problem we are not allowed to discuss       crease the number of irregular transactions associated to
these relations further. Nevertheless, we can comment on          this type of fraud. Therefore, it is important to keep track
the fact that to better understand these relations a more de-     of these numbers in order to evaluate if this expected be-
tailed study is necessary by looking at the conditional prob-     havior is actually happening and at what rate.
ability tables on these relations, which was not done in this
                                                                  Besides that, as people learn and change their behavior, it is
work due to time constraints.
                                                                  also expected that our classifier will also need some modifi-
                                                                  cations to cope with these changes. Therefore, it should be
5   Deployment                                                    expected that every month or so a new classifier should be
                                                                  trained and evaluated to learn the new behavior and analyze
                                                                  its performance.
As it was explained before, the main goal is to identify sus-
picious transactions as soon as possible to avoid the occur-
rence of split purchases.                                         6     Conclusion
Split purchases in the data set we worked on happen when
a procurement that should be done just once with a value          This project shows how Data Mining, BN learning more
higher than 8,000 is broken down in smaller ones to avoid         specifically, can be used to identify and prevent split pur-
the normal procurement procedure, allowing the office to          chases in Brazil using real data provided by the Brazil’s
hire an enterprise directly.                                      Office of the Comptroller General (CGU).
So, the idea is that we want to detect one or more of these       At first, the idea was to allow the identification of a few
procurements, but that still does not sum up to more than         different types of irregularities. However, as we started
8,000, as soon as they become suspicious (using the classi-       working with the data and creating the different models, it
fier) and act proactively. This will allow a faster response      became clear that this is a great effort and after discussing
and prevent irregularities in the first place. This means that    with stakeholders at CGU, we decided to prioritize depth
instead of waiting a whole month to find an irregular trans-      instead of breadth. In other words, it was decided that it
action, we will be finding suspicious ones (not necessarily       was best to choose one type of fraud and do a thoroughly
irregular ones) and warn/educate/teach the people involved        job to develop a model that is good enough to actually de-
that they should be careful, since breaking down big pro-         ploy it and analyze the results before looking at other types
curements in small ones is not allowed.                           of frauds.
Hopefully, this will make them think twice before continu-        Fortunately, the results paid off in the sense that we were
ing with these kinds of transactions, if they are really think-   able to generate a model that correctly classified all split
ing about doing irregular transactions on purpose. And at         purchases and a really high ROC area (.999). The next step
the same time this will educate those that are not aware they     is to deploy this classifier in CGUs intelligence system and
cannot do these kinds of transactions by law.                     analyze its impact.
Another thing to keep in mind when deploying this classi-         As future work, it is necessary to better understand why

                                                             76
the model is capable of getting such good results. By an-              minority over-sampling technique. Journal of Artifi-
alyzing the structure of the BNs generated, it is possible             cial Intelligence Research, 16:321357, 2002.
to understand relations between variables that were previ-
                                                                   [7] Jie Cheng and Russell Greiner. Comparing bayesian
ously unknown. A brief analysis showed that some inter-
                                                                       network classifiers. In Proceedings of the Fifteenth
esting relations were found, but a more detailed analysis is
                                                                       Conference on Uncertainty in Artificial Intelligence,
needed.
                                                                       UAI’99, page 101108, San Francisco, CA, USA,
Even though we have used resampling for dealing with the               1999. Morgan Kaufmann Publishers Inc.
rare event of split purchases and cross-validation to avoid
                                                                   [8] Gregory F. Cooper and Edward Herskovits. A
overfitting, we might have not completely ruled out the pos-
                                                                       bayesian method for constructing bayesian belief net-
sibility that this model might be slightly overfitted. There-
                                                                       works from databases. In Proceedings of the Seventh
fore, before deploying the model, we will gather new data
                                                                       Conference on Uncertainty in Artificial Intelligence,
from recent purchases in order to test the performance of
                                                                       UAI’91, page 8694, San Francisco, CA, USA, 1991.
our model with new unseen data. Due to time constraints,
                                                                       Morgan Kaufmann Publishers Inc.
we have not been able to gather the new data and perform
this test yet.                                                     [9] GregoryF. Cooper and Edward Herskovits.         A
Besides that, there are a lot of different types of frauds that        bayesian method for the induction of probabilistic
were not analyzed. So, a starting point for future work is             networks from data. Machine Learning, 9(4):309–
to do the same kind of analysis done in this project with              347, 1992.
different types of fraud.                                         [10] Nir Friedman, Dan Geiger, and Moises Goldszmidt.
                                                                       Bayesian network classifiers. Machine Learning,
Acknowledgements                                                       29(2-3):131–163, November 1997.
The authors would like to thank Prof. Daniel Barbará from        [11] Nir Friedman and Moises Goldszmidt. Building clas-
George Mason University for providing useful insights for              sifiers using bayesian networks. In Proceedings of
the development of this work. Finally, the authors would               the national conference on artificial intelligence, page
like to thank CGU for providing the resources necessary to             12771284, 1996.
work in this research, as well as for allowing its publication.   [12] Moises Goldszmidt, James J. Cochran, Louis A. Cox,
                                                                       Pinar Keskinocak, Jeffrey P. Kharoufeh, and J. Cole
References                                                             Smith. Bayesian network classifiers. In Wiley En-
                                                                       cyclopedia of Operations Research and Management
 [1] R. Agrawal and R. Srikant. Fast algorithms for min-               Science. John Wiley & Sons, Inc., 2010.
     ing association rules in large databases. In 20th Inter-
                                                                  [13] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard
     national Conference on Very Large Data Bases, pages
                                                                       Pfahringer, Peter Reutemann, and Ian H. Witten. The
     478–499. Morgan Kaufmann, Los Altos, CA, 1994.
                                                                       WEKA data mining software: An update. SIGKDD
 [2] R. R. Bouckaert. Bayesian Belief Networks: from                   Explor. Newsl., 11(1):1018, November 2009.
     Construction to Inference. PhD thesis, University of         [14] Bing Liu, Wynne Hsu, and Yiming Ma. Integrating
     Utrecht, Utrecht, Netherlands, 1995.                              classification and association rule mining. In Fourth
 [3] Remco R. Bouckaert. Bayesian network classiers in                 International Conference on Knowledge Discovery
     weka for version 3-5-7. Technical report, May 2008.               and Data Mining, pages 80–86. AAAI Press, 1998.
                                                                  [15] Hely Meirelles. Licitacao e Contrato Administra-
 [4] S. Ceccon, D.F. Garway-Heath, D.P. Crabb, and                     tivo. Editora Revista dos Tribunais, Sao Paulo, Brazil,
     A. Tucker. Exploring early glaucoma and the visual                7a. ed., atualizada pelo decreto-lei 2,300/86. edition,
     field test: Classification and clustering using bayesian          1987.
     networks. IEEE Journal of Biomedical and Health In-
     formatics, 18(3):1008–1014, May 2014.                        [16] Andre Mueller. A critical study of the brazilian pro-
                                                                       curement law. Technical report, IBI - The Institute
 [5] Pete Chapman, Julian Clinton, Randy Kerber,                       of Brazilian Business & Public Management Issues,
     Thomas Khabaza, Thomas Reinartz, Colin Shearer,                   Washington, 1998.
     and Rudiger Wirth. CRISP-DM 1.0 step-by-step data
     mining guide. Technical report, The CRISP-DM con-            [17] Mehran Sahami, Susan Dumais, David Heckerman,
     sortium, August 2000.                                             and Eric Horvitz. A bayesian approach to filtering
                                                                       junk e-mail. In Learning for Text Categorization:
 [6] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O.                    Papers from the 1998 workshop, volume 62, page
     Hall, and W. Philip Kegelmeyer. SMOTE: Synthetic                  98105, 1998.

                                                            77
[18] Wei Shi, Yao Wu Pei, Liang Sun, Jian Guo Wang,
     and Shao Qing Ren. The defect identification of LED
     chips based on bayesian classifier. Applied Mechanics
     and Materials, 333-335:1564–1568, July 2013.
[19] Rdiger Wirth. CRISP-DM: Towards a standard pro-
     cess model for data mining. In Proceedings of the
     Fourth International Conference on the Practical Ap-
     plication of Knowledge Discovery and Data Mining,
     page 2939, 2000.
[20] Ye Ye, Fuchiang (Rich) Tsui, Michael Wagner,
     Jeremy U. Espino, and Qi Li. Influenza detection
     from emergency department reports using natural lan-
     guage processing and bayesian network classifiers.
     Journal of the American Medical Informatics Asso-
     ciation, pages amiajnl–2013–001934, January 2014.


                                                       78

</pre>