=Paper=
{{Paper
|id=Vol-1663/bmaw2016_paper_3
|storemode=property
|title=Bayesian Networks on Income Tax Audit Selection - A Case Study of Brazilian Tax Administration
|pdfUrl=https://ceur-ws.org/Vol-1663/bmaw2016_paper_3.pdf
|volume=Vol-1663
|authors=Leon Silva,Henrique Rigitano,Rommel Novaes Carvalho,João Carlos Felix Souza
|dblpUrl=https://dblp.org/rec/conf/uai/SilvaRCS16
}}
==Bayesian Networks on Income Tax Audit Selection - A Case Study of Brazilian Tax Administration==
'''Authors'''

* Leon Sólon da Silva (Anexo Ministério da Defesa, 5o andar, Brasília, DF, Brazil), Secretariat of Federal Revenue of Brazil and Universidade de Brasília, leon.silva@rfb.gov.br
* Henrique de C. Rigitano (Av. Rogerio Weber, 1752 - Centro, Porto Velho, RO, Brazil), Secretariat of Federal Revenue of Brazil, henrique.rigitano@rfb.gov.br
* Rommel N. Carvalho, Brazil's Office of the Comptroller General (SAS, Quadra 01, Bloco A, Edificio Darcy Ribeiro, Brasilia, DF, Brazil) and Universidade de Brasília (Campus Darcy Ribeiro, Brasilia, DF, Brazil), rommel.carvalho@cgu.gov.br
* João Carlos F. Souza (Campus Darcy Ribeiro, Brasilia, DF, Brazil), Universidade de Brasília, jocafs@unb.br

BMAW 2016 - Page 14 of 59

===Abstract===

Tax administrations in most countries have more corporate and personal information than any other government office. Data mining techniques can be applied to many different problems because of the large number of tax returns received every year. In the present work we report a trial by the Brazilian Tax Administration in using Bayesian networks to predict taxpayers' behavior based on historical analysis of income tax compliance. More specifically, we tried to improve an existing risk-based audit selection process that flags a large number of taxpayers as high risk. In its current form it identifies many more cases than the tax auditors can handle. Our first results are promising, considerably improving tax audit performance.

===1 INTRODUCTION===

Tax administrations have more information on people and companies than any other government office. Tax returns, bank transactions, and invoices arrive as hundreds of millions of records every year. The Secretariat of Federal Revenue of Brazil (RFB) is both the Brazilian Tax Administration and the Brazilian Customs. This combination is a major source of leverage and also a challenge.

Basically, there are two types of taxes: sales taxes and income taxes. Sales taxes include value-added taxes and are based on the value of the product being sold. Income tax is based on how much a person or a company earns. In most countries, sales tax revenue is considerably larger than income tax revenue (OECD, 2013). In Brazil, corporate and personal income taxes account for about 50% of the country's revenue (RFB, 2016). Although corporate tax has a much greater impact on the final numbers, personal income tax audits affect a considerably large share of Brazilian citizens. There are 27 million individual taxpayers in Brazil, about 13% of the population (RFB, 2016).

In order to facilitate and prioritize tax audits on personal income tax, RFB created the concept of a "fiscal lattice". One can understand the fiscal lattice as a first audit selection based on historical risk analysis of tax compliance by taxpayers. This lattice is a complex process in which many tax auditors specialized in personal income tax fraud create risk-based rules for audit selection. The main difference between a regular audit and a fiscal lattice audit is that the latter has a much simpler process of analysis to determine whether or not to punish a taxpayer.

Since the number of taxpayers has increased, and the ratio of tax auditors to citizens has been falling (RFB, 2016), the number of income taxpayers caught in the fiscal lattice has increased as well. From 2010 to 2014, the number of taxpayers selected for this kind of audit increased sharply (RFB, 2016). This changing scenario is pushing the tax administration to the limit of the tax auditors' capacity for analysis. RFB has about 10,000 tax auditors and a huge backlog of fiscal lattice audits to analyze.

Data mining techniques can help select taxpayers for audit more effectively, and the present work offers one solution to improve the selection for this kind of audit. In Section 2.1 we discuss how Bayesian networks can be used as a classification algorithm in order to create predictive models.

The document is organized as follows: Section 2 describes some background information about Bayesian networks; Section 3 details the solution for the tax audit selection problem, from its methodology to our first results; Section 4 presents the conclusion and future work.

===2 BACKGROUND===

In this section we introduce some tax administration concepts, formulate the problem assessed by the present work, and discuss Bayesian networks for prediction.

====2.1 BAYESIAN NETWORKS FOR PREDICTIVE MODELS====

As stated by (Korb and Nicholson, 2010), Bayesian networks (BNs) are graphical models for reasoning under uncertainty, where the nodes represent variables (discrete or continuous) and arcs represent direct connections between them. These direct connections are often causal connections. In addition, BNs model the quantitative strength of the connections between variables, allowing probabilistic beliefs about them to be updated automatically as new information becomes available.

Bayesian networks are useful for learning from data and discovering causal relations between variables, and they can be used as classification algorithms. They are being used for prediction in many different problems, from genetics (Jansen et al., 2003) and prognosis of breast cancer (Gevaert et al., 2006) to identification of split purchases (Carvalho et al., 2014). In the present work, we use Bayesian networks to predict whether a taxpayer is compliant or non-compliant in terms of tax obligations.

In more detail, our approach presents an improvement of tax audit selection using Bayesian networks to build predictive models. In the next section we present the details of the solution to our problem, as well as the first results.

The next subsections describe two different types of Bayesian networks, Naïve Bayes and Tree-Augmented Naïve Bayes.

=====2.1.1 Naïve Bayes=====

Naïve Bayes is the simplest version of a Bayesian network. It makes a strong independence assumption: all explanatory variables (nodes) are treated as independent of each other given the class. Despite its simplicity, it has many applications with good results and fast run times, as stated in (Zhang, 2004).

Figure 1: Example of Naïve Bayes Network (Zhang, 2004)

=====2.1.2 Tree-Augmented Naïve Bayes=====

Tree-Augmented Naïve Bayes (TAN), as explained in (Zheng and Webb, 2011), relaxes the assumption of complete independence of the explanatory variables by enforcing a tree structure. In this case, each explanatory variable depends only on the class and at most one other variable. This relaxation allows the representation of more complex models, leading to possible performance improvements, as shown in (Carvalho et al., 2014).

Figure 2: Example of Tree-Augmented Naïve Bayes Network (Jiang et al., 2009)

====2.2 RELATED WORK====

As stated in (Silva et al., 2015), many tax administrations have been using data mining techniques to create predictive models for tax compliance risk. Despite this being a topic of great interest, tax administrations have many concerns about publishing internal projects. Since taxpayer information is classified and should be protected by tax officers, many of them do not share the details of tax compliance risk projects.

Intergovernmental organizations are a source of such information, case studies, methodologies, and best practices. For tax administrations and customs, the World Customs Organization (WCO) and the Organization for Economic Cooperation and Development (OECD) are important sources. In a recent survey covering many countries, OECD presented a comparative chart that shows the use of data mining to detect tax fraud (OECD, 2013).

Tax administrations' internal publications also present many studies that can be applied by other countries, and many of them have developed methodologies based on statistical analysis and data mining to create tax compliance risk systems. Most countries use data mining to classify taxpayers according to their risk of non-compliance.

Some studies, however, reveal different data analysis approaches in tax administration. The US Internal Revenue Service (IRS) uses data mining for different purposes, according to (Castellón González and Velásquez, 2013), among which are tax compliance risk based taxpayer classification, tax fraud detection, tax refund fraud, criminal activities, and money laundering (Watkins et al., 2003).

Another related reference is Jani Martikainen's master's thesis (Martikainen et al., 2012). He presents results of studies conducted by the Australian Taxation Office (ATO) concerning the use of models to detect high-risk tax refund claims. According to the author, the ATO avoided the payment of refunds of about US$ 665,000,000.00 between 2010 and 2011 based on data mining tools. ATO uses refund models based on social network discovery algorithms that detect connections between individuals, companies, partnerships, or tax returns. The models are updated and refined to enhance detection and increase the recognition of new fraud (Martikainen et al., 2012).

More closely related to the present work, (Gupta and Nagadevara, 2007) describe in detail different approaches to using data mining techniques to improve tax audit selection. The main difference is that in (Gupta and Nagadevara, 2007) the main taxes are value-added taxes, in contrast with income taxes, the object of the present research. Also, in (Kirkos et al., 2007) data mining is used to detect fraud in financial statements, an approach that can be easily adapted to tax returns and tax evasion/fraud.

===3 SOLUTION AND FIRST RESULTS===

In this section we describe the methodology used in the present work and detail each step of the data analysis, from information and data gathering to the construction of predictive models for improving tax audit selection.

====3.1 METHODOLOGY====

The methodology of the present work follows the well-known CRISP-DM. The Cross Industry Standard Process for Data Mining is a technology-independent methodology and reference model for implementing a data mining process in any business. It describes each phase that every data mining project should pass through. Each phase is equally relevant for the success of the data analysis process and should not be underestimated. The process has six phases and it is possible to perform the same step more than once. The phases of CRISP-DM are (Wirth and Hipp, 2000):

Figure 3: CRISP-DM Reference Model (Wirth and Hipp, 2000)

'''Business Understanding'''

Every data analysis process is designed to answer business questions to achieve business goals. In the business understanding phase of CRISP-DM these questions are asked and possible solutions are also proposed. Possible quantitative and qualitative improvements to the business process are also detailed, in order to justify the use of data mining techniques to solve business problems.

According to (Chapman et al., 2000), this initial phase focuses on understanding the project objectives and requirements from a business perspective, and then converting this knowledge into a data mining problem definition and a preliminary project plan designed to achieve the objectives.

'''Data Understanding'''

Once the business questions are clear, it is time to understand the information required to make the changes needed in the business process and achieve the goals identified in the previous phase. In data understanding, all sources of information needed to perform the analysis are determined. The first insights and main patterns are also identified in the first contact with the data available from the possible sources. Each business question needs to be mapped to every data source (systems, databases, webpages, etc.) in order to address every goal and identify possible gaps and lack of information.

In (Wirth and Hipp, 2000) it is stated that there is a close link between business understanding and data understanding. The formulation of the data mining problem and the project plan require at least some understanding of the available data.

'''Data Preparation'''

The data preparation phase covers all activities to construct the final dataset (data that will be fed into the modeling tool(s)) from the initial raw data. Data preparation tasks are likely to be performed multiple times, and not in any prescribed order. Tasks include table, record, and attribute selection, data cleaning, construction of new attributes, and transformation of data for modeling tools.

'''Modeling'''

In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Typically, there are several techniques for the same data mining problem type. Some techniques require specific data formats. There is a close link between data preparation and modeling. Often, one realizes data problems while modeling or one gets ideas for constructing new data.

'''Evaluation'''

At this stage in the project you have built one or more models that appear to have high quality, from a data analysis perspective. Before proceeding to final deployment of the model, it is important to more thoroughly evaluate the model, and review the steps executed to construct the model, to be certain it properly achieves the business objectives. A key objective is to determine if there is some important business issue that has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

'''Deployment'''

Creation of the model is generally not the end of the project. Usually, the knowledge gained will need to be organized and presented in a way that the customer can use it. Depending on the requirements, the deployment phase can be as simple as generating a report or as complex as implementing a repeatable data mining process. In many cases it will be the user, not the data analyst, who will carry out the deployment steps. In any case, it is important to understand up front what actions will need to be carried out in order to actually make use of the created models.

====3.2 BUSINESS UNDERSTANDING====

Our main goal is to improve audit selection for individual taxpayers. We try to achieve better audit process performance by making better use of the tax auditors' knowledge and the time available to perform these audits. As in any tax administration, there are far more taxpayer returns and information to analyze than tax officers can handle, and to achieve the revenue goals and tax fairness it is essential that audit selection be as risk based as possible.

In Brazil, individual taxpayers pay their income taxes every month. Since the tax is calculated on a yearly basis, taxpayers are obliged to send their income tax return by April of the following year in order to adjust their debt (or credit). Every year, tens of millions of returns are sent to RFB, many more than it could handle if there were no risk-based selection.

RFB created the concept of the "fiscal lattice" to select personal income tax returns based on tax compliance risk. In this technique, personal income tax fraud experts analyze the history of all taxpayers and draw on their previous knowledge in order to come up with parameters to select the tax returns for audit. Once a return is caught in the "fiscal lattice", only a tax officer can release it, preventing fraudsters from receiving a possible credit. There are three main purposes in using this technique:

* to better select taxpayers based on tax compliance risk;
* to facilitate verification by tax auditors, since each parameter has well-defined analysis and treatment activities;
* to ease the self-correction of tax returns by taxpayers, since many of them were caught due to filling errors.

Despite all the Brazilian tax administration's efforts to select individual tax audits, the number of audits selected by the fiscal lattice increased from 569,000 in 2011 to 937,000 in 2014 (in 2015 this number decreased to 670,000 due to efforts to better select individual tax returns for audit), in contrast with the number of tax officers, which decreased from 12,273 in 2010 to 10,419 in 2015 (RFB, 2016).

More specifically, we intend to use data mining techniques to release as many taxpayers as possible from the fiscal lattice, with the minimum compliance risk to the tax administration. With thousands of audits already finalized by experienced tax auditors, it is possible to approach this problem with machine learning tools and achieve better results in releasing those taxpayers that pose a lower risk of non-compliance.

In our first attempt to address the problem with data mining techniques, we selected an RFB unit that has been suffering from the large number of fiscal lattice audits. The "Delegacia Especial de Pessoa Física" (DERPF), or "Individual Taxpayers Special Office", is a unit specialized in individual taxpayers located in São Paulo City, the biggest Brazilian city, in the most economically active federation unit (State of São Paulo). This unit has been at its limit of fiscal audits since its creation in 2014, and has the largest number of this kind of audit in the whole country. It was a natural choice for our first experiments.

====3.3 DATA UNDERSTANDING AND PREPARATION====

To answer the business question of how to improve the selection of individual taxpayers caught in the fiscal lattice, we evaluated the sources of the information needed to perform the data mining analysis. Our sample was taken from audits performed by DERPF from 2014 to 2016.

Basically, all individual taxpayer information was taken from internal systems, ranging from online systems to data marts and data warehouses. Most of the information on taxpayers caught in the fiscal lattice is available from tax returns, but some information is taken from invoices and financial operations. The exact properties retrieved by the data extraction, as well as the fraud/non-compliance rate, are classified information.

The final taxpayer table has 25,322 taxpayers' returns analyzed by tax auditors and classified as compliant or non-compliant. Each line has, besides the dependent variable (compliant), 20 other characteristics of taxpayers and information retrieved from returns and other systems. The other explanatory variables are tax return information and unfortunately cannot be specified because they could reveal classified information: the result of the analysis could lead taxpayers to learn fraud patterns and use that information to avoid being caught.

For preparation, all independent variables were analyzed in order to remove incomplete rows and to discretize continuous variables to comply with the constraints of the Bayesian network algorithms. The numeric variables were classified into bands in terms of multiples of the average (one average, half the average, three times the average, etc.). After data preparation the final number of individual taxpayers' returns was 24,277. Of these, 13,547 are from women and 10,730 are from men.

All data preparation took place using the R language (https://www.r-project.org/) and its packages.

====3.4 MODELING AND EVALUATION====

We used the bnlearn R package (http://www.bnlearn.com/) to run the Bayesian network algorithms. Specifically, the functions <code>naive.bayes</code> and <code>tree.bayes</code> were chosen to create the predictive models. The first is the well-known Naïve Bayes algorithm, which does not take parameters for customizing the models, and the second is an implementation of the Tree-Augmented Naïve Bayes (TAN) algorithm. The TAN algorithm takes a white list (forcing the inclusion of arcs in the Bayesian network), a black list (forcing the exclusion of arcs in the Bayesian network), and the <code>mi</code> parameter. The <code>mi</code> parameter is the estimator used for the mutual information coefficients for the Chow-Liu algorithm in TAN. Possible values are <code>mi</code> (discrete mutual information) and <code>mi-g</code> (Gaussian mutual information). We use the discrete estimator since all explanatory variables have been discretized.

To create the predictive models we took the compliant variable as dependent and the other 35 (thirty five) pieces of information as independent variables. The sample of 24,277 returns was divided into training (80%) and test (20%) sets. No validation sample was needed since we used the 10-fold cross-validation technique with bnlearn's function <code>bn.cv()</code>.

As stated in the <code>bn.cv()</code> documentation (CRAN, 2016), k-fold is a technique where the data is split into k subsets of equal size. For each subset in turn, the network is fitted (and possibly learned as well) on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for the data.

Since the proportion of compliant/non-compliant taxpayers is classified information, we present the results of the predictive models in terms of improvements over the current process of releasing taxpayers from the fiscal lattice. Since our dependent variable is compliant/non-compliant, we are interested in evaluating the models by specificity more than sensitivity (with compliant as the positive class, specificity measures how many non-compliant taxpayers are correctly kept for audit), since it is more dangerous to let a non-compliant taxpayer go without being audited than to select a compliant one for audit.

Each local unit of the Brazilian tax administration is autonomous and may choose whatever criteria it finds best to dismiss taxpayers from the fiscal lattice. So, for the sake of comparison with our proposal, we consider a linear cut (random selection) of taxpayers until it reaches a unit's capacity. If, for example, an office has the capacity to audit 2,000 taxpayers per month, and there are 3,000, we consider the current process to randomly choose the 1,000 to be dismissed. The share of taxpayers wrongly dismissed is then the same as the proportion of non-compliant taxpayers among all those caught in the fiscal lattice.

Our goal is to better predict whether a taxpayer caught in the fiscal lattice is compliant or not. If we reach a specificity considerably better than random selection, we achieve our goal of letting go as few non-compliant taxpayers as possible.

As we learn from Table 1, Naïve Bayes is already a good tool to select those taxpayers who can and cannot be dismissed from being audited. Tree-Augmented Naïve Bayes had no major advantages, despite the customization of parameters (root chosen automatically or user defined).

{| class="wikitable"
|+ Table 1: Predictive Models by Algorithm/Parameters
! Algorithm !! Performance Rate
|-
| Naive Bayes || 41 %
|-
| TAN (auto root) || 34 %
|-
| TAN (selected root) || 35 %
|}

Therefore, the predictive models in these first experiments showed encouraging results, with an increase of more than 30% in tax audit selection performance in comparison to randomly releasing taxpayers. It is important to recall that the taxpayers caught in the fiscal lattice have already been through a risk-based selection process, so any improvement on this criterion supports the use of Bayesian networks to build models of tax compliance.

===4 CONCLUSION AND FUTURE WORK===

Brazil has been through a major crisis and the responsibility of the RFB as a tax administration has also increased in order to guarantee the revenue for public policies. A better selection of tax audits saves resources and increases the performance of the tax collection process. Our approach of creating predictive models to improve the risk-based selection of the so-called "fiscal lattice" proved to be a promising one based on the first results.

We intend to use different approaches and Bayesian network algorithms in order to create compliance risk scores and leave the decision on whether a taxpayer is compliant to the tax officers, possibly increasing the specificity. The approach in the present work delegates this decision to the prediction algorithm.

Furthermore, we will try to build Bayesian networks with larger samples and more tax units and include more information about the taxpayer, since in this work we basically used income tax returns and registry information. Financial transactions and invoice data could be interesting explanatory variables and will be used in future applications.

===Acknowledgements===

The authors would like to thank RFB, especially DERPF, for providing the resources necessary to work on this research, as well as for allowing its publication.

===References===

* Rommel N Carvalho, Leonardo Sales, Henrique A Da Rocha, and Gilson Libório Mendes. Using bayesian networks to identify and prevent split purchases in brazil. In BMA@UAI, pages 70–78, 2014.
* Pamela Castellón González and Juan D Velásquez. Characterization and detection of taxpayers with false invoices using data mining techniques. Expert Systems with Applications, 40(5):1427–1436, 2013.
* Pete Chapman, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rudiger Wirth. Crisp-dm 1.0 step-by-step data mining guide. 2000.
* CRAN. Cran project. package bnlearn. https://cran.r-project.org/web/packages/bnlearn/index.html, 2016. Accessed: 2016-05-08.
* Olivier Gevaert, Frank De Smet, Dirk Timmerman, Yves Moreau, and Bart De Moor. Predicting the prognosis of breast cancer by integrating clinical and microarray data with bayesian networks. Bioinformatics, 22(14):e184–e190, 2006.
* Manish Gupta and Vishnuprasad Nagadevara. Audit selection strategy for improving tax compliance: Application of data mining techniques. In Foundations of Risk-Based Audits. Proceedings of the eleventh International Conference on e-Governance, Hyderabad, India, December, pages 28–30, 2007.
* Ronald Jansen, Haiyuan Yu, Dov Greenbaum, Yuval Kluger, Nevan J Krogan, Sambath Chung, Andrew Emili, Michael Snyder, Jack F Greenblatt, and Mark Gerstein. A bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644):449–453, 2003.
* Liangxiao Jiang, Harry Zhang, and Zhihua Cai. A novel bayes model: Hidden naive bayes. Knowledge and Data Engineering, IEEE Transactions on, 21(10):1361–1371, 2009.
* Efstathios Kirkos, Charalambos Spathis, and Yannis Manolopoulos. Data mining techniques for the detection of fraudulent financial statements. Expert Systems with Applications, 32(4):995–1003, 2007.
* Kevin B Korb and Ann E Nicholson. Bayesian artificial intelligence. CRC press, 2010.
* Jani Martikainen et al. Data mining in tax administration-using analytics to enhance tax compliance. Department of Information and Service Economy. Aalto University, 2012.
* OECD. Tax administration 2013 - comparative information on oecd and other advanced and emerging economies. Technical Report 2308-7331, Organisation for Economic Co-operation and Development, Paris, 2013. URL http://www.oecd-ilibrary.org/content/serial/23077727.
* RFB. Secretariat of federal revenue of brazil (rfb) website. http://www.receita.fazenda.gov.br, 2016. Accessed: 2016-05-08.
* Leon Sólon da Silva, Rommel Novaes Carvalho, and João Carlos Felix Souza. Predictive models on tax refund claims-essays of data mining in brazilian tax administration. In Electronic Government and the Information Systems Perspective, pages 220–228. Springer, 2015.
* R Cory Watkins, K Michael Reynolds, Ron Demara, Michael Georgiopoulos, Avelino Gonzalez, and Ron Eaglin. Tracking dirty proceeds: exploring data mining technologies as tools to investigate money laundering. Police Practice and Research, 4(2):163–178, 2003.
* Rüdiger Wirth and Jochen Hipp. Crisp-dm: Towards a standard process model for data mining. In Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining, pages 29–39. Citeseer, 2000.
* Harry Zhang. The optimality of naive bayes. AA, 1(2):3, 2004.
* Fei Zheng and Geoffrey I Webb. Tree augmented naive bayes. In Encyclopedia of Machine Learning, pages 990–991. Springer, 2011.
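
As an illustration of the discretization step described in Section 3.3 (numeric variables classified into bands by multiples of the average), the following is a minimal Python sketch. The paper did this work in R and does not publish its band edges, so the multipliers, the function name <code>discretize_by_average</code>, and the sample values are all assumptions made for this example.

```python
from bisect import bisect_right

# Hypothetical band edges, expressed as multiples of the column average.
# The paper mentions "half average", "one average", "three times average"
# as examples; the exact set it used is not published.
BAND_MULTIPLIERS = [0.5, 1.0, 3.0]

def discretize_by_average(values, multipliers=BAND_MULTIPLIERS):
    """Map each numeric value to a band index based on multiples of the mean."""
    mean = sum(values) / len(values)
    edges = [m * mean for m in multipliers]
    # bisect_right counts how many edges are <= the value: that is the band.
    return [bisect_right(edges, v) for v in values]

# Made-up numbers for illustration only (mean = 40, edges = 20, 40, 120).
incomes = [10.0, 20.0, 30.0, 40.0, 100.0]
bands = discretize_by_average(incomes)
print(bands)
```

Band 0 holds values below half the average, band 1 values from half up to the average, and so on. A value exactly on an edge falls into the higher band in this sketch.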
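
To make the Naïve Bayes classifier of Section 2.1.1 concrete, here is a small Python sketch for discrete features. It is not the bnlearn <code>naive.bayes</code> implementation used in the paper. The Laplace smoothing parameter <code>alpha</code>, the function names, and the toy rows are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict
from math import log

def fit_naive_bayes(rows, labels, alpha=1.0):
    """Count class and per-feature value frequencies for discrete features."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    levels = defaultdict(set)             # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            levels[i].add(value)
    return class_counts, value_counts, levels, alpha

def predict_naive_bayes(model, row):
    """Return the label with the highest posterior log-probability."""
    class_counts, value_counts, levels, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_label in class_counts.items():
        score = log(n_label / total)                      # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # Smoothed conditional probability of this value given the class.
            score += log((count + alpha) / (n_label + alpha * len(levels[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy, made-up data: (income band, has dependents) -> audit outcome.
rows = [("low", "yes"), ("low", "yes"), ("high", "no"), ("high", "no"), ("high", "yes")]
labels = ["compliant", "compliant", "non-compliant", "non-compliant", "compliant"]
model = fit_naive_bayes(rows, labels)
prediction_a = predict_naive_bayes(model, ("low", "yes"))
prediction_b = predict_naive_bayes(model, ("high", "no"))
print(prediction_a, prediction_b)
```

The per-feature terms are simply added in log space, which is exactly the independence assumption: given the class, each explanatory variable contributes on its own.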
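
The tree structure that TAN adds (Section 2.1.2) is usually learned with the Chow-Liu procedure mentioned in the description of the <code>mi</code> parameter: compute the mutual information between each pair of explanatory variables conditioned on the class, then keep a maximum spanning tree. The Python sketch below shows that idea under stated assumptions. It uses Prim's algorithm and plain empirical counts, and it is not a reproduction of bnlearn's <code>tree.bayes</code>. The function names and toy columns are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log

def conditional_mutual_information(xs, ys, cs):
    """Empirical I(X; Y | C) in nats for three equally long discrete sequences."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), count in n_xyc.items():
        # p(x, y, c) * log( p(x, y | c) / (p(x | c) * p(y | c)) )
        total += (count / n) * log(count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)]))
    return total

def tan_tree(columns, cs, root=0):
    """Return (parent, child) arcs of a maximum spanning tree over the features."""
    k = len(columns)
    weight = {}
    for i, j in combinations(range(k), 2):
        w = conditional_mutual_information(columns[i], columns[j], cs)
        weight[(i, j)] = weight[(j, i)] = w
    in_tree, arcs = {root}, []
    while len(in_tree) < k:
        # Prim's algorithm: add the heaviest edge leaving the current tree.
        parent, child = max(
            ((p, c) for p in in_tree for c in range(k) if c not in in_tree),
            key=lambda edge: weight[edge],
        )
        arcs.append((parent, child))
        in_tree.add(child)
    return arcs

# Made-up columns: feature 1 duplicates feature 0, feature 2 is unrelated.
cs = [0, 0, 0, 0, 1, 1, 1, 1]
f0 = [0, 0, 1, 1, 0, 0, 1, 1]
f1 = list(f0)
f2 = [0, 1, 0, 1, 0, 1, 0, 1]
cmi_01 = conditional_mutual_information(f0, f1, cs)
arcs = tan_tree([f0, f1, f2], cs)
print(cmi_01, arcs)
```

In the resulting network every feature also has the class as a parent. The arcs returned here are only the extra feature-to-feature dependencies, at most one incoming arc per feature.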
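
The sampling scheme of Section 3.4 (an 80%/20% training/test split of the 24,277 returns, followed by 10-fold cross-validation on the training part) can be sketched in Python as below. The paper relied on bnlearn's <code>bn.cv()</code> in R. The helper names, the fixed seed, and the interleaved fold assignment are assumptions of this sketch, and it works on row indices only.

```python
import random

def train_test_split(indices, test_fraction=0.2, seed=0):
    """Shuffle the indices and hold out a fraction of them as the test set."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold(indices, k=10):
    """Yield (fit_indices, held_out_indices) for k folds of near-equal size."""
    folds = [indices[i::k] for i in range(k)]
    for held_out in folds:
        held = set(held_out)
        # Fit on the other k - 1 folds, compute the loss on the held-out fold.
        yield [i for i in indices if i not in held], held_out

train, test = train_test_split(range(24277))
fold_sizes = [len(held_out) for _, held_out in k_fold(train)]
print(len(train), len(test), fold_sizes)
```

Each row of the training set is held out exactly once, so averaging the ten per-fold losses gives one overall estimate without a separate validation sample.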
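
The evaluation logic of Section 3.4 combines two simple calculations: specificity with "compliant" as the positive class, and the expected number of wrong dismissals under the random-cut baseline (3,000 taxpayers caught, capacity for 2,000, so 1,000 released at random). The Python sketch below works through both. The real non-compliance rate is classified, so the rate of 0.6 and the toy label vectors are invented purely for illustration.

```python
def sensitivity_specificity(actual, predicted, positive="compliant"):
    """Return (sensitivity, specificity) for a binary classification."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

def expected_wrong_dismissals(n_dismissed, non_compliant_rate):
    """Expected non-compliant taxpayers released by a purely random cut."""
    return n_dismissed * non_compliant_rate

# Toy, made-up audit outcomes versus model predictions.
actual = ["compliant"] * 3 + ["non-compliant"] * 4
predicted = ["compliant", "compliant", "non-compliant",
             "non-compliant", "non-compliant", "non-compliant", "compliant"]
sensitivity, specificity = sensitivity_specificity(actual, predicted)

# Random-cut baseline from the paper's example, with a hypothetical rate.
caught, capacity = 3000, 2000
wrongly_released = expected_wrong_dismissals(caught - capacity, 0.6)
print(sensitivity, specificity, wrongly_released)
```

A false positive here is a non-compliant taxpayer predicted compliant and therefore released, which is the costly error the paper wants to minimize. That is why specificity, not sensitivity, is the figure of interest.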