=Paper=
{{Paper
|id=Vol-2267/523-527-paper-100
|storemode=property
|title=Checking foreign counterparty companies using Big Data
|pdfUrl=https://ceur-ws.org/Vol-2267/523-527-paper-100.pdf
|volume=Vol-2267
|authors=Lazar A. Badalov,Sergey D. Belov,Ivan S. Kadochnikov
}}
==Checking foreign counterparty companies using Big Data==
Proceedings of the VIII International Conference "Distributed Computing and Grid-technologies in Science and Education" (GRID 2018), Dubna, Moscow region, Russia, September 10 - 14, 2018

CHECKING FOREIGN COUNTERPARTY COMPANIES USING BIG DATA

L.A. Badalov 1, S.D. Belov 1,2,a, I.S. Kadochnikov 1,2

1 Plekhanov Russian University of Economics, 36 Stremyanny per., Moscow, 117997, Russia

2 Laboratory of Information Technologies, Joint Institute for Nuclear Research, 6 Joliot-Curie, Dubna, Moscow region, 141980, Russia

E-mail: a sergey.belov@jinr.ru

The project aims to create a database of companies and company data, and an automated analytical system based on this data. The system will allow credit institutions to obtain information about the links between companies and to carry out a "Know Your Customer" policy: to identify the ultimate beneficiaries, to assess risks, and to identify relationships between customers. It could help banks meet the requirements of national authorities, laws on offshore tax evasion and FATCA, and the recommendations of the Financial Action Task Force on Money Laundering (FATF) and the Basel Committee on Banking Supervision. At the moment, there are projects such as OpenCorporates that maintain global databases of companies collected from many jurisdictions. However, they cover neither all the national registries nor other useful data sources (courts, customs, press, etc.). In addition, the existing services have only rudimentary capabilities for finding relations between companies, which are not always direct. The project we present aims to overcome the main deficiencies of this kind. There are more than 150 million companies worldwide. With company information coming from many sources, there is no reasonable way to process it other than with Big Data technologies. In this research we use such technologies along with machine learning and graph databases.
Keywords: companies controlling, money laundering, finances, Big Data

© 2018 Lazar A. Badalov, Sergey D. Belov, Ivan S. Kadochnikov

===1. Introduction===
The integration of the Russian financial system into the international system has led to a wide range of Bank of Russia regulatory documents on the disclosure of information about the ultimate beneficiaries of non-resident clients, as well as other requirements. In addition, deoffshorization laws and FATCA [1] have made it necessary for banks to have information about the ultimate beneficiaries. The US Foreign Account Tax Compliance Act (FATCA) not only obliges Russian banks to identify US taxpayers among their customers, but also threatens large fines in case of non-compliance. The risks associated with possible sanctions by the US Internal Revenue Service (IRS) are extremely high: banks can be penalized by the compulsory withholding of 30% of the amounts of international transfers. Thus, US law establishes obligations for credit institutions and penalties for non-performance or improper performance of these obligations, but does not provide the tools with which to comply with such requirements.

Another responsibility imposed on credit institutions is the regular re-examination of the entire customer base in order to identify non-resident customers and customers who are foreign taxpayers, which, in the absence of information about the beneficial owners, is in effect equivalent to re-identification. The main result of meeting these responsibilities is the ability to confirm or refute the reality of non-resident clients, that is, to establish whether they carry out real activities or are technical companies that can be used for questionable transactions.

===2. Information sources===
====2.1. Classification of the open data sources====
We analyzed the main available sources of information for several jurisdictions, determining the nature and completeness of the data they contain and the technical means of accessing them. The number of companies worldwide exceeds 150 million [2]. Currently, the system considers information from the following types of open sources:

• National registers of companies;

• Portals aggregating various company data (like OpenCorporates [3]);

• Databases of financial oversight services;

• Global database of legal entity identifiers (GLEIF [4]).

In the future, it is also planned to use the following sources:

• Tax service data;

• Customs information;

• Databases of court decisions;

• Data leaks from offshore jurisdictions and the results of investigative journalism;

• Databases of distressed companies (temporarily managed, in the process of closing, etc.);

• Other sources of commercial information (persons, beneficiaries, public documents on administrative proceedings, etc.).

At the moment, 40 main data sources are considered.

====2.2. Data consolidation====
Data collection and processing are carried out using modern methods and technologies for obtaining information from web-based sources. Depending on the data source, three main methods of access are available:

• Retrieving all or part of the information as archive files;

• Programmatic access to sources using HTTP requests;

• Searching for information about a company or companies using the source website.

The first method, the use of archives, makes it possible to quickly obtain significant amounts of fairly complete information about companies.
The second method, programmatic access to sources, is primarily suited to searching for data about a particular company. Obtaining information about all companies available in the register in this way requires additional scanning and considerable time.

The third method, extracting information from a source's website, is the most time-consuming: websites are designed for visual perception by a person, the machine-readable information varies greatly from source to source, pages may rely on additional client-side software processing, and so on. Retrieving information from sources of this type therefore often requires the development of separate specialized modules, which is done within the framework of the project.

====2.3. Data gathering and pre-processing====
The system gathers information from the sources in two ways: on a schedule and on request. Scheduled gathering is implemented as a periodically run job that downloads the archives, scans the sources for information about the largest possible number of companies, and updates information on entities already entered into the database. On-request gathering is carried out when a user accesses the system: the relevant information is obtained from the sources and saved in the database. Special modules have been developed to collect the information from the sources on the Internet.

At the pre-processing stage the information is structured, and the main fields needed for further storage and for the analysis of relationships with other companies are extracted.
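The pre-processing step can be illustrated with a minimal Python sketch that maps a source-specific record onto a common structure. The `CompanyRecord` class, the `normalize` function, and the raw keys (`company_name`, `number`, `country`, `status`, `officers`, `url`) are hypothetical names chosen for illustration; the actual schema and modules used in the project are not described here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CompanyRecord:
    """Common structure for one company (illustrative field names)."""
    name: str
    jurisdiction: Optional[str] = None
    registry_ids: Dict[str, str] = field(default_factory=dict)
    status: Optional[str] = None
    officers: List[str] = field(default_factory=list)
    source_url: Optional[str] = None


def _clean(value: Any) -> Optional[str]:
    """Collapse runs of whitespace; return None for missing or empty values."""
    if value is None:
        return None
    text = " ".join(str(value).split())
    return text or None


def normalize(raw: Dict[str, Any], source: str) -> CompanyRecord:
    """Map a raw record from a (hypothetical) source onto CompanyRecord."""
    name = _clean(raw.get("company_name"))
    if name is None:
        raise ValueError("record has no company name")

    # Keep the registry ID together with the source that issued it.
    registry_ids: Dict[str, str] = {}
    number = _clean(raw.get("number"))
    if number is not None:
        registry_ids[source] = number

    jurisdiction = _clean(raw.get("country"))
    # Drop officer entries that are empty after cleaning.
    officers = [o for o in (_clean(v) for v in raw.get("officers", [])) if o]

    return CompanyRecord(
        name=name,
        jurisdiction=jurisdiction.upper() if jurisdiction else None,
        registry_ids=registry_ids,
        status=_clean(raw.get("status")),
        officers=officers,
        source_url=_clean(raw.get("url")),
    )
```

For example, `normalize({"company_name": "  Example   Trading  Ltd ", "number": "0123", "country": "gb"}, source="hypothetical-registry")` yields a record with the name `Example Trading Ltd`, the jurisdiction `GB`, and a single registry ID keyed by the source name.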
Basic information about the company includes:

• Name (including previous names);

• The IDs of the company in the registers;

• Jurisdiction;

• Company status;

• Form of organization;

• Registration date;

• Legal address, other contact information;

• CEOs and other management officials;

• Founders, owners, subsidiaries;

• Links to the original source of the information.

===3. Information processing and analysis===
====3.1. Revealing links between companies====
To identify affiliations between companies, in addition to the direct comparison of relationships through founders and owners, an analysis of indirect signs is used. We consider companies that coincide in several respects: primarily fragments of the name, officers, founders, registration address, contact information, owners, subsidiaries, historical ties, similarity of names and company profiles, etc. In addition, previously found relations are used.

The discovered links between companies are stored in a graph database whose entries are both the companies and other object types (officers, founders, registration addresses, contact information). This approach allows for more flexible link analysis and complex search queries. The Neo4j graph database [5] is used for the analysis and storage of the revealed connections. This database also makes it possible to visualize the graph of links using built-in tools.

====3.2. Software infrastructure====
The software infrastructure of the system is implemented with a stack of software products and tools that have become de facto industry standards in their areas: Spark [6, 7], Hadoop, Kafka, Flume, Marathon, Docker, Elasticsearch. Deploying these basic components as clusters provides scalability and high availability of the system. The technologies and algorithms used make it easy to develop collection and analysis tools and to connect new structured and unstructured data sources.
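The indirect-sign matching of Section 3.1 can be illustrated with a small in-memory Python sketch. It is not the project's implementation: the system stores links in Neo4j, whereas this example uses plain dictionaries, and the function name `find_candidate_links`, the attribute kinds, and the `min_shared` threshold are assumptions made for illustration. Companies and their attribute values form a bipartite graph, and two companies become a candidate link when they share at least `min_shared` attribute values.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

Attribute = Tuple[str, str]   # (kind, normalized value), e.g. ("address", "1 main st")
Pair = Tuple[str, str]        # two company IDs in sorted order


def find_candidate_links(
    companies: Dict[str, Dict[str, List[str]]],
    min_shared: int = 2,
) -> Dict[Pair, Set[Attribute]]:
    """Return company pairs that share at least `min_shared` attribute values."""
    # Bipartite index: attribute node -> companies connected to it.
    index: Dict[Attribute, Set[str]] = defaultdict(set)
    for company_id, attrs in companies.items():
        for kind, values in attrs.items():
            for value in values:
                normalized = " ".join(value.lower().split())
                if normalized:
                    index[(kind, normalized)].add(company_id)

    # Every attribute shared by two companies counts as one indirect sign.
    shared: Dict[Pair, Set[Attribute]] = defaultdict(set)
    for attribute, company_ids in index.items():
        for a, b in combinations(sorted(company_ids), 2):
            shared[(a, b)].add(attribute)

    return {pair: signs for pair, signs in shared.items() if len(signs) >= min_shared}


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    sample = {
        "A": {"officer": ["J. Doe"], "address": ["1 Main St"]},
        "B": {"officer": ["j. doe"], "address": ["1 Main  St "]},
        "C": {"officer": ["J. Doe"], "address": ["9 Other Rd"]},
    }
    for pair, signs in find_candidate_links(sample).items():
        print(pair, sorted(signs))
```

Requiring several coinciding signs reduces false matches from a single common value. Note that the pairwise step grows quadratically with the number of companies sharing one value, so very common values (such as an address used by many registrants) would need separate handling in a real system.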
Aggregation and buffering of input and processed data, used to organize the data flows, are handled by Apache Flume and the Kafka data buffering system. These products have shown their reliability and stability in high-load systems. The data collection tools are developed in Python using open-source libraries. The Django framework is used as the basis for the graphical user interface, which takes the form of a website. All user requests are processed on the server side.

Figure 1. Information processing workflow and tools

The interaction of the software infrastructure components is shown in Figure 1. For data processing, original program code developed during the project is used.

===4. Acknowledgement===
The work of Sergey Belov and Ivan Kadochnikov was partially supported by the Russian Foundation for Basic Research (RFBR), grant 18-07-01359 "Development of information-analytical system of monitoring and analysis of labour market's needs for graduates of Universities on the basis of Big Data analytics".

===5. Conclusion===
Modern banking is impossible without the use of information systems. Banks actively use modern IT systems to solve current problems. Bank controlling is no exception, and advanced IT solutions are needed to implement its functions effectively. This will provide banking supervision with significant coverage of credit institutions of different scale, from non-bank credit institutions and banks with basic licenses to banks with universal licenses.
We believe that the gradual and well-thought-out implementation of all the measures proposed in this work will contribute to the growth of the quality and role of bank controlling and, as a result, will increase the stability and reliability of the entire banking sector.

In the course of the study, taking into account the regulators' requirements, approaches and methods were developed to automate the acquisition of information about non-resident companies operating in the territory of the Russian Federation, as well as to support decisions on counterparty companies. As a result of the work, an information and analytical system was created that makes it possible to:

• Search for information on non-resident companies using open sources in various jurisdictions;

• Identify links between companies based on different information about them;

• Identify the ultimate beneficiaries, if possible.

During the operation of the system, the internal database also grows as data is received from external sources. Besides being valuable in itself, the company information database, compiled from many different sources, provides a unique opportunity to automate the acquisition of new knowledge, such as links between companies registered in different jurisdictions around the world. Based on the developed system, an information service can be deployed that provides a range of services for controlling non-resident companies and for decision-making on specific companies in line with the key requirements of regulators.
The software infrastructure of the system, built on industrial-grade open-source products used in various sectors of the economy, makes it possible both to expand the functionality of the system and to scale it to large volumes of processed data and user requests.

===References===
[1] Foreign Account Tax Compliance Act (FATCA). Available at: https://www.irs.gov/businesses/corporations/foreign-account-tax-compliance-act-fatca

[2] World Bank, number of companies worldwide. Available at: https://data.worldbank.org/indicator/CM.MKT.LDOM.NO

[3] OpenCorporates database. Available at: https://opencorporates.com

[4] Global Legal Entity Identifier Foundation (GLEIF). Available at: https://www.gleif.org/

[5] Neo4j graph database. Available at: https://neo4j.com/

[6] M. Zaharia et al., Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012. April 2012.

[7] M. Armbrust et al., Spark SQL: Relational Data Processing in Spark. SIGMOD 2015. June 2015.