=Paper=
{{Paper
|id=Vol-2029/paper20
|storemode=property
|title=Crecemas: A Transactional Data-based Big Data Solution to Support a Bank's Corporate Clients with their Commercial Decisions
|pdfUrl=https://ceur-ws.org/Vol-2029/paper20.pdf
|volume=Vol-2029
|authors=Antonio Martín Cachuán Alipázaga,Willy Alexis Peña Vilca
|dblpUrl=https://dblp.org/rec/conf/simbig/AlipazagaV17
}}
==Crecemas: A Transactional Data-based Big Data Solution to Support a Bank's Corporate Clients with their Commercial Decisions==
Crecemas: A Transactional Data-based Big Data Solution to Support a Bank's Corporate Clients with their Commercial Decisions

Antonio Martín Cachuán Alipázaga, Willy Alexis Peña Vilca
Big Data Center of Excellence, Banco de Crédito del Perú, Lima, Perú
{acahuan,wpena}@bcp.com.pe

Abstract

Peru is recognized within Latin America as a country of entrepreneurs, so enterprises that have grown rapidly are commonly found in the market; however, a great number of them cease operations within a short time because they lack the capabilities to use data to better understand their clients and their competition. As a way to help these businesses become sustainable over time, new methods exist to process and analyse the data that clients generate through the transactions they make with their credit or debit cards. This information is usually managed by banks and financial services companies which, with the help of technologies such as Big Data and Cloud Computing that banks already use on a daily basis, can help their corporate clients achieve their goals by providing aggregated information through analytic indicators about their clients and competition.

In this article, we present the Big Data architecture of the web platform "Crecemas", which was developed in 16 weeks under agile methodologies with the Scrum framework and using Cloud Computing technologies. The platform displays KPIs computed from anonymized transactions about the company, its clients and its competitors, which is helpful for supporting commercial decisions. Currently the platform handles 200 GB of information with 7 worker nodes and 3 master nodes and is used by almost 500 different companies in Peru.

1 Introduction

The search for dynamism and job creation in a country depends mainly on supporting new entrepreneurs so that they contribute directly to innovation (Harman Andrea, 2012); all of this improves the country's economy by bringing welfare to its inhabitants. Latin America and the Caribbean are characterized by entrepreneurship, and Peru is no exception. The country occupies eighth place in a group of 60 economies according to the Global Entrepreneurship Monitor, but it also has the highest failure rate (Donna Kelley and Slavica Singer, 2016). One reason for this result is the low index of strategic alliances (Global Innovation Index, 2015), which means that large companies are not actively seeking to do business with smaller ones; this in turn affects the competitiveness of the country, which is ranked 69th out of 140 countries (Donna Kelley and Slavica Singer, 2016). In addition, most small companies do not take advantage of the information they generate on a daily basis, because they do not have access to all the data they need or do not have the capabilities inside the company to derive strategic insights (Brynjolfsson et al., 2011).

Figure 1: Number of credit and debit cards in circulation (2015)

On the other hand, according to the Tecnocom report of 2016 (Tecnocom, 2016), the annual average number of transactions per debit and credit card is 15 and 5 respectively. This reflects the fact that the number of debit and credit cards in circulation in Peru has been growing steadily over the last five years, as shown in Fig. 1, with growth of 9.6% in debit cards. This translates into an increase in payment card usage. As can be seen in Fig. 2, Peru had a ratio of POS card payments to ATM cash withdrawals of less than 0.3, which means there is great potential for growth.

Figure 2: Ratio of value of POS card transactions to ATM withdrawals (2010-2015)
The digital transformation is a chance for organizations to get better at managing information as a strategic asset, and Big Data is a game changer that adds more sources to the mix. However, unless the lessons of the productivity paradox are applied (E. Brynjolfsson, 1994), these changes will only serve as distractions. Companies that anticipate the changing needs of a volatile marketplace and successfully implement new technologies place themselves in a great position to overcome their competitors (Earley, 2014).

Consequently, there are favorable circumstances for banks and financial services companies interested in digital transformation that are willing to use the information generated by credit and debit card users to help smaller and newer companies grow within the country's economy. To accomplish this, banks can provide aggregated information obtained from sales transactions so that these businesses can make better decisions.

2 Literature Review

2.1 Concepts

2.1.1 Apache Hadoop

Hadoop is open-source software for reliable, scalable and distributed computing. It provides a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models, and it is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The project includes the following modules (The Apache Software Foundation, 2007a):

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

This distributed framework has been adopted by different vendors, such as Cloudera and Hortonworks, who have added important features such as data governance and security compliance, which give this powerful technology an enterprise-ready character. In addition, the community has played an important role; for example, some of the main tools included by the Apache Foundation are the following:

• Apache Hive: data warehouse software that facilitates reading, writing and managing large datasets residing in distributed storage using SQL (The Apache Software Foundation, 2011).
• Apache HBase: the Hadoop database, which adds random access and real-time reading and writing on top of Big Data storage (The Apache Software Foundation, 2007b).
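For readers unfamiliar with Hive, the following minimal sketch shows the style of access it enables: plain SQL issued from Java (the language used for this project's back end) over data that physically lives in HDFS. The HiveServer2 endpoint, credentials, table and column names are illustrative assumptions, not details from this paper.

<pre>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Endpoint and database are placeholders.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT card_type, COUNT(*) AS n FROM transactions GROUP BY card_type")) {
            while (rs.next()) {
                // Each row is an aggregate computed over files stored in HDFS.
                System.out.println(rs.getString("card_type") + "\t" + rs.getLong("n"));
            }
        }
    }
}
</pre>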
2.1.2 Cloud Services

Cloud services are applications or services offered by means of cloud computing. Nowadays, nearly all large software companies, such as Google, Microsoft and Amazon, provide this kind of service. Cloud computing has revolutionized the standard model of service provisioning, allowing delivery over the Internet of virtualized services that can scale up and down in terms of processing power and storage. It also provides strong storage, computation, and distributed-processing capabilities to support Big Data workloads; achieving the full potential of Big Data requires adopting both new data analytics algorithms and new approaches to handle the dramatic data growth. As a result, one of the underlying advantages of deploying services on the cloud is the economy of scale: by using cloud infrastructure, a service provider can offer better, cheaper, and more reliable services. Cloud providers offer the following service schemas (Campbell et al., 2016):

• SaaS: Customers do not have control over the hardware- and software-level configuration of the consumed service.
• PaaS: The platform usually includes frameworks, development and testing tools, configuration management, and abstraction of hardware-level resources.
• IaaS: Customers can hire hardware-level resources.
• DBaaS: Database installation, maintenance and an access interface are provided by a database service provider.

2.1.3 Agile Methods

Agile methods are contemporary software engineering approaches based on teamwork, customer collaboration, iterative development, and constantly changing people, processes and technology. This approach diverges from traditional methods, which are software engineering approaches based on highly structured project plans, exhaustive documentation, and extremely rigid processes designed to minimize change. Agile methods draw on management thought predating the industrial revolution, using craft-industry principles such as artisans creating made-to-order items for individual customers. Traditional methods represent the amalgamation of management thought over the last century and use scientific management principles such as the efficient production of items for mass markets. Agile methods are new product development processes with the ability to bring innovative products to market quickly and inexpensively on complex projects with ill-defined requirements, whereas traditional methods resemble manufacturing processes that can economically and efficiently produce high-quality products on projects with stable and well-defined requirements (Trovati et al., 2016).

2.2 Related Work

Nowadays, data is being generated at an unprecedented scale. Decisions that previously were based on guesswork, or on handcrafted models of reality, can now be made using data-driven mathematical models. However, this increase in the amount of data and the variety of formats has put new challenges on the table; for example, it is more complicated to deal with this kind of data and maintain a high-performance process with the technology that many organizations use on a daily basis. For that reason, Big Data has the potential to revolutionize much more than just research and the batch processing of data: this technology enables analysis of every aspect of mobile services, society, retail, manufacturing, financial services, life sciences and other domains (Jagadish et al., 2014). In addition, to use this kind of technology successfully, organizations need an enterprise infrastructure that can support the initiative and keep the transformation process running efficiently. Since purchasing and deploying equipment quickly is important in order to reduce the delivery time of the solution, Cloud Computing is a revolutionary mechanism that is changing the way enterprises provision hardware and software in an efficient and economical way; the possibility of enabling infrastructure in flexible environments such as the cloud makes this type of technology much more attractive for providing end users with fast and useful results (Philip Chen and Zhang, 2014). With these benefits, Big Data technologies and Cloud Computing are a natural combination with which to start this journey. Also, as mentioned in (Vurukonda and Rao, 2016), it is important to keep in mind that although the cloud is an attractive option, its biggest challenge is the security and regulatory issues around a company's customer data in such environments, so it also carries significant challenges that are currently being worked on (Hashem et al., 2014).
3 Proposed Solution

The solution is structured in three stages, ranging from obtaining the data directly from internal sources and loading it to the cloud, to transforming it in order to calculate the indicators, and ending with their visualization on the web (Fig. 4).

Figure 3: Dashboard from crecemas.com

Figure 4: Proposed solution diagram

The entire solution for Crecemas, http://www.crecemasbcp.com/ (Fig. 3), was developed in 16 weeks, led by a Scrum Master with a total of 13 exclusively dedicated people, each grouped into one of three roles (Table 1):

• Business: Dedicated to engaging internal areas and avoiding possible business stoppers, as well as designing the KPIs.
• Data: Responsible for the Big Data stage.
• Development: In charge of the visualization of the data and the web.

Table 1: Crecemas team
Business: 1 Product Owner, 1 Navigator, 1 Research
Data: 1 Big Data Architect, 2 Data Engineers, 2 Data Experts
Development: 1 Back-end developer, 2 Front-end developers, 1 UI Expert, 1 UX Expert

3.1 Data Ingestion

For this stage, different information extraction processes were built to obtain the information from the Enterprise Data Warehouse and store it on a file server. These processes also perform some filtering of fields and records according to the requirements of the KPI construction. Regarding regulatory constraints, the main objective of these processes was to tokenize sensitive fields that could not be stored in a cloud environment (e.g., clients' names, clients' addresses, card numbers).

In addition, each generated file has a control file used to validate the data upload process; it contains the number of exported records and the date on which the file was processed. For this reason, data ingestion works with two files: one containing only the credit card transaction data (with extension .dat) and another containing only the control data (with extension .ctrl).

The files extracted by these processes are of the following types:

• Daily master tables, which are completely loaded.
• Daily incremental tables, which contain one day of information and are stored in order to keep a history of the data.
• Monthly master tables, which are incrementally loaded.

Figure 5: Solution proposed for data ingestion

As can be seen in Fig. 5, the ingestion process is performed in the following way. All orchestration of the processes in the on-premise environment is handled by the enterprise scheduling tool, which controls and executes the information extraction processes. Once these processes have finished running, a final job executes an AzCopy command (a multi-threaded tool for uploading data to the Microsoft Azure cloud environment), which uploads the aforementioned files from the file server to the Linux servers in the cloud environment that were created to host the Big Data technology. Next, on these servers another job invokes an ad-hoc client that uploads the data from Linux to HDFS and finally loads the Apache Hive tables used for the KPI construction. This client (Apache Hadoop and Apache Hive) was built using the Maven artifacts from Cloudera, the Big Data provider selected for this project.
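The paper does not reproduce the client's code, but the two steps just described can be sketched with the standard Hadoop client API. The following Java fragment is a minimal illustration, assuming a control-file layout in which the first line carries the exported record count; all paths and names are hypothetical.

<pre>
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestionClient {

    // Validates a .dat file against its .ctrl file. Assumed layout: the
    // first .ctrl line holds the exported record count (the paper only says
    // the control file stores the record count and the processing date).
    static boolean isConsistent(String datFile, String ctrlFile) throws IOException {
        long declared = Long.parseLong(
                Files.readAllLines(Paths.get(ctrlFile)).get(0).trim());
        try (Stream<String> lines = Files.lines(Paths.get(datFile))) {
            return lines.count() == declared;
        }
    }

    // Copies a validated file from the local Linux file system into HDFS,
    // using the cluster settings from core-site.xml / hdfs-site.xml.
    static void uploadToHdfs(String localFile, String hdfsDir) throws IOException {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path(localFile), new Path(hdfsDir));
        }
    }

    public static void main(String[] args) throws IOException {
        String dat = "/staging/transactions.dat";   // illustrative paths
        String ctrl = "/staging/transactions.ctrl";
        if (!isConsistent(dat, ctrl)) {
            throw new IllegalStateException("Control file does not match data file");
        }
        uploadToHdfs(dat, "/landing/transactions/");
    }
}
</pre>

In the actual solution this logic runs as separate scheduled jobs after the AzCopy upload; here it is collapsed into one program for brevity.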
3.2 Data Transformation

After the data ingestion process, the data is stored in the Hadoop ecosystem (HDFS) within a Cloudera cluster that uses 7 worker nodes and 3 name nodes (1 main and 2 for backup); its architecture can be observed in Fig. 6.

Figure 6: Data transformation process

3.2.1 Landing Area

This is the initial zone, where the data sits in HDFS exactly as it was loaded by the data ingestion process. The data can be accessed through a database composed of external tables created using HQL (Hive Query Language). First, inconsistencies in the volume of information processed for each table, as well as business-level inconsistencies in the values (birth date, sex, foreign characters and incongruent transactions), are reviewed. Then a data cleaning process runs, which eliminates duplicates and replaces empty values with nulls. Finally, the data is moved to a new HDFS location and a Hive database called Tmp Transformation Area is created.

3.2.2 Transformation Area

This area consists of two databases in Hive. The first one (Tmp Area Transformation) contains the tables generated by the previous component, with updates, additions and deletions applied when necessary. The second database is the result of a transformation process over the first one and consists of a 'Tablon' (a large table with more than 100 columns) that consolidates all the information at the transaction and commerce level, including tokenized customer information (such as sex, age, date, educational level and economic level). This large table is used by the KPI Calculations component.

3.2.3 KPI Calculations

In this area, the 'Tablon' is used to generate 7 tables, each linked to one or more of the KPIs detailed in the Data Visualization part. The tables live in a Hive database, which is connected to NoSQL tables in HBase that are responsible for displaying the reports on the web.

3.2.4 Consumer Tables

There are 7 HBase tables that are consumed from Java applying a Facade design pattern (details in Section 3.3).
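The paper names the Facade pattern over the seven consumer tables but does not show it; the sketch below illustrates what such a facade could look like using the standard HBase client API. The table name, row-key scheme and column family are invented for illustration.

<pre>
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Facade: one entry point that hides connection handling and the seven
// KPI tables behind simple methods for the web back end.
public class KpiFacade implements AutoCloseable {

    private final Connection connection;

    public KpiFacade() throws IOException {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        this.connection = ConnectionFactory.createConnection(conf);
    }

    // Example lookup: a monthly sales KPI for one merchant.
    // Table, row-key scheme and column names are hypothetical.
    public double monthlySales(String merchantId, String month) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("kpi_monthly_sales"))) {
            Get get = new Get(Bytes.toBytes(merchantId + "#" + month));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("kpi"), Bytes.toBytes("total"));
            return value == null ? 0.0 : Bytes.toDouble(value);
        }
    }

    @Override
    public void close() throws IOException {
        connection.close();
    }
}
</pre>

The point of the pattern is that callers see one narrow interface per KPI instead of HBase connections and byte-array handling.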
Today we handle almost 200 GB of historical data. The entire transformation process is executed daily, in 1 hour, over a volume of approximately 12 GB of information (11 GB for master tables such as customers and businesses, and 1 GB for transactions), which accumulates during the month. In addition, every month a process goes through all the stages and takes about 1 hour over a volume of 10 GB of data (mainly other master tables related to business location and local geography). At steady state, daily processing should reach 20 GB and monthly processing 15 GB, which means a total of 600 GB of historical data covering the last 2 years (Table 2).

Table 2: Data size processed
          Daily   Monthly  Historic
Actual    12 GB   10 GB    200 GB
Projected 20 GB   15 GB    600 GB

3.3 Data Visualization

The data visualization solution for this project is a web platform (Fig. 3). This stage is therefore a real-time request processor that queries the Apache HBase database in order to obtain the data and show the results to the end user. The workflow for this stage is shown in Fig. 7, and it is composed of the following components:

Figure 7: Data Visualization Diagram

• External Communication: Services that allow the information in the Big Data environment to be consumed.
• Client Communication: In charge of establishing the remote connection between the web pages and the application back end.
• Graphics: Statistical graphs on the web pages.
• Web Page: Represents all the system's web pages.
• Reporting: Allows access to the system repositories in order to perform analysis and create new reports.

4 Conclusion and Future Works

The proposed solution allows enterprises to enable Big Data capabilities inside the organization, making batch processes more efficient and reducing the solution's time to market. In addition, the use of cloud environments facilitates the adoption of technologies that would otherwise require an intensive infrastructure deployment. Likewise, the agile framework used by the project demonstrates that keeping the client at the center of all decisions and solutions allows a product to be created in a short time and with great value to the clients.

As future work, the processes developed in Apache Hive could be migrated to technologies that improve processing speed. For example, Apache Impala (Kornacker et al., 2015) and Apache Spark (Zaharia et al., 2016) are great options, because these technologies offer a different execution engine that would bring new capabilities to the proposed solution; a sketch of such a migration is shown below. On the other hand, the information that users generate inside the web platform could help to make important improvements in how the company gets to know its clients, in order to offer solutions that help them accomplish their main goals.
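As a rough illustration of the migration suggested above, the following sketch re-expresses one Hive aggregation with Spark's Java API; Spark's Hive support lets existing external tables be queried unchanged. The KPI, table and column names are hypothetical, not taken from the paper.

<pre>
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TablonKpiSpark {
    public static void main(String[] args) {
        // Hive-enabled session so the existing Hive databases stay queryable.
        SparkSession spark = SparkSession.builder()
                .appName("CrecemasKpiMigrationSketch")
                .enableHiveSupport()
                .getOrCreate();

        // Same HQL as before, but executed by Spark's engine instead of
        // MapReduce; names are illustrative placeholders.
        Dataset<Row> monthlySales = spark.sql(
                "SELECT merchant_id, month, SUM(amount) AS total_sales " +
                "FROM tablon GROUP BY merchant_id, month");

        monthlySales.write().mode("overwrite").saveAsTable("kpi_monthly_sales");
        spark.stop();
    }
}
</pre>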
References

Erik Brynjolfsson. 1994. The Productivity Paradox of Information Technology: Review and Assessment. Center for Coordination Science. http://ccs.mit.edu/papers/CCSWP130/ccswp130.html.

Erik Brynjolfsson, Lorin M. Hitt, and Heekyung Hellen Kim. 2011. Strength in Numbers: How does data-driven decision-making affect firm performance? ICIS 2011 Proceedings, page 18. https://doi.org/10.2139/ssrn.1819486.

Jennifer Campbell, Stan Kurkovsky, Chun Wai Liew, and Anya Tafliovich. 2016. Scrum and Agile Methods in Software Engineering Courses. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE '16), pages 319–320. ACM, New York, NY, USA. https://doi.org/10.1145/2839509.2844664.

Donna Kelley, Slavica Singer, and Mike Herrington. 2016. Global Entrepreneurship Monitor. http://www.gemconsortium.org/report/49480.

S. Earley. 2014. The digital transformation: Staying competitive. IT Professional 16(2):58–60. https://doi.org/10.1109/MITP.2014.24.

Harman Andrea. 2012. Un estudio de los factores de éxito y fracaso en emprendedores de un programa de incubación de empresas: Caso del proyecto RAMP Perú. Master's thesis, Pontificia Universidad Católica del Perú.

Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. 2014. The rise of Big Data on cloud computing: Review and open research issues. Information Systems 47:98–115. https://doi.org/10.1016/j.is.2014.07.006.

H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. 2014. Big data and its technical challenges. Communications of the ACM 57(7):86–94. https://doi.org/10.1145/2611567.

Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR. http://impala.io/.

C. L. Philip Chen and Chun Yang Zhang. 2014. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275:314–347. https://doi.org/10.1016/j.ins.2014.01.015.

Tecnocom. 2016. Informe Tecnocom: Tendencias en Medios de Pago. https://goo.gl/95th2L.

The Apache Software Foundation. 2007a. Apache Hadoop. http://hadoop.apache.org.

The Apache Software Foundation. 2007b. Apache HBase. https://hbase.apache.org.

The Apache Software Foundation. 2011. Apache Hive. https://hive.apache.org.

Marcello Trovati, Richard Hill, Ashiq Anjum, Shao Ying Zhu, and Lu Liu. 2016. Big-Data Analytics and Cloud Computing: Theory, Algorithms and Applications. Springer.

Naresh Vurukonda and B. Thirumala Rao. 2016. A Study on Data Storage Security Issues in Cloud Computing. Procedia Computer Science 92:128–135. https://doi.org/10.1016/j.procs.2016.07.335.

Matei Zaharia, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, and Shivaram Venkataraman. 2016. Apache Spark: a unified engine for big data processing. Communications of the ACM 59(11):56–65. https://doi.org/10.1145/2934664.