=Paper=
{{Paper
|id=Vol-2029/paper20
|storemode=property
|title=Crecemas: A Transactional Data-based Big Data Solution to Support a Bank's Corporate Clients with their Commercial Decisions
|pdfUrl=https://ceur-ws.org/Vol-2029/paper20.pdf
|volume=Vol-2029
|authors=Antonio Martín Cachuán Alipázaga,Willy Alexis Peña Vilca
|dblpUrl=https://dblp.org/rec/conf/simbig/AlipazagaV17
}}
==Crecemas: A Transactional Data-based Big Data Solution to Support a Bank's Corporate Clients with their Commercial Decisions==
Crecemas: A Transactional Data-based Big Data Solution to Support a Bank's Corporate Clients with their Commercial Decisions

Antonio Martín Cachuán Alipázaga, Willy Alexis Peña Vilca
Big Data Center of Excellence, Banco de Crédito del Perú, Lima, Perú
{acahuan,wpena}@bcp.com.pe

Abstract

Peru is recognized within Latin America as a country of entrepreneurs, so enterprises that have grown rapidly are commonly found in the market; however, a great number of them cease operations within a short time because they lack the capabilities to use data to better understand their clients and their competition. As a way to help these businesses become sustainable over time, new methods exist to process and analyse the data that clients generate through the transactions they make with their credit or debit cards. This information is usually managed by banks and financial services companies which, with the help of technologies such as Big Data and Cloud Computing that banks already use on a daily basis, can help their corporate clients achieve their goals by providing aggregated information through analytic indicators about their clients and competition.

In this article, we present the Big Data architecture of the web platform "Crecemas", which was developed in 16 weeks under agile methodologies with the Scrum framework and using Cloud Computing technologies. The platform displays KPIs computed from anonymized transactions about the company, its clients and its competitors, which is helpful for supporting commercial decisions. Currently the platform handles 200 GB of information with 7 worker nodes and 3 master nodes and is used by almost 500 different companies in Peru.

1 Introduction

The search for dynamism and job creation in a country depends mainly on supporting new entrepreneurs so that they contribute directly to innovation (Harman Andrea, 2012); all of this improves the country's economy by bringing welfare to its inhabitants. Latin America and the Caribbean are characterized by entrepreneurship, and Peru is no exception. The country occupies eighth place in a group of 60 economies according to the Global Entrepreneurship Monitor, but it also has the highest failure rate (Donna Kelley and Slavica Singer, 2016). One reason for this result is the low index of strategic alliances (Global Innovation Index, 2015), which means that large companies are not actively seeking to do business with smaller ones; this in turn affects the competitiveness of the country, which is ranked 69th out of 140 countries (Donna Kelley and Slavica Singer, 2016). In addition, most small companies do not take advantage of the information they generate on a daily basis, because they do not have access to all the data they need or do not have the capabilities inside the company to derive strategic insights (Brynjolfsson et al., 2011).

Figure 1: Number of credit and debit cards in circulation (2015)

On the other hand, according to the Tecnocom report of 2016 (Tecnocom, 2016), the annual average number of transactions per debit and credit card is 15 and 5 respectively. This reflects the fact that the number of debit and credit cards in circulation in Peru has been growing steadily over the last five years, as shown in Fig. 1, with growth of 9.6% in debit cards. This translates into an increase in payment card usage. As can be seen in Fig. 2, Peru had a ratio of POS card payments to ATM cash withdrawals of less than 0.3, which means there is great potential for growth.

Figure 2: Ratio of value of POS card transactions to ATM withdrawals (2010-2015)
The digital transformation is a chance for organizations to get better at managing information as a strategic asset, and Big Data is a game changer that adds more sources to the mix. However, unless the lessons of the productivity paradox are applied (E. Brynjolfsson, 1994), these changes will only serve as distractions. Companies that anticipate the changing needs of a volatile marketplace and successfully implement new technologies place themselves in a great position to overcome their competitors (Earley, 2014).

Consequently, there are favorable circumstances for banks and financial services companies interested in digital transformation that are willing to use the information generated by credit and debit card users to help smaller and newer companies grow within the country's economy. To accomplish this, banks can provide aggregated information obtained from sales transactions so that these businesses can make better decisions.

2 Literature Review

2.1 Concepts

2.1.1 Apache Hadoop

Hadoop is open-source software for reliable, scalable and distributed computing. It provides a framework that allows the distributed processing of large data sets across clusters of computers using simple programming models, and it is designed to scale up from single servers to thousands of machines, each offering local computation and storage. The project includes the following modules (The Apache Software Foundation, 2007a):

• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS): A distributed file system that provides high-throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

This distributed framework has been adopted by different vendors, such as Cloudera and Hortonworks, who have added important features such as data governance and security compliance, which give this powerful technology an enterprise-ready character. In addition, the community has played an important role; for example, some of the main tools included by the Apache Foundation are the following:

• Apache Hive: data warehouse software that facilitates reading, writing and managing large datasets residing in distributed storage using SQL (The Apache Software Foundation, 2011).
• Apache HBase: the Hadoop database, which adds random access and real-time reading and writing on top of Big Data storage (The Apache Software Foundation, 2007b).
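For readers unfamiliar with Hive, the following minimal sketch shows the style of access it enables: plain SQL issued from Java (the language used for this project's back end) over data that physically lives in HDFS. The HiveServer2 endpoint, credentials, table and column names are illustrative assumptions, not details from this paper.

<pre>
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (hive-jdbc on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Endpoint and database are placeholders.
        String url = "jdbc:hive2://hive-server:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT card_type, COUNT(*) AS n FROM transactions GROUP BY card_type")) {
            while (rs.next()) {
                // Each row is an aggregate computed over files stored in HDFS.
                System.out.println(rs.getString("card_type") + "\t" + rs.getLong("n"));
            }
        }
    }
}
</pre>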
2.1.2 Cloud Services

Cloud services are applications or services offered by means of cloud computing. Nowadays, nearly all large software companies, such as Google, Microsoft and Amazon, provide this kind of service. Cloud computing has revolutionized the standard model of service provisioning, allowing delivery over the Internet of virtualized services that can scale up and down in terms of processing power and storage. It also provides strong storage, computation, and distributed-processing capabilities to support Big Data workloads; achieving the full potential of Big Data requires adopting both new data analytics algorithms and new approaches to handle the dramatic data growth. As a result, one of the underlying advantages of deploying services on the cloud is the economy of scale: by using cloud infrastructure, a service provider can offer better, cheaper, and more reliable services. Cloud providers offer the following service schemas (Campbell et al., 2016):

• SaaS: Customers do not have control over the hardware- and software-level configuration of the consumed service.
• PaaS: The platform usually includes frameworks, development and testing tools, configuration management, and abstraction of hardware-level resources.
• IaaS: Customers can hire hardware-level resources.
• DBaaS: Database installation, maintenance and an access interface are provided by a database service provider.

2.1.3 Agile Methods

Agile methods are contemporary software engineering approaches based on teamwork, customer collaboration, iterative development, and constantly changing people, processes and technology. This approach diverges from traditional methods, which are software engineering approaches based on highly structured project plans, exhaustive documentation, and extremely rigid processes designed to minimize change. Agile methods draw on management thought predating the industrial revolution, using craft-industry principles such as artisans creating made-to-order items for individual customers. Traditional methods represent the amalgamation of management thought over the last century and use scientific management principles such as the efficient production of items for mass markets. Agile methods are new product development processes with the ability to bring innovative products to market quickly and inexpensively on complex projects with ill-defined requirements, whereas traditional methods resemble manufacturing processes that can economically and efficiently produce high-quality products on projects with stable and well-defined requirements (Trovati et al., 2016).

2.2 Related Work

Nowadays, data is being generated at an unprecedented scale. Decisions that previously were based on guesswork, or on handcrafted models of reality, can now be made using data-driven mathematical models. However, this increase in the amount of data and the variety of formats has put new challenges on the table; for example, it is more complicated to deal with this kind of data and maintain a high-performance process with the technology that many organizations use on a daily basis. For that reason, Big Data has the potential to revolutionize much more than just research and the batch processing of data: this technology enables analysis of every aspect of mobile services, society, retail, manufacturing, financial services, life sciences and other domains (Jagadish et al., 2014). In addition, to use this kind of technology successfully, organizations need an enterprise infrastructure that can support the initiative and keep the transformation process running efficiently. Since purchasing and deploying equipment quickly is important in order to reduce the delivery time of the solution, Cloud Computing is a revolutionary mechanism that is changing the way enterprises provision hardware and software in an efficient and economical way; the possibility of enabling infrastructure in flexible environments such as the cloud makes this type of technology much more attractive for providing end users with fast and useful results (Philip Chen and Zhang, 2014). With these benefits, Big Data technologies and Cloud Computing are a natural combination with which to start this journey. Also, as mentioned in (Vurukonda and Rao, 2016), it is important to keep in mind that although the cloud is an attractive option, its biggest challenge is the security and regulatory issues around a company's customer data in such environments, so it also carries significant challenges that are currently being worked on (Hashem et al., 2014).
3 Proposed Solution

The solution is structured in three stages, ranging from obtaining the data directly from internal sources and loading it to the cloud, to transforming it in order to calculate the indicators, and ending with their visualization on the web (Fig. 4).

Figure 3: Dashboard from crecemas.com

Figure 4: Proposed solution diagram

The entire solution for Crecemas, http://www.crecemasbcp.com/ (Fig. 3), was developed in 16 weeks, led by a Scrum Master with a total of 13 exclusively dedicated people, each grouped into one of three roles (Table 1):

• Business: Dedicated to engaging internal areas and avoiding possible business stoppers, as well as designing the KPIs.
• Data: Responsible for the Big Data stage.
• Development: In charge of the visualization of the data and the web.

Table 1: Crecemas team
Business: 1 Product Owner, 1 Navigator, 1 Research
Data: 1 Big Data Architect, 2 Data Engineers, 2 Data Experts
Development: 1 Back-end developer, 2 Front-end developers, 1 UI Expert, 1 UX Expert

3.1 Data Ingestion

For this stage, different information extraction processes were built to obtain the information from the Enterprise Data Warehouse and store it on a file server. These processes also perform some filtering of fields and records according to the requirements of the KPI construction. Regarding regulatory constraints, the main objective of these processes was to tokenize sensitive fields that could not be stored in a cloud environment (e.g., clients' names, clients' addresses, card numbers).

In addition, each generated file has a control file used to validate the data upload process; it contains the number of exported records and the date on which the file was processed. For this reason, data ingestion works with two files: one containing only the credit card transaction data (with extension .dat) and another containing only the control data (with extension .ctrl).

The files extracted by these processes are of the following types:

• Daily master tables, which are completely loaded.
• Daily incremental tables, which contain one day of information and are stored in order to keep a history of the data.
• Monthly master tables, which are incrementally loaded.

Figure 5: Solution proposed for data ingestion

As can be seen in Fig. 5, the ingestion process is performed in the following way. All orchestration of the processes in the on-premise environment is handled by the enterprise scheduling tool, which controls and executes the information extraction processes. Once these processes have finished running, a final job executes an AzCopy command (a multi-threaded tool for uploading data to the Microsoft Azure cloud environment), which uploads the aforementioned files from the file server to the Linux servers in the cloud environment that were created to host the Big Data technology. Next, on these servers another job invokes an ad-hoc client that uploads the data from Linux to HDFS and finally loads the Apache Hive tables used for the KPI construction. This client (Apache Hadoop and Apache Hive) was built using the Maven artifacts from Cloudera, the Big Data provider selected for this project.
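The paper does not reproduce the client's code, but the two steps just described can be sketched with the standard Hadoop client API. The following Java fragment is a minimal illustration, assuming a control-file layout in which the first line carries the exported record count; all paths and names are hypothetical.

<pre>
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.stream.Stream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestionClient {

    // Validates a .dat file against its .ctrl file. Assumed layout: the
    // first .ctrl line holds the exported record count (the paper only says
    // the control file stores the record count and the processing date).
    static boolean isConsistent(String datFile, String ctrlFile) throws IOException {
        long declared = Long.parseLong(
                Files.readAllLines(Paths.get(ctrlFile)).get(0).trim());
        try (Stream<String> lines = Files.lines(Paths.get(datFile))) {
            return lines.count() == declared;
        }
    }

    // Copies a validated file from the local Linux file system into HDFS,
    // using the cluster settings from core-site.xml / hdfs-site.xml.
    static void uploadToHdfs(String localFile, String hdfsDir) throws IOException {
        Configuration conf = new Configuration();
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path(localFile), new Path(hdfsDir));
        }
    }

    public static void main(String[] args) throws IOException {
        String dat = "/staging/transactions.dat";   // illustrative paths
        String ctrl = "/staging/transactions.ctrl";
        if (!isConsistent(dat, ctrl)) {
            throw new IllegalStateException("Control file does not match data file");
        }
        uploadToHdfs(dat, "/landing/transactions/");
    }
}
</pre>

In the actual solution this logic runs as separate scheduled jobs after the AzCopy upload; here it is collapsed into one program for brevity.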
3.2 Data Transformation

After the data ingestion process, the data is stored in the Hadoop ecosystem (HDFS) within a Cloudera cluster that uses 7 worker nodes and 3 name nodes (1 main and 2 for backup); its architecture can be observed in Fig. 6.

Figure 6: Data transformation process

3.2.1 Landing Area

This is the initial zone, where the data sits in HDFS exactly as it was loaded by the data ingestion process. The data can be accessed through a database composed of external tables created using HQL (Hive Query Language). First, inconsistencies in the volume of information processed for each table, as well as business-level inconsistencies in the values (birth date, sex, foreign characters and incongruent transactions), are reviewed. Then a data cleaning process runs, which eliminates duplicates and replaces empty values with nulls. Finally, the data is moved to a new HDFS location and a Hive database called Tmp Transformation Area is created.

3.2.2 Transformation Area

This area consists of two databases in Hive. The first one (Tmp Area Transformation) contains the tables generated by the previous component, with updates, additions and deletions applied when necessary. The second database is the result of a transformation process over the first one and consists of a 'Tablon' (a large table with more than 100 columns) that consolidates all the information at the transaction and commerce level, including tokenized customer information (such as sex, age, date, educational level and economic level). This large table is used by the KPI Calculations component.

3.2.3 KPI Calculations

In this area, the 'Tablon' is used to generate 7 tables, each linked to one or more of the KPIs detailed in the Data Visualization part. The tables live in a Hive database, which is connected to NoSQL tables in HBase that are responsible for displaying the reports on the web.

3.2.4 Consumer Tables

There are 7 HBase tables that are consumed from Java applying a Facade design pattern (details in Section 3.3).
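The paper names the Facade pattern over the seven consumer tables but does not show it; the sketch below illustrates what such a facade could look like using the standard HBase client API. The table name, row-key scheme and column family are invented for illustration.

<pre>
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Facade: one entry point that hides connection handling and the seven
// KPI tables behind simple methods for the web back end.
public class KpiFacade implements AutoCloseable {

    private final Connection connection;

    public KpiFacade() throws IOException {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        this.connection = ConnectionFactory.createConnection(conf);
    }

    // Example lookup: a monthly sales KPI for one merchant.
    // Table, row-key scheme and column names are hypothetical.
    public double monthlySales(String merchantId, String month) throws IOException {
        try (Table table = connection.getTable(TableName.valueOf("kpi_monthly_sales"))) {
            Get get = new Get(Bytes.toBytes(merchantId + "#" + month));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("kpi"), Bytes.toBytes("total"));
            return value == null ? 0.0 : Bytes.toDouble(value);
        }
    }

    @Override
    public void close() throws IOException {
        connection.close();
    }
}
</pre>

The point of the pattern is that callers see one narrow interface per KPI instead of HBase connections and byte-array handling.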
Today we handle almost 200 GB of historical data. The entire transformation process is executed daily, in 1 hour, over a volume of approximately 12 GB of information (11 GB for master tables such as customers and businesses, and 1 GB for transactions), which accumulates during the month. In addition, every month a process goes through all the stages and takes about 1 hour over a volume of 10 GB of data (mainly other master tables related to business location and local geography). At steady state, daily processing should reach 20 GB and monthly processing 15 GB, which means a total of 600 GB of historical data covering the last 2 years (Table 2).

Table 2: Data size processed
          Daily   Monthly  Historic
Actual    12 GB   10 GB    200 GB
Projected 20 GB   15 GB    600 GB

3.3 Data Visualization

The data visualization solution for this project is a web platform (Fig. 3). This stage is therefore a real-time request processor that queries the Apache HBase database in order to obtain the data and show the results to the end user. The workflow for this stage is shown in Fig. 7, and it is composed of the following components:

Figure 7: Data Visualization Diagram

• External Communication: Services that allow the information in the Big Data environment to be consumed.
• Client Communication: In charge of establishing the remote connection between the web pages and the application back end.
• Graphics: Statistical graphs on the web pages.
• Web Page: Represents all the system's web pages.
• Reporting: Allows access to the system repositories in order to perform analysis and create new reports.

4 Conclusion and Future Works

The proposed solution allows enterprises to enable Big Data capabilities inside the organization, making batch processes more efficient and reducing the solution's time to market. In addition, the use of cloud environments facilitates the adoption of technologies that would otherwise require an intensive infrastructure deployment. Likewise, the agile framework used by the project demonstrates that keeping the client at the center of all decisions and solutions allows a product to be created in a short time and with great value to the clients.

As future work, the processes developed in Apache Hive could be migrated to technologies that improve processing speed. For example, Apache Impala (Kornacker et al., 2015) and Apache Spark (Zaharia et al., 2016) are great options, because these technologies offer a different execution engine that would bring new capabilities to the proposed solution; a sketch of such a migration is shown below. On the other hand, the information that users generate inside the web platform could help to make important improvements in how the company gets to know its clients, in order to offer solutions that help them accomplish their main goals.
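As a rough illustration of the migration suggested above, the following sketch re-expresses one Hive aggregation with Spark's Java API; Spark's Hive support lets existing external tables be queried unchanged. The KPI, table and column names are hypothetical, not taken from the paper.

<pre>
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TablonKpiSpark {
    public static void main(String[] args) {
        // Hive-enabled session so the existing Hive databases stay queryable.
        SparkSession spark = SparkSession.builder()
                .appName("CrecemasKpiMigrationSketch")
                .enableHiveSupport()
                .getOrCreate();

        // Same HQL as before, but executed by Spark's engine instead of
        // MapReduce; names are illustrative placeholders.
        Dataset<Row> monthlySales = spark.sql(
                "SELECT merchant_id, month, SUM(amount) AS total_sales " +
                "FROM tablon GROUP BY merchant_id, month");

        monthlySales.write().mode("overwrite").saveAsTable("kpi_monthly_sales");
        spark.stop();
    }
}
</pre>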
References

Erik Brynjolfsson. 1994. The Productivity Paradox of Information Technology: Review and Assessment. Center for Coordination Science. http://ccs.mit.edu/papers/CCSWP130/ccswp130.html.

Erik Brynjolfsson, Lorin M. Hitt, and Heekyung Hellen Kim. 2011. Strength in Numbers: How does data-driven decision-making affect firm performance? ICIS 2011 Proceedings, page 18. https://doi.org/10.2139/ssrn.1819486.

Jennifer Campbell, Stan Kurkovsky, Chun Wai Liew, and Anya Tafliovich. 2016. Scrum and Agile Methods in Software Engineering Courses. In Proceedings of the 47th ACM Technical Symposium on Computing Science Education (SIGCSE '16), pages 319–320. ACM, New York, NY, USA. https://doi.org/10.1145/2839509.2844664.

Donna Kelley, Slavica Singer, and Mike Herrington. 2016. Global Entrepreneurship Monitor. http://www.gemconsortium.org/report/49480.

S. Earley. 2014. The digital transformation: Staying competitive. IT Professional 16(2):58–60. https://doi.org/10.1109/MITP.2014.24.

Harman Andrea. 2012. Un estudio de los factores de éxito y fracaso en emprendedores de un programa de incubación de empresas: Caso del proyecto RAMP Perú. Master's thesis, Pontificia Universidad Católica del Perú.

Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Nor Badrul Anuar, Salimah Mokhtar, Abdullah Gani, and Samee Ullah Khan. 2014. The rise of Big Data on cloud computing: Review and open research issues. Information Systems 47:98–115. https://doi.org/10.1016/j.is.2014.07.006.

H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. 2014. Big data and its technical challenges. Communications of the ACM 57(7):86–94. https://doi.org/10.1145/2611567.

Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A Modern, Open-Source SQL Engine for Hadoop. In CIDR. http://impala.io/.

C. L. Philip Chen and Chun Yang Zhang. 2014. Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences 275:314–347. https://doi.org/10.1016/j.ins.2014.01.015.

Tecnocom. 2016. Informe Tecnocom: Tendencias en Medios de Pago. https://goo.gl/95th2L.

The Apache Software Foundation. 2007a. Apache Hadoop. http://hadoop.apache.org.

The Apache Software Foundation. 2007b. Apache HBase. https://hbase.apache.org.

The Apache Software Foundation. 2011. Apache Hive. https://hive.apache.org.

Marcello Trovati, Richard Hill, Ashiq Anjum, Shao Ying Zhu, and Lu Liu. 2016. Big-Data Analytics and Cloud Computing: Theory, Algorithms and Applications. Springer.

Naresh Vurukonda and B. Thirumala Rao. 2016. A Study on Data Storage Security Issues in Cloud Computing. Procedia Computer Science 92:128–135. https://doi.org/10.1016/j.procs.2016.07.335.

Matei Zaharia, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, Ion Stoica, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, and Shivaram Venkataraman. 2016. Apache Spark: a unified engine for big data processing. Communications of the ACM 59(11):56–65. https://doi.org/10.1145/2934664.