Large-Scale Transformer models for Transactional Data

Fabrizio Garuti 1,2, Simone Luetto 3, Enver Sangineto 2 and Rita Cucchiara 2,4
1 Prometeia Associazione, Bologna, Italy
2 AImageLab, UNIMORE, Modena, Italy
3 Prometeia SpA, Bologna, Italy
4 IIT-CNR, Italy

Abstract
Following the spread of digital channels for everyday activities and electronic payments, huge collections of online transactions are available from financial institutions. These transactions are usually organized as time series, i.e., time-dependent sequences of tabular data, where each element of the series is a collection of heterogeneous fields (e.g., dates, amounts, categories, etc.). Transactions are usually evaluated by automated or semi-automated procedures to address financial tasks and gain insights into customers' behavior. In recent years, many tree-based Machine Learning methods (e.g., RandomForest, XGBoost) have been proposed for financial tasks, but they neither fully exploit, in an end-to-end pipeline, the information richness of individual transactions, nor fully model the underlying temporal patterns. Deep Learning approaches, instead, have proven to be very effective in modeling complex data by representing them in a semantic latent space. In this paper, inspired by the multi-modal Deep Learning approaches used in Computer Vision and NLP, we propose UniTTab, an end-to-end Deep Learning Transformer model for transactional time series which can uniformly represent heterogeneous time-dependent data in a single embedding. Given the availability of large sets of tabular transactions, UniTTab defines a self-supervised pre-training phase to learn useful representations which can be employed to solve financial tasks such as churn prediction and loan default prediction. A strength of UniTTab is its flexibility, since it can be adopted to represent time series of arbitrary length and composed of different data types in the fields.
The flexibility of our model in solving different types of tasks (e.g., detection, classification, regression) and the possibility of varying the length of the input time series, from a few to hundreds of transactions, make UniTTab a general-purpose Transformer architecture for bank transactions.

Keywords
Deep Learning, Large Scale Model, Representation Learning, Time series prediction, Transactional data

Ital-IA 2024: 4th National Conference on Artificial Intelligence, organized by CINI, May 29-30, 2024, Naples, Italy
fabrizio.garuti@prometeia.com (F. Garuti); simone.luetto@prometeia.com (S. Luetto)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Transactional data are time-dependent collections of financial transactions. For instance, a bank account can be seen as a time series of transactions, each composed of a tabular data entry with fields specifying the transaction amount, the transaction operation and the receiver type (see Figure 1). These data can be used as training data for different Machine Learning approaches, in a variety of tasks. Some examples are:

• Customer Value Management, to support marketing or commercial actions, for instance via the creation of tailored offers;
• Credit and Liquidity Risk, to assess the initial risk and to detect early risk signals, for instance represented by changes in expense patterns or in the regularity of incomes;
• Fraud Detection and Anti-Money Laundering, to identify potentially malicious behaviors.

Figure 1: Some samples of transactions from the Transaction dataset used for fraud detection.

For these purposes, transactional data have so far been processed with symbolic AI (e.g., rule-based expert systems) and/or with ensembles of tree-based machine learning models. RandomForest [1], LightGBM [2], XGBoost [3] and CatBoost [4] are the most frequently used. Despite the success of Deep Learning methods in other application areas (e.g., Natural Language Processing and Computer Vision), tree-based models seemed to outperform deep learning models on most tabular datasets [5]. These datasets are typically composed of tens to hundreds of features and thousands to hundreds of thousands of samples. However, the size of transactional datasets is growing rapidly, now exceeding millions of transactions in some cases. Since the performance of deep learning models improves with dataset size, tree-based models are the best choice only for small and medium-size datasets [6]. In addition, the use of tree-based models for transactional data is limited to constructing simple aggregated features, such as the average spending over recent months or the total income for the past year. This approach has clear limitations in fully harnessing the precise and timely information that transactional data encapsulate.

Recently, various deep learning networks have been developed for heterogeneous data, mostly for tabular datasets. Lyu et al. [7] combine different modules and can represent both numerical and categorical features. Borisov et al. [8] use a distillation approach to map decision trees, trained on heterogeneous tabular data, onto homogeneous vectors. Huang et al. [9] represent categorical features using an attribute-specific embedding which is used as a prefix, concatenated with the actual field value. Schäfl et al. [10] use a non-parametric representation of the training data, which is reminiscent of the use of external networks in Transformers [11]. However, these approaches do not model the temporal dynamics: each row in the table is an individual sample. A trivial solution could be to concatenate multiple rows into single samples, but none of the previous works has demonstrated that their architecture can model more than hundreds of fields as an individual sample.

To overcome these problems, we propose a custom Deep Learning architecture based on modern Transformers [12], which can uniformly represent heterogeneous time-dependent data, and which is trained on a large-scale transactional dataset. We call our model UniTTab (Unified Transformer for Time-Dependent Heterogeneous Tabular Data), and we show that it consistently outperforms state-of-the-art approaches based on both Deep Learning and standard Machine Learning techniques.

2. Related works

The heterogeneous nature of transactional data and the lack of large public annotated datasets, due to privacy and commercial reasons, make these data extremely difficult to handle with deep neural networks. However, in recent years some works have started to address these challenges. For instance, Padhi et al. [13] proposed one of the first deep learning architectures for heterogeneous time series (TabBERT). As a solution to data heterogeneity, the authors quantize continuous attributes so that each field is defined on its own finite vocabulary. Another recent work is TabAConvBERT, proposed by Shankaranarayana & Runje [14]. They present an architecture that can deal with both categorical inputs (by using an embedding neural network) and numerical inputs (by using a shallow neural network). The architecture presented by Huang et al. [9] provides a solution to data heterogeneity, but it cannot handle the temporal component of the data and is therefore unable to solve tasks involving transaction sequences. A different line of work involves directly or indirectly using natural language-based "interfaces" between the tabular data and a Transformer. For instance, in LUNA [15], numerical values are represented as an atomic natural language string.

Our proposal differs from the aforementioned works in different aspects. On the one hand, we deal with all the variability dimensions of the problem: numerical, categorical and temporal. On the other hand, we train our model using arbitrary-length and long-range time series, which can include up to 150 transactions per sample. As a result, it is possible to deal with transactional tasks that require learning the long-term dependencies of the data. Furthermore, the learning phase is enriched with new custom masking techniques, which allow all related fields to be masked simultaneously, making the initial general-purpose training more challenging for our model.

3. Method

3.1. Pre-training and Fine-tuning

The current great availability of data, together with the advancement of AI research, has opened the possibility of developing larger models with more general purposes. This is already a reality in almost every application regarding unstructured data, like text, images and video. Many general-purpose models have gained fame in recent years: GPT-3 [16], BERT [17], CLIP [18], DALL-E [19]. All these models have been pre-trained on large datasets with a self-supervised approach. The goal of the pre-training phase is to learn a good representation of the input data. As a result, models trained within this scheme show great generalization capabilities even without further training, like the text generation of GPT models. These capabilities can further improve after a second training phase over a specific task, called fine-tuning; an example is ChatGPT's stunning ability to converse. Taking inspiration from these models, we use a large dataset of transactions to pre-train a Transformer network using self-supervised learning, and then we use a (smaller) labeled dataset to fine-tune the network for a specific task.
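The masked-field pretext task used in this pre-training phase (detailed in the rest of this section) can be sketched as follows. The masking probabilities and the joint masking of timestamp fields follow the paper; the dictionary-based data layout and the field names are illustrative assumptions, not the actual UniTTab implementation.

```python
import random

MASK = "[MASK]"
TIME_FIELDS = {"date", "time"}  # assumed names of the timestamp-related fields

def mask_transaction(txn, p_field=0.15, p_all=0.1, rng=random):
    """Return a masked copy of one transaction (a dict mapping field -> value).

    Following the masking scheme described in the text:
    - with probability p_all, every field of the transaction is masked;
    - otherwise, each field is masked independently with probability p_field,
      except the timestamp fields, which are always masked (or kept) jointly.
    """
    if rng.random() < p_all:
        return {field: MASK for field in txn}
    mask_time = rng.random() < p_field  # one draw shared by all time fields
    masked = {}
    for field, value in txn.items():
        if field in TIME_FIELDS:
            masked[field] = MASK if mask_time else value
        else:
            masked[field] = MASK if rng.random() < p_field else value
    return masked
```

During pre-training, the model is then trained to predict the original values of the masked fields from the visible ones.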
During the pre-training stage, we train our UniTTab model using the Masked Token pretext task [17]. Some of the input features are masked, and the model is trained to predict the masked features as a function of the "visible" ones. This method makes it possible to train a model to automatically learn the semantics and to extract the relevant information contained in the sequence of transactions. Specifically, for an input sequence, we randomly replace a field value with the special symbol [MASK]. We use a standard replacement probability value of 0.15 [17, 13]. Moreover, with probability 0.1, we also mask all the fields in a transaction, while the fields representing the timestamp are always either jointly masked or jointly unmasked. These additional masking strategies, inspired by the block masking of adjacent image patches used in BEiT [20], make the pretext task more challenging for the network.

It is important to remark that, within the pre-training phase, the training of the model relies solely on the input features of the transactions, eliminating the need for labels. This way we can use the entire transactional dataset, even if it is not fully labeled for the specific downstream task. This is the case of the Czech dataset [21] used for loan default prediction, described in Section 4.1.

3.2. The architecture

We develop a custom model called UniTTab, designed to be suited for sequences of heterogeneous transactional data. Borrowing the techniques used in text analysis by BERT or GPT models, we use input time series of variable length. We vary the sequence length from 10 to 150 transactions, where each transaction is composed of a fixed number of 10 or 6 fields. As a result, each time series can vary in length from 100 to 900 items, a challenging length to manage even for text sentences.

Given the data structure, we propose the hierarchical architecture shown in Figure 2. The architecture is composed of two different Transformers, and it is trained end-to-end. The first Transformer ("Field Transformer") takes as input the k features describing a single transaction, such as the transaction amount, the merchant information, and the transaction date and time. The features are transformed into k final embeddings, which are concatenated in a single embedding vector. This embedding is the representation of a single transaction. A sequence of these embeddings, each representing a transaction, is then fed to the second Transformer ("Sequence Transformer"). The Sequence Transformer models the statistical dependencies between the different transactions in the sequence and outputs embedding elements in the latent space. This is the general-purpose latent space where the representation can potentially be exploited for many tasks, such as classification (e.g., to classify the client behavior), detection (e.g., to detect anomalies or frauds), and prediction (e.g., to predict product churn in the next few months).

Figure 2: A schematic illustration of the UniTTab architecture for financial data.

As depicted in Figure 2, during pre-training, for each masked field we use the corresponding output embedding to predict the field value. Instead, during fine-tuning, we add a class token [CLASS] at the beginning of the transaction embedding sequence, and we use the corresponding output embedding to solve the financial task. This embedding can attend to all transaction embeddings in the sequence, allowing it to exploit all the field information of all transactions.

3.3. Feature representation

Our model can effectively represent heterogeneous data, encoding numerical fields (e.g., the amount), categorical fields (e.g., the type of transaction), and fields with a specific structure (e.g., the date). The most common approach to tackle this challenge is to reduce all the features to a common representation: usually numerical for ensembles of trees, or categorical for deep learning architectures like TabBERT [13]. However, discretizing numerical features into a finite set of values results in a loss of information. For example, it could be important to know whether an amount is precisely 20 euros or 20.50 in order to distinguish between a withdrawal and a grocery expense. For this reason, we develop a custom representation to transform numerical values in the input vector. In particular, we represent each numerical value as a feature vector obtained by concatenating a battery of different frequency functions (depicted with a sine-wave symbol in Figure 2). Similar representations are used in NeRFs [22] for 3D synthesis. Conversely, we adopt a traditional "category encoding" to represent the categorical features, by using simple embedding neural networks (as used in [13]). Finally, we use a custom "time encoding" method for the timestamp attributes. The value of a timestamp is split into a combination of different field values: the year, the month, the day and, if necessary, the hour. Each such value is then represented as a categorical feature (e.g., with 12 elements for the month).
4. Experimental results

The effectiveness of the model has been tested on two large-size datasets of transactions: the PKDD'99 Financial Dataset [21] and our Real Bank Account Transaction Dataset (in short, RBAT Dataset). The first dataset is public and is used as a benchmark for predicting loan default. The second dataset, instead, is private and is used to assess how well our model predicts customer churn in comparison with standard industry models. The chosen experiments are binary classification tasks with a large level of unbalance in the statistics of the two target classes. In all the experiments, the model is first pre-trained using self-supervision (Section 3.1) and then fine-tuned on the classification task, in a standard supervised way, using the labeled data.

4.1. Loan default prediction

Loan default prediction is a classification task defined on the PKDD'99 Financial Dataset, which is a public dataset of real transactions from a Czech bank [21]. This dataset is composed of 1M transactions from 4500 clients. It also includes customer information, but we use only the transactions, each composed of 6 fields (timestamp, amount, type and channel of the transaction). The dataset presents a large fraction of unlabeled data; in fact, most of the accounts do not have any loans, and they cannot be used for the classification task. This is the perfect example of the potential of our model: we perform the pre-training on all the accounts present in the dataset (4500) and then we fine-tune the model only on the labeled ones (478 for training and 204 for test). With such a small number of samples, UniTTab has been able to obtain good results, way higher than ensembles of trees, and the possibility to exploit all the data in pre-training is its main advantage.

To evaluate our model's ability to deal with longer sequences and variable lengths, we test different sequence lengths of transactions. We define a maximum length value t_max (ranging from 50 to 150), and for each account we include the entire sequence of transactions if it is shorter than the maximum. If, instead, the sequence exceeds the maximum length, we only consider the most recent t_max transactions. It is important to note that during pre-training the average sequence length is 232 transactions, whereas during fine-tuning it is 80 transactions. This happens because, for fine-tuning, we only take transactions made before the loan begins. For this reason, if we set t_max to 150, almost all transaction sequences during fine-tuning are of variable length. It is also interesting to observe that increasing the length of the sequence improves the results of the model. This is likely due to the increased information in the input sequence, and it demonstrates that the model is able to deal with long sequences of transactions.

Table 1: Loan default prediction task: average and standard deviation results obtained with 5 random seeds.

t_max | Model              | F1 score       | Average Precision | ROC AUC        | Accuracy
50    | TabBERT [13]       | 0.611 (±0.032) | 0.594 (±0.031)    | 0.827 (±0.048) | 90.7 (±1.6)
50    | LUNA [15]          | 0.604 (±0.048) | 0.613 (±0.048)    | 0.869 (±0.030) | 92.5 (±1.7)
50    | UniTTab (ours)     | 0.619 (±0.011) | 0.574 (±0.017)    | 0.882 (±0.021) | 90.2 (±1.5)
100   | TabBERT [13]       | 0.636 (±0.024) | 0.625 (±0.036)    | 0.874 (±0.019) | 91.6 (±0.9)
100   | LUNA [15]          | 0.624 (±0.075) | 0.601 (±0.018)    | 0.846 (±0.025) | 92.5 (±1.7)
100   | UniTTab (ours)     | 0.654 (±0.032) | 0.653 (±0.033)    | 0.903 (±0.006) | 91.4 (±1.2)
150   | TabBERT [13]       | 0.620 (±0.024) | 0.603 (±0.016)    | 0.857 (±0.026) | 91.6 (±1.1)
150   | LUNA [15]          | 0.637 (±0.043) | 0.589 (±0.017)    | 0.851 (±0.030) | 92.6 (±1.2)
150   | UniTTab (ours)     | 0.673 (±0.038) | 0.690 (±0.030)    | 0.912 (±0.018) | 92.3 (±1.1)
–     | Random Forest [23] | 0.2667         | –                 | 0.6957         | 89.27
–     | XGBoost            | 0.608 (±0.079) | 0.700 (±0.040)    | 0.894 (±0.019) | 92.8 (±1.8)
–     | CatBoost           | 0.527 (±0.065) | 0.617 (±0.079)    | 0.866 (±0.043) | 92.0 (±1.1)

4.2. Effect of Pre-Training

One of the main advantages of Deep Learning methods over traditional Machine Learning approaches is the possibility to pre-train a large network using a large unsupervised dataset, and then fine-tune the same network on the (usually scarcer) available annotated data of a downstream task. In order to quantify the contribution of the pre-training phase, and to show that it is useful even when the unlabeled dataset is not huge, we use the PKDD'99 Financial Dataset and pre-train the models with different portions of the pre-training dataset. Specifically, in Figure 3 we indicate the fraction of the pre-training dataset used for each experiment, where zero corresponds to training the models from scratch directly on the (labeled) downstream task data. The results in the figure show that both Deep Learning methods (i.e., TabBERT and UniTTab) significantly benefit from the pre-training phase, even when using only a small portion of the unlabeled data (e.g., 0.25). Furthermore, when pre-training is performed, our UniTTab obtains a significantly higher F1 score than traditional tree-based models.

Figure 3: Loan default prediction task: impact of different portions of the pre-training dataset.

4.3. Churn prediction: comparison with industry standards

We also compare our model with a custom transactional tree-based pipeline on a churn rate prediction task. The task is defined on the private RBAT dataset, which is provided by an international bank and is composed of several hundred million real transactions of bank customers. The churn prediction task is defined as predicting whether or not a customer churns in the next 3 months, given a 6-month history of transactions performed by that customer. The history sequences provided to the models are of variable length, with an average of 192 transactions and up to a maximum of 500 transactions. Each sequence is associated with a binary target that represents the presence of the churn event for that customer in the following 3 months.
Initially, our model has been pre-trained on a random sample of 1M untargeted accounts, corresponding to approximately 300 million transactions. Then, we evaluate the performance of our model and of the industry standards using fine-tuning datasets of different sizes, ranging from 50K transaction sequences up to 1 million sequences. Figure 4 shows that our UniTTab model significantly outperforms the industry standards for every training dataset size. It also demonstrates the scalability of our model with an increasing number of fine-tuning samples: increasing the number of training accounts yields considerably improved AUC on the churn prediction task.

Figure 4: Customer churn rate prediction task: comparison with industry standard on different portions of the fine-tuning dataset.

5. Conclusion

The UniTTab project presented in this paper is a step towards the creation of general-purpose architectures for bank transactions. The empirical results show that our model drastically outperforms both deep learning and standard machine learning based predictive models on different benchmarks. We believe that our work and our results can stimulate this research field and foster the adoption of self-supervised deep learning on banking data.

References

[1] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[2] G. Ke, et al., LightGBM: A highly efficient gradient boosting decision tree, in: Advances in Neural Information Processing Systems, 2017.
[3] T. Chen, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 785–794.
[4] L. Prokhorenkova, et al., CatBoost: unbiased boosting with categorical features, in: Advances in Neural Information Processing Systems, 2018.
[5] V. Borisov, et al., Deep neural networks and tabular data: A survey, IEEE Transactions on Neural Networks and Learning Systems (2021).
[6] L. Grinsztajn, E. Oyallon, G. Varoquaux, Why do tree-based models still outperform deep learning on tabular data?, 2022. arXiv:2207.08815.
[7] F. Lyu, X. Tang, H. Zhu, H. Guo, Y. Zhang, R. Tang, X. Liu, OptEmbed: learning optimal embedding table for click-through rate prediction, in: M. A. Hasan, L. Xiong (Eds.), Proceedings of the 31st ACM International Conference on Information & Knowledge Management, 2022.
[8] V. Borisov, K. Broelemann, E. Kasneci, G. Kasneci, DeepTLF: robust deep neural networks for heterogeneous tabular data, International Journal of Data Science and Analytics 16 (2023) 85–100.
[9] X. Huang, A. Khetan, M. Cvitkovic, Z. S. Karnin, TabTransformer: Tabular Data Modeling Using Contextual Embeddings, arXiv:2012.06678 (2020).
[10] X. Huang, A. Khetan, M. Cvitkovic, Z. S. Karnin, TabTransformer: Tabular Data Modeling Using Contextual Embeddings, arXiv:2012.06678 (2020).
[11] Y. Wu, M. N. Rabe, D. Hutchins, C. Szegedy, Memorizing transformers, in: ICLR, 2022.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is All you Need, in: NeurIPS, 2017.
[13] I. Padhi, Y. Schiff, I. Melnyk, M. Rigotti, Y. Mroueh, P. L. Dognin, J. Ross, R. Nair, E. Altman, Tabular Transformers for Modeling Multivariate Time Series, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
[14] S. M. Shankaranarayana, D. Runje, Attention Augmented Convolutional Transformer for Tabular Time-series, in: 2021 International Conference on Data Mining (ICDM 2021) - Workshops, 2021.
[15] H. Han, J. Xu, M. Zhou, Y. Shao, S. Han, D. Zhang, LUNA: Language Understanding with Number Augmentations on Transformers via Number Plugins and Pre-training, arXiv:2212.02691 (2022).
[16] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language Models are Few-Shot Learners, arXiv:2005.14165 (2020).
[17] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: NAACL, 2019.
[18] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From Natural Language Supervision, 2021. arXiv:2103.00020.
[19] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, I. Sutskever, Zero-shot text-to-image generation, in: ICML, 2021.
[20] H. Bao, L. Dong, F. Wei, BEiT: BERT pre-training of image transformers, in: ICLR, 2022.
[21] P. Berka, Workshop notes on Discovery Challenge PKDD'99, 1999. URL: https://sorry.vse.cz/~berka/challenge/pkdd1999/berka.htm.
[22] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, R. Ng, NeRF: representing scenes as neural radiance fields for view synthesis, in: ECCV, 2020.
[23] Z. Xu, Loan default prediction with Berka dataset, 2020. URL: https://towardsdatascience.com/loan-default-prediction-an-end-to-end-ml-project-with-real-bank-data-part-1-1405f7aecb9e.