<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Large-Scale Transformer models for Transactional Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabrizio Garuti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Luetto</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enver Sangineto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rita Cucchiara</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AImageLab</institution>
          ,
          <addr-line>UNIMORE, Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>IIT-CNR</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Prometeia Associazione</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Prometeia SpA</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Following the spread of digital channels for everyday activities and electronic payments, huge collections of online transactions are available from financial institutions. These transactions are usually organized as time series, i.e., a time-dependent sequence of tabular data, where each element of the series is a collection of heterogeneous fields (e.g., dates, amounts, categories, etc.). Transactions are usually evaluated by automated or semi-automated procedures to address financial tasks and gain insights into customers' behavior. In the last years, many Trees-based Machine Learning methods (e.g., RandomForest, XGBoost) have been proposed for financial tasks, but they do not fully exploit in an end-to-end pipeline all the information richness of individual transactions, neither they fully model the underling temporal patterns. Instead, Deep Learning approaches have proven to be very efective in modeling complex data by representing them in a semantic latent space. In this paper, inspired by the multi-modal Deep Learning approaches used in Computer Vision and NLP, we propose UniTTab, an end-to-end Deep Learning Transformer model for transactional time series which can uniformly represent heterogeneous time-dependent data in a single embedding. Given the availability of large sets of tabular transactions, UniTTab defines a pre-training self-supervised phase to learn useful representations which can be employed to solve financial tasks such as churn prediction and loan default prediction. A strength of UniTTab is its flexibility since it can be adopted to represent time series of arbitrary length and composed of diferent data types in the fields. The flexibility of our model in solving diferent types of tasks (e.g., detection, classification, regression) and the possibility of varying the length of the input time series, from a few to hundreds of transactions, makes UniTTab a general-purpose Transformer architecture for bank transactions.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Deep Learning</kwd>
        <kwd>Large Scale Model</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>Time series prediction</kwd>
        <kwd>Transactional data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        gorical features using an attribute-specific embedding using natural language-based “interfaces” between the
which is used as a prefix, concatenated with the actual tabular data and a Transformer. For instance, in LUNA
ifeld value. Schäfl et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] use a non-parametric repre- [15], numerical values are represented as an atomic
natsentation of the training data, which reminds the use of ural language string.
external networks in Transformers [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. However, these Our proposal difers from the aforementioned works
approaches do not model the temporal dynamics: each in diferent aspects. On the one hand, we deal with all
row in the table is an individual sample. A trivial solution the variability dimensions of the problem: numerical,
could be to concatenate multiple rows into single sam- categorical and temporal. On the other hand, we train our
ples, but none of the previous work has demonstrated model using arbitrary length and long-range time series,
that their architecture can model more than hundreds of which can include up to 150 transactions per sample. As
ifelds as an individual sample. a result, it is possible to deal with transactional tasks that
      </p>
      <p>
        To overcome these problems, we propose a custom require learning the long-term dependencies of the data.
Deep Learning architecture based on modern Transform- Furthermore, the learning phase is enriched with new
ers [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which can uniformly represent heterogeneous custom masking techniques, which allow all related fields
time-dependent data, and which is trained on a large- to be masked simultaneously, making the initial
generalscale transactional dataset. We call our model UniT- purpose training more challenging for our model.
Tab (Unified Transformer for Time-Dependent
Heterogeneous Tabular Data), and we show that it consistently
outperforms state-of-the-art approaches based on both Deep 3. Method
Learning and standard Machine Learning techniques.
      </p>
      <sec id="sec-1-1">
        <title>3.1. Pre-training and Fine-tuning</title>
        <p>
          2. Related works The current great availability of data together with the
advancement of AI research have opened the possibility
The heterogeneous nature of transactional data and the of developing larger models with more general purposes.
lack of large public annotated datasets, due to privacy This is already a reality in almost every application
reand commercial reasons, make these data extremely dif- garding unstructured data, like text, images and video.
ifcult to be handled by deep neural networks. However, Many general-purpose models have gained fame in
rein recent years some works have started to address these cent years: GPT-3 [16], BERT [17], CLIP [18], DALL-E
challenges. For instance, Padhi et al. [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] proposed one [19]. All these models have been pre-trained using large
of the first deep learning architectures for heterogeneous datasets jointly with a self-supervised approach.
time series (TabBERT). As a solution to data heterogene- The goal of the pre-training phase is to learn a good
ity, the authors quantize continuous attributes so that representation of the input data. As a result, models
each field is defined on its finite vocabulary. trained within this scheme show great generalization
        </p>
        <p>
          Another recent work is TabAConvBERT proposed by capabilities even without further training, like the
genShankaranarayana &amp; Runje [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. They present an archi- eration of text of GPT models. These capabilities can
tecture that can deal with both categorical inputs (by further improve after a second training phase, called
using an embedding neural network) and numerical in- fine-tuning over a specific task, for example, ChatGPT’s
puts (by using a shallow neural network). stunning ability to conversate. Taking inspiration from
        </p>
        <p>
          The architecture presented by X. Huang et. al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] these models, we use a large dataset of transactions to
provides a solution to data heterogeneity, but it cannot pre-train a Transformer network using self-supervised
handle the temporal component of the data and therefore learning, and then we use a (smaller) labeled dataset to
is unable to solve tasks involving transaction sequences. fine-tune the network for a specific task.
        </p>
        <p>A diferent line of work involves directly or indirectly During the pre-training stage, we train our UniTTab
model using the Masked Token pretext task [17]. Some
UniTTab pre-training</p>
        <p>UniTTab fine-tuning
...</p>
        <p>...</p>
        <p>Sequence Transformer
...
...</p>
        <p>MLP
[CLASS]</p>
        <p>...</p>
        <p>Sequence Transformer
...
...</p>
        <p>Field Transformer</p>
        <p>Field Transformer</p>
        <p>Field Transformer</p>
        <p>
          Field Transformer
enTciomdeing eCnacteogdoinrgy enTciomdeing eCnacteogdoinrgy
Date Time Amount [MASK] ... [MASK] [MASK] Amount Merchant
enTciomdeing eCnacteogdoinrgy enTciomdeing eCnacteogdoinrgy
Date Time Amount Merchant ... Date Time Amount Merchant
of the input features are masked to the model, and the in  final embeddings, which are concatenated in a single
model is trained to predict the masked features as a func- embedding vector. This embedding is the representation
tion of the “visible” ones. This method makes it possi- of a single transaction. Then a sequence of these
emble to train a model to automatically learn the seman- beddings, each representing a transaction, is fed to the
tics and to extract relevant information contained in the second Transformer (“Sequence Transformer”). The
sequence of transactions. Specifically, for an input se- Sequence Transformer models the statistic dependencies
quence, we randomly replace a field value with the special between diferent transactions in the sequence and
outsymbol [MASK]. We use a standard replacement proba- puts embedding elements in the latent space. This is the
bility value of 0.15 [
          <xref ref-type="bibr" rid="ref13">17, 13</xref>
          ]. Moreover, with probability general-purpose latent space where the representation
0.1, we also mask all the fields in a transaction, while, for can be potentially exploited for many tasks, such as
clasthe fields representing the time stamp, they are always sification (e.g., to classify the client behavior), detection
either jointly masked or jointly unmasked. These addi- (e.g., to detect anomalies, frauds), and prediction (e.g., to
tional masking strategies, inspired by the block masking predict product churn in next few months).
of adjacent image patches used in BEiT [20], make the As depicted in Figure 2, during pre-training, for each
pretext task more challenging for the network. masked field we use the corresponding output embedding
        </p>
        <p>It is important to remark that, within the pre-training to predict the field value. Instead, during fine-tuning, we
phase, the training of the model solely relies on input add a class token [CLASS] at the beginning of the
transacfeatures of transactions, eliminating the need for labels. tion embedding sequence, and we use the corresponding
This way we can use the entire transactional dataset, output embedding to solve the financial tasks. This
emeven if it is not fully labeled for the specific downstream bedding can attend to all transaction embeddings in the
task. This is the case of the Czech dataset [21] used for sequence, allowing it to exploit all field information of
loan default prediction, described in Section 4.1. all transactions.</p>
      </sec>
      <sec id="sec-1-2">
        <title>3.2. The architecture</title>
      </sec>
      <sec id="sec-1-3">
        <title>3.3. Feature representation</title>
        <p>
          We develop a custom model called UniTTab, designed to Our model can efectively represent heterogeneous data,
be suited for sequences of heterogeneous transactional encoding both numerical fields (e.g., the amount),
catdata. Borrowing the techniques used in text analysis egorical fields (e.g., the type of transaction), and fields
in BERT or GPT models, we use input time series with with a specific structure (e.g., the date).
variable length. We vary the sequence length from 10 The most common approach to tackle this challenge
to 150 transactions, where each transaction is composed is to reduce all the features to a common representation:
of a fixed number of 10 or 6 fields. As a result, each usually numerical for ensemble of trees or categorical for
time series can vary in length from 100 to 900 items, a deep learning architectures like TabBERT [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. However,
challenging length to manage even for text sentences. discretizing numerical features into a finite set of values
        </p>
        <p>Given the data structure, we propose the hierarchical results in a loss of information. For example, it could be
architecture shown in Figure 2. The architecture is com- important to know if an amount is precisely 20 euros or
posed of two diferent Transformers, and it is trained end- 20.50 to distinguish between a withdrawal and a grocery
to-end. The first Transformer (“ Field Transformer”) expense. For this reason we develop a custom
representakes as input the  features describing a single trans- tation to transform numerical values in the input vector.
action, like transaction amount, merchant information, In particular, we represent each numerical value as a
featransaction date and time. The features are transformed ture vector obtained by the concatenation of a battery of</p>
        <sec id="sec-1-3-1">
          <title>The efectiveness of the model has been tested over two</title>
          <p>large size datasets of transactions: the PKDD’99
Financial Dataset [21] and our Real Bank Account Transaction
Dataset (in short, RBAT Dataset). The first dataset is
public and is used as a benchmark for predicting loan
default. Instead, the second dataset is private and is used
to assess how well our model predicts customer churn
in comparison to standard industry models. The
chosen experiments are binary classification tasks with a
large level of unbalance in the statistics of the two target
classes.</p>
          <p>In all the experiments, the model is rfist pre-trained
using self-supervision (Section 3.1) and then fine-tuned
on the classification task, in a standard supervised way,
using the labeled data.</p>
        </sec>
      </sec>
      <sec id="sec-1-4">
        <title>4.1. Loan default prediction</title>
        <p>The loan default prediction is a classification task defined
on the PKDD’99 Financial Dataset, which is a public
dataset of real transactions from a Czech bank [21]. This
dataset is composed of 1M of transactions from 4500
clients. It also includes customer information, but we use
only the transactions, each composed of 6 fields
(timestamp, amount, type and channel of the transaction).</p>
        <p>The dataset presents a large fraction of unlabeled data,
in fact most of the accounts don’t have any loans, and
they cannot be used for the classification task. This is
the perfect example of the potentiality of our model: we
perform the pre-training on all the accounts present in
the dataset (4500) and then we fine-tune the model only
on the labeled ones (478 for training and 204 in test). With
such a small number of samples UniTTab has been able
to obtain good results, way higher than ensemble of trees,
and the possibility to exploit all the data in pre-training
is its main advantage.</p>
        <p>
          To evaluate our model’s ability to deal with longer
sequences and variable lengths, we test diferent
sequence lengths of transactions. We define a maximum
length value  (ranging from 50 to 150), and for each
diferent frequency functions (depicted with a sine wave account we include the entire sequence of transactions if
symbol in Figure 2). Similar representations are used in it is shorter than the maximum. Instead, if the sequence
NeRFs [22] for 3D synthesis. Conversely, we adopt a tra- exceeds the maximum length, we only consider the most
ditional “category encoding” to represent the categorical recent  transactions. It’s important to note that,
features, by using simple embedding neural networks during pre-training the average sequence length is 232
(as used in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]). Finally, we use a custom “time encod- transactions, whereas during fine-tuning the average
seing” method for the timestamp attributes. The value of a quence length is 80 transactions. This happens because,
timestamp is split using a combination of diferent field for fine-tuning, we only take transactions made before
values: the year, the month, the day and, if necessary, the the loan begins. For this reason, if we set  to 150,
hour. Then each such value is represented as a categorical during fine-tuning almost all transaction sequences are
feature (e.g., with 12 elements for the month). of variable length. It’s also interesting to observe that,
increasing the length of the sequence, the result of the
model improves. This is likely due to the information
in4. Experimental results crease in the input sequence, but it demonstrates that the
model is able to deal with long sequences of transactions.
        </p>
      </sec>
      <sec id="sec-1-5">
        <title>4.2. Efect of Pre-Training</title>
        <p>One of the main advantages of using Deep Learning
methods over traditional Machine Learning approaches is the
possibility to pre-train a large network using a large
unsupervised dataset, and then fine-tune the same network
on the (usually scarcer) available annotated data of a
downstream task. In order to quantify the contribution
of the pre-training phase, and to show that this is useful
also when the unlabeled dataset is not huge, we use the
PKDD’99 Financial Dataset, and we pre-train the
models with diferent portions of the pre-training dataset.
Specifically, in Figure 3 we indicate the fraction of the
pre-training dataset used for each experiment, where
zero corresponds to training the models from scratch
directly on the (labeled) downstream task data. The results
in the figure show that both Deep Learning methods (i.e.,
TabBERT and UniTTab) significantly benefit from the
pre-training phase, even using only a small portion of
the unlabeled data (e.g., 0.25). Furthermore, when
pretraining is performed, our UniTTab gets a significantly
higher F1 score than traditional tree-based models.</p>
      </sec>
      <sec id="sec-1-6">
        <title>4.3. Churn prediction: comparison with industry standards</title>
        <sec id="sec-1-6-1">
          <title>We also compare our model with a custom transactional</title>
          <p>tree-based pipeline on a churn rate prediction task. The
task is defined on the private RBAT dataset, which is
provided by an international bank and is composed of several
hundred million real transactions of bank customers.</p>
          <p>The churn prediction task is defined as whether or not
a customer churns in the next 3 months, given a 6-month
history sequence of transactions performed by that
customer. The history sequences provided to the models
are of variable length, with an average of 192
transactions and up to a maximum of 500 transactions. Each
sequence is associated with a binary target that
represents the presence of the churn event for that customer in
50
100
150</p>
          <p>
            TabBERT [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]
LUNA [15]
UniTTab (ours)
TabBERT [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]
LUNA [15]
UniTTab (ours)
TabBERT [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]
LUNA [15]
UniTTab (ours)
Random Forest [23]
XGBoost
CatBoost
          </p>
          <p>F1 score
the following 3 months. Initially, our model has been
pretrained on a random sample of 1M untargeted accounts,
corresponding to approximately 300 million transactions.
Then, we evaluate the performance of our model and
industry standards using fine-tuning datasets of diferent
sizes, ranging from 50K transaction sequences up to 1
million sequences.</p>
          <p>Figure 4 shows that our UniTTab model significantly
outperforms industry standards for every training dataset
size. It also demonstrates the scalability of our model
through an increased number of fine-tuning samples:
increasing the number of training accounts yields
considerably improved AUC on the churn prediction task.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusion</title>
      <p>The UniTTab project presented in this paper is a step
towards the creation of general-purpose architectures for
bank transactions. The empirical results show that our
model drastically outperforms both deep learning and
standard machine learning based predictive models on
diferent benchmarks. We believe that our work and our
results can stimulate this research field and the adoption
of self-supervised deep learning in banking data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Breiman</surname>
          </string-name>
          , Random forests,
          <source>Machine learning 45</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Ke</surname>
          </string-name>
          ,
          <article-title>Lightgbm: A highly eficient gradient boosting decision tree</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. .</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Xgboost: A scalable tree boosting system</article-title>
          ,
          <source>Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining</source>
          (
          <year>2016</year>
          )
          <fpage>785</fpage>
          -
          <lpage>794</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. G.</given-names>
            <surname>Prokhorenkova</surname>
          </string-name>
          ,
          <article-title>Catboost: unbiased boosting with categorical features</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          (
          <year>2018</year>
          ). [15]
          <string-name>
            <given-names>H.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <surname>J</surname>
          </string-name>
          . Xu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          , S. Han,
          <string-name>
            <surname>D</surname>
          </string-name>
          . Zhang,
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>V. e. Borisov,</surname>
          </string-name>
          <article-title>Deep neural networks and tabular LUNA: Language Understanding with Number Augdata: A survey, IEEE transactions on neural net- mentations on Transformers via Number Plugins works and learning systems (</article-title>
          <year>2021</year>
          ).
          <article-title>and Pre-training</article-title>
          ,
          <source>arXiv:2212.02691</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Grinsztajn</surname>
          </string-name>
          , E. Oyallon, G. Varoquaux, Why do [16]
          <string-name>
            <surname>T. B. Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ryder</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Subbiah, tree-based models still outperform deep learning J</article-title>
          . Kaplan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          , P. Shyam,
          <source>on tabular data?</source>
          ,
          <year>2022</year>
          . arXiv:
          <volume>2207</volume>
          .08815. G. Sastry,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tang</surname>
          </string-name>
          , G. Krueger,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , X. Liu, OptEmbed: learning optimal embedding D. M.
          <string-name>
            <surname>Ziegler</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Hesse</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Chen, table for click-through rate prediction</article-title>
          , in: M.
          <string-name>
            <surname>A. E. Sigler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Litwin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Gray</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Chess</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
          </string-name>
          , Hasan, L. Xiong (Eds.),
          <source>Proceedings of the 31st ACM C.</source>
          <string-name>
            <surname>Berner</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>McCandlish</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>I. Sutskever</given-names>
          </string-name>
          , International Conference on Information &amp;
          <string-name>
            <surname>Knowl- D. Amodei</surname>
          </string-name>
          ,
          <source>Language Models are Few-Shot Learnedge Management</source>
          ,
          <year>2022</year>
          . ers, arXiv:
          <year>2005</year>
          .
          <volume>14165</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V.</given-names>
            <surname>Borisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Broelemann</surname>
          </string-name>
          , E. Kasneci, G. Kasneci, [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>DeepTLF: robust deep neural networks for hetero- Pre-training of deep bidirectional transformers for geneous tabular data</article-title>
          ,
          <source>Int. J. Data Sci. Anal</source>
          .
          <volume>16</volume>
          (
          <year>2023</year>
          )
          <article-title>language understanding</article-title>
          , in: NAACL„
          <year>2019</year>
          .
          <fpage>85</fpage>
          -
          <lpage>100</lpage>
          . [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ramesh,
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cvitkovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Karnin</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , P. Mishkin,
          <string-name>
            <surname>TabTransformer: Tabular Data Modeling Using J. Clark</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Krueger</surname>
            ,
            <given-names>I. Sutskever</given-names>
          </string-name>
          , Learning TransContextual Embeddings, arXiv:
          <year>2012</year>
          .
          <volume>06678</volume>
          (
          <year>2020</year>
          ).
          <article-title>ferable Visual Models From Natural Language Su-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Khetan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cvitkovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Karnin</surname>
          </string-name>
          , pervision,
          <year>2021</year>
          . arXiv:
          <volume>2103</volume>
          .00020. Tabtransformer:
          <article-title>Tabular data modeling using con-</article-title>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          , C. Voss, textual embeddings, arXiv:
          <year>2012</year>
          .
          <volume>06678</volume>
          (
          <year>2020</year>
          ). A.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>I. Sutskever</given-names>
          </string-name>
          , Zero-shot text-
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Rabe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hutchins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , Mem- to
          <article-title>-image generation</article-title>
          , in: ICML,
          <year>2021</year>
          . orizing transformers,
          <source>in: ICLR</source>
          ,
          <year>2022</year>
          . [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>BEiT: BERT pre-training</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
          <article-title>of image transformers</article-title>
          ,
          <source>ICLR</source>
          (
          <year>2022</year>
          ). L.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>A. N.</given-names>
          </string-name>
          <string-name>
            <surname>Gomez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Kaiser</surname>
            ,
            <given-names>I. Polosukhin</given-names>
          </string-name>
          , At- [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Berka</surname>
          </string-name>
          , Workshop notes on Discovery Challenge tention is All you Need, in: NeurIPS,
          <year>2017</year>
          . PKDD'
          <volume>99</volume>
          ,
          <year>1999</year>
          . URL: https://sorry.vse.cz/~berka/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>I.</given-names>
            <surname>Padhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Schif</surname>
          </string-name>
          , I. Melnyk,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rigotti</surname>
          </string-name>
          , Y. Mroueh, challenge/pkdd1999/berka.htm. P. L.
          <string-name>
            <surname>Dognin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Nair</surname>
            , E. Altman, Tabu- [22]
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Mildenhall</surname>
            ,
            <given-names>P. P.</given-names>
          </string-name>
          <string-name>
            <surname>Srinivasan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Tancik</surname>
            ,
            <given-names>J. T.</given-names>
          </string-name>
          <string-name>
            <surname>Barlar</surname>
          </string-name>
          <article-title>Transformers for Modeling Multivariate Time ron</article-title>
          , R. Ramamoorthi, R. Ng, NeRF: representing Series, in: IEEE International Conference on Acous-
          <article-title>scenes as neural radiance fields for view synthesis, tics, Speech and Signal Processing</article-title>
          , ICASSP,
          <year>2021</year>
          . in: ECCV,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shankaranarayana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Runje</surname>
          </string-name>
          , Attention Aug- [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>Loan default prediction with Berka mented Convolutional Transformer for Tabular dataset</article-title>
          ,
          <year>2020</year>
          . https://towardsdatascience.com/ Time-series, in: 2021 International Conference on loan
          <article-title>-default-prediction-an-end-to-end-ml-project-</article-title>
          \
          <source>Data Mining, ICDM 2021 - Workshops</source>
          ,
          <year>2021</year>
          .
          <article-title>with-real-bank-data-part-1-1405f7aecb9e.</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>