<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Large-Scale Transformer models for Transactional Data</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Fabrizio</forename><surname>Garuti</surname></persName>
							<email>fabrizio.garuti@prometeia.com</email>
							<affiliation key="aff0">
								<orgName type="institution">Prometeia Associazione</orgName>
								<address>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">AImageLab</orgName>
								<orgName type="institution" key="instit2">UNIMORE</orgName>
								<address>
									<settlement>Modena</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Simone</forename><surname>Luetto</surname></persName>
							<email>simone.luetto@prometeia.com</email>
							<affiliation key="aff2">
								<orgName type="institution">Prometeia SpA</orgName>
								<address>
									<settlement>Bologna</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Enver</forename><surname>Sangineto</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">AImageLab</orgName>
								<orgName type="institution" key="instit2">UNIMORE</orgName>
								<address>
									<settlement>Modena</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Rita</forename><surname>Cucchiara</surname></persName>
							<affiliation key="aff1">
								<orgName type="institution" key="instit1">AImageLab</orgName>
								<orgName type="institution" key="instit2">UNIMORE</orgName>
								<address>
									<settlement>Modena</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff3">
								<orgName type="institution">IIT-CNR</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Large-Scale Transformer models for Transactional Data</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">D094DC0844F3064E3F8AB3236FBA7BE7</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:55+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Deep Learning</term>
					<term>Large Scale Model</term>
					<term>Representation Learning</term>
					<term>Time series prediction</term>
					<term>Transactional data</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Following the spread of digital channels for everyday activities and electronic payments, huge collections of online transactions are available from financial institutions. These transactions are usually organized as time series, i.e., a time-dependent sequence of tabular data, where each element of the series is a collection of heterogeneous fields (e.g., dates, amounts, categories, etc.). Transactions are usually evaluated by automated or semi-automated procedures to address financial tasks and gain insights into customers' behavior. In recent years, many tree-based Machine Learning methods (e.g., RandomForest, XGBoost) have been proposed for financial tasks, but they do not fully exploit, in an end-to-end pipeline, all the information richness of individual transactions, nor do they fully model the underlying temporal patterns. Instead, Deep Learning approaches have proven to be very effective in modeling complex data by representing them in a semantic latent space. In this paper, inspired by the multi-modal Deep Learning approaches used in Computer Vision and NLP, we propose UniTTab, an end-to-end Deep Learning Transformer model for transactional time series which can uniformly represent heterogeneous time-dependent data in a single embedding. Given the availability of large sets of tabular transactions, UniTTab defines a self-supervised pre-training phase to learn useful representations which can be employed to solve financial tasks such as churn prediction and loan default prediction. A strength of UniTTab is its flexibility, since it can represent time series of arbitrary length whose fields are composed of different data types. The flexibility of our model in solving different types of tasks (e.g., detection, classification, regression) and the possibility of varying the length of the input time series, from a few to hundreds of transactions, make UniTTab a general-purpose Transformer architecture for bank transactions.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Transactional data are time-dependent collections of financial transactions. For instance, a bank account can be seen as a time series of transactions, each composed of a tabular data entry with fields specifying the transaction amount, the transaction operation and the receiver type (see Figure <ref type="figure" target="#fig_0">1</ref>). These data can be used as training data for different Machine Learning approaches, in a variety of tasks. Some examples are:</p><p>• Customer Value Management, to support marketing or commercial actions, for instance via the creation of tailored offers;</p><p>• Credit and Liquidity Risk, to assess the initial risk and to detect early risk signals, for instance those represented by changes in expense patterns or in the regularity of incomes;</p><p>• Fraud Detection and Anti-Money Laundering, to identify potentially malicious behaviors.</p><p>For these purposes, transactional data have so far been processed with symbolic AI (e.g., rule-based expert systems) and/or with ensembles of tree-based machine learning models. RandomForest <ref type="bibr" target="#b0">[1]</ref>, LightGBM <ref type="bibr" target="#b1">[2]</ref>, XGBoost <ref type="bibr" target="#b2">[3]</ref> and CatBoost <ref type="bibr" target="#b3">[4]</ref> are the most frequently used. Despite the success of Deep Learning methods in other application areas (e.g., Natural Language Processing and Computer Vision), tree-based models have seemed to outperform deep learning models on most tabular datasets <ref type="bibr" target="#b4">[5]</ref>. These datasets are typically composed of tens to hundreds of features and thousands to hundreds of thousands of samples. However, the size of transactional datasets is growing rapidly, now exceeding millions of transactions in some cases. Since the performance of deep-learning models improves with dataset size, tree-based models are the best choice only for small and medium-size datasets <ref type="bibr" target="#b5">[6]</ref>. In addition, the use of tree-based models for transactional data is limited to constructing simple aggregated features, such as the average spending over recent months or the total income of the past year. This approach has clear limitations in fully harnessing the precise and timely information that transactional data encapsulate.</p><p>Recently, various deep learning networks have been developed for heterogeneous data, mostly for tabular datasets. Lyu et al. <ref type="bibr" target="#b6">[7]</ref> combine different modules to represent both numerical and categorical features. Borisov et al. <ref type="bibr" target="#b7">[8]</ref> use a distillation approach to map decision trees, trained on heterogeneous tabular data, onto homogeneous vectors. Huang et al. <ref type="bibr" target="#b8">[9]</ref> represent categorical features using an attribute-specific embedding which is used as a prefix, concatenated with the actual field value. Schäfl et al. <ref type="bibr" target="#b9">[10]</ref> use a non-parametric representation of the training data, which recalls the use of external memories in Transformers <ref type="bibr" target="#b10">[11]</ref>. However, these approaches do not model the temporal dynamics: each row in the table is an individual sample. A trivial solution could be to concatenate multiple rows into single samples, but no previous work has demonstrated that its architecture can model more than hundreds of fields as an individual sample.</p><p>To overcome these problems, we propose a custom Deep Learning architecture based on modern Transformers <ref type="bibr" target="#b11">[12]</ref>, which can uniformly represent heterogeneous time-dependent data, and which is trained on a large-scale transactional dataset. We call our model UniTTab (Unified Transformer for Time-Dependent Heterogeneous Tabular Data), and we show that it consistently outperforms state-of-the-art approaches based on both Deep Learning and standard Machine Learning techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related works</head><p>The heterogeneous nature of transactional data and the lack of large public annotated datasets, due to privacy and commercial reasons, make these data extremely difficult to handle with deep neural networks. However, in recent years some works have started to address these challenges. For instance, Padhi et al. <ref type="bibr" target="#b12">[13]</ref> proposed one of the first deep learning architectures for heterogeneous time series (TabBERT). As a solution to data heterogeneity, the authors quantize continuous attributes so that each field is defined on its own finite vocabulary.</p><p>Another recent work is TabAConvBERT, proposed by Shankaranarayana &amp; Runje <ref type="bibr" target="#b13">[14]</ref>. They present an architecture that can deal with both categorical inputs (by using an embedding neural network) and numerical inputs (by using a shallow neural network).</p><p>The architecture presented by Huang et al. <ref type="bibr" target="#b8">[9]</ref> provides a solution to data heterogeneity, but it cannot handle the temporal component of the data and is therefore unable to solve tasks involving transaction sequences.</p><p>A different line of work directly or indirectly uses natural language-based "interfaces" between the tabular data and a Transformer. For instance, in LUNA <ref type="bibr" target="#b14">[15]</ref>, numerical values are represented as atomic natural language strings. Our proposal differs from the aforementioned works in several respects. On the one hand, we deal with all the variability dimensions of the problem: numerical, categorical and temporal. On the other hand, we train our model using arbitrary-length, long-range time series, which can include up to 150 transactions per sample. As a result, it is possible to deal with transactional tasks that require learning the long-term dependencies of the data. Furthermore, the learning phase is enriched with new custom masking techniques, which allow all related fields to be masked simultaneously, making the initial general-purpose training more challenging for our model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Pre-training and Fine-tuning</head><p>The current great availability of data, together with the advances of AI research, has opened the possibility of developing larger models with more general purposes. This is already a reality in almost every application involving unstructured data, such as text, images and video. Many general-purpose models have gained fame in recent years: GPT-3 <ref type="bibr" target="#b15">[16]</ref>, BERT <ref type="bibr" target="#b16">[17]</ref>, CLIP <ref type="bibr" target="#b17">[18]</ref>, DALL-E <ref type="bibr" target="#b18">[19]</ref>. All these models have been pre-trained on large datasets using a self-supervised approach.</p><p>The goal of the pre-training phase is to learn a good representation of the input data. As a result, models trained within this scheme show great generalization capabilities even without further training, such as the text generation of GPT models. These capabilities can further improve after a second training phase on a specific task, called fine-tuning; ChatGPT's stunning ability to converse is one example. Taking inspiration from these models, we use a large dataset of transactions to pre-train a Transformer network with self-supervised learning, and then we use a (smaller) labeled dataset to fine-tune the network for a specific task.</p><p>During the pre-training stage, we train our UniTTab model using the Masked Token pretext task <ref type="bibr" target="#b16">[17]</ref>. Some of the input features are hidden from the model, and the model is trained to predict the masked features as a function of the "visible" ones. This makes it possible to train a model to automatically learn the semantics and to extract the relevant information contained in the sequence of transactions. Specifically, for an input sequence, we randomly replace a field value with the special symbol <ref type="bibr">[MASK]</ref>. We use a standard replacement probability of 0.15 <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b12">13]</ref>. Moreover, with probability 0.1, we also mask all the fields in a transaction, while the fields representing the timestamp are always either jointly masked or jointly unmasked. These additional masking strategies, inspired by the block masking of adjacent image patches used in BEiT <ref type="bibr" target="#b19">[20]</ref>, make the pretext task more challenging for the network.</p><p>It is important to remark that, within the pre-training phase, the training of the model relies solely on the input features of the transactions, eliminating the need for labels. This way we can use the entire transactional dataset, even if it is not fully labeled for the specific downstream task. This is the case for the Czech dataset <ref type="bibr" target="#b20">[21]</ref> used for loan default prediction, described in Section 4.1.</p></div>
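As a concrete illustration of the masking strategy above, the following is a minimal PyTorch sketch. The tensor layout, the reserved MASK_ID, and the function name are illustrative assumptions, not the authors' code; only the probabilities (0.15 per field, 0.1 per transaction) and the joint masking of the timestamp fields come from the text.

```python
import torch

MASK_ID = 1  # hypothetical token id reserved for the [MASK] symbol

def mask_sequence(tokens, timestamp_cols, p_field=0.15, p_row=0.1):
    """Masked-token pretext task on a (T, K) tensor of field ids,
    where T is the number of transactions and K the number of fields.

    timestamp_cols lists the columns holding the timestamp parts
    (year, month, day, ...), which are masked jointly or not at all.
    Returns (masked_tokens, labels); labels is -100 on unmasked
    positions so a cross-entropy loss can ignore them.
    """
    T, K = tokens.shape
    # 1) independent field masking with probability p_field
    mask = torch.rand(T, K) < p_field
    # 2) with probability p_row, mask every field of a transaction
    mask |= (torch.rand(T) < p_row).unsqueeze(1)
    # 3) timestamp fields are jointly masked or jointly unmasked:
    #    if any timestamp part was selected, select all of them
    ts_any = mask[:, timestamp_cols].any(dim=1)
    mask[:, timestamp_cols] = ts_any.unsqueeze(1)

    labels = torch.full_like(tokens, -100)
    labels[mask] = tokens[mask]
    masked = tokens.clone()
    masked[mask] = MASK_ID
    return masked, labels
```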
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">The architecture</head><p>We develop a custom model called UniTTab, designed for sequences of heterogeneous transactional data. Borrowing the techniques used for text analysis in BERT or GPT models, we use input time series of variable length. We vary the sequence length from 10 to 150 transactions, where each transaction is composed of a fixed number of 10 or 6 fields. As a result, each time series can vary in length from 100 to 900 items, a challenging length to manage even for text sentences.</p><p>Given the data structure, we propose the hierarchical architecture shown in Figure <ref type="figure" target="#fig_1">2</ref>. The architecture is composed of two different Transformers, and it is trained end-to-end. The first Transformer ("Field Transformer") takes as input the 𝑘 features describing a single transaction, such as the transaction amount, merchant information, and transaction date and time. The features are transformed into 𝑘 final embeddings, which are concatenated in a single embedding vector. This embedding is the representation of a single transaction. Then a sequence of these embeddings, each representing a transaction, is fed to the second Transformer ("Sequence Transformer"). The Sequence Transformer models the statistical dependencies between different transactions in the sequence and outputs embedding elements in the latent space. This is the general-purpose latent space whose representation can potentially be exploited for many tasks, such as classification (e.g., to classify the client behavior), detection (e.g., to detect anomalies or frauds), and prediction (e.g., to predict product churn in the next few months).</p><p>As depicted in Figure <ref type="figure" target="#fig_1">2</ref>, during pre-training, for each masked field we use the corresponding output embedding to predict the field value. Instead, during fine-tuning, we add a class token [CLASS] at the beginning of the transaction embedding sequence, and we use the corresponding output embedding to solve the financial tasks. This embedding can attend to all transaction embeddings in the sequence, allowing it to exploit all field information of all transactions.</p></div>
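The two-level design can be summarized with a short PyTorch sketch. All dimensions, layer counts, and names below are assumptions chosen for illustration; the text specifies only the structure: a Field Transformer over the 𝑘 field embeddings of each transaction, concatenation into a single transaction embedding, and a Sequence Transformer with a prepended [CLASS] token.

```python
import torch
import torch.nn as nn

class UniTTabSketch(nn.Module):
    """Hierarchical Field/Sequence Transformer (illustrative sizes)."""

    def __init__(self, num_fields, field_dim=32, field_layers=2,
                 seq_layers=4, nhead=4):
        super().__init__()
        trans_dim = num_fields * field_dim  # concatenated field embeddings
        self.field_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(field_dim, nhead, batch_first=True),
            field_layers)
        self.sequence_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(trans_dim, nhead, batch_first=True),
            seq_layers)
        self.class_token = nn.Parameter(torch.zeros(1, 1, trans_dim))

    def forward(self, field_embs):
        # field_embs: (B, T, K, field_dim), one embedding per field,
        # produced by the feature encoders of Section 3.3
        B, T, K, D = field_embs.shape
        # Field Transformer attends over the K fields of each transaction
        x = self.field_transformer(field_embs.reshape(B * T, K, D))
        # concatenate the K outputs into one transaction embedding
        x = x.reshape(B, T, K * D)
        # prepend the [CLASS] token used during fine-tuning
        cls = self.class_token.expand(B, -1, -1)
        x = self.sequence_transformer(torch.cat([cls, x], dim=1))
        return x[:, 0], x[:, 1:]  # class embedding, transaction embeddings
```

In this sketch, the per-transaction outputs (second return value) would feed the field-prediction heads during pre-training, while a classification head would be attached to the class embedding during fine-tuning.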
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Feature representation</head><p>Our model can effectively represent heterogeneous data, encoding numerical fields (e.g., the amount), categorical fields (e.g., the type of transaction), and fields with a specific structure (e.g., the date).</p><p>The most common approach to this challenge is to reduce all the features to a common representation: usually numerical for ensembles of trees, or categorical for deep learning architectures like TabBERT <ref type="bibr" target="#b12">[13]</ref>. However, discretizing numerical features into a finite set of values results in a loss of information. For example, it could be important to know whether an amount is precisely 20 euros or 20.50 to distinguish between a withdrawal and a grocery expense. For this reason we develop a custom representation to transform numerical values into input vectors. In particular, we represent each numerical value as a feature vector obtained by concatenating a battery of different frequency functions (depicted with a sine wave symbol in Figure <ref type="figure" target="#fig_1">2</ref>). Similar representations are used in NeRFs <ref type="bibr" target="#b21">[22]</ref> for 3D synthesis. Conversely, we adopt a traditional "category encoding" to represent the categorical features, using simple embedding neural networks (as in <ref type="bibr" target="#b12">[13]</ref>). Finally, we use a custom "time encoding" method for the timestamp attributes. The value of a timestamp is split into a combination of different field values: the year, the month, the day and, if necessary, the hour. Each such value is then represented as a categorical feature (e.g., with 12 elements for the month).</p></div>
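As an illustration of the numerical and time encodings, consider the sketch below. The choice of powers-of-two frequencies and the helper names are assumptions: the text specifies only that each numerical value passes through a battery of frequency functions (as in NeRF positional encodings) and that timestamps are split into categorical parts.

```python
import datetime
import torch

def frequency_encode(x, num_freqs=8):
    """NeRF-style frequency encoding of numerical values.
    x: float tensor of raw values, any shape.
    Returns a tensor with 2 * num_freqs sin/cos channels on a new last axis."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)
    angles = x.unsqueeze(-1) * freqs  # (..., num_freqs)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

def encode_timestamp(ts):
    """Custom 'time encoding': split a timestamp into categorical parts,
    each later embedded like any other categorical field."""
    return {"year": ts.year, "month": ts.month, "day": ts.day, "hour": ts.hour}

# 20.00 and 20.50 euros receive clearly distinct codes, unlike with
# a quantization into coarse bins.
amounts = torch.tensor([20.00, 20.50])
print(frequency_encode(amounts).shape)          # torch.Size([2, 16])
print(encode_timestamp(datetime.datetime(2023, 5, 17, 14)))
```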
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental results</head><p>The effectiveness of the model has been tested on two large datasets of transactions: the PKDD'99 Financial Dataset <ref type="bibr" target="#b20">[21]</ref> and our Real Bank Account Transaction Dataset (in short, RBAT Dataset). The first dataset is public and is used as a benchmark for predicting loan default. The second dataset is private and is used to assess how well our model predicts customer churn in comparison to standard industry models. The chosen experiments are binary classification tasks with a strong imbalance in the statistics of the two target classes.</p><p>In all the experiments, the model is first pre-trained using self-supervision (Section 3.1) and then fine-tuned on the classification task, in a standard supervised way, using the labeled data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Loan default prediction</head><p>Loan default prediction is a classification task defined on the PKDD'99 Financial Dataset, a public dataset of real transactions from a Czech bank <ref type="bibr" target="#b20">[21]</ref>. This dataset is composed of 1M transactions from 4500 clients. It also includes customer information, but we use only the transactions, each composed of 6 fields (timestamp, amount, type and channel of the transaction).</p><p>The dataset presents a large fraction of unlabeled data: in fact, most of the accounts do not have any loans, so they cannot be used for the classification task. This is a perfect example of the potential of our model: we perform the pre-training on all the accounts present in the dataset (4500) and then fine-tune the model only on the labeled ones (478 for training and 204 for testing). With such a small number of samples, UniTTab has been able to obtain good results, far better than ensembles of trees (see Table <ref type="table" target="#tab_0">1</ref>), and the possibility of exploiting all the data in pre-training is its main advantage.</p><p>To evaluate our model's ability to deal with longer sequences and variable lengths, we test different sequence lengths of transactions. We define a maximum length value 𝑡_max (ranging from 50 to 150), and for each account we include the entire sequence of transactions if it is shorter than the maximum. If the sequence exceeds the maximum length, we only consider the most recent 𝑡_max transactions. It is important to note that during pre-training the average sequence length is 232 transactions, whereas during fine-tuning it is 80 transactions. This happens because, for fine-tuning, we only take transactions made before the loan begins. For this reason, if we set 𝑡_max to 150, during fine-tuning almost all transaction sequences are of variable length. It is also interesting to observe that, as the length of the sequence increases, the results of the model improve. This is likely due to the increased information in the input sequence, but it also demonstrates that the model is able to deal with long sequences of transactions.</p></div>
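The sequence-length protocol above reduces to a simple truncation rule; a minimal sketch follows (the function name is illustrative, and transactions are assumed ordered by timestamp):

```python
def truncate_history(transactions, t_max=150):
    """Keep the whole sequence if it has at most t_max transactions,
    otherwise keep only the most recent t_max."""
    return transactions if len(transactions) <= t_max else transactions[-t_max:]
```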
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Effect of Pre-Training</head><p>One of the main advantages of Deep Learning methods over traditional Machine Learning approaches is the possibility of pre-training a large network on a large unsupervised dataset, and then fine-tuning the same network on the (usually scarcer) annotated data available for a downstream task. In order to quantify the contribution of the pre-training phase, and to show that it is useful even when the unlabeled dataset is not huge, we use the PKDD'99 Financial Dataset and pre-train the models with different portions of the pre-training dataset. Specifically, in Figure <ref type="figure" target="#fig_2">3</ref> we indicate the fraction of the pre-training dataset used for each experiment, where zero corresponds to training the models from scratch, directly on the (labeled) downstream task data. The results in the figure show that both Deep Learning methods (i.e., TabBERT and UniTTab) significantly benefit from the pre-training phase, even when using only a small portion of the unlabeled data (e.g., 0.25). Furthermore, when pre-training is performed, our UniTTab achieves a significantly higher F1 score than traditional tree-based models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Churn prediction: comparison with industry standards</head><p>We also compare our model with a custom transactional tree-based pipeline on a churn rate prediction task. The task is defined on the private RBAT dataset, which is provided by an international bank and is composed of several hundred million real transactions of bank customers. The churn prediction task consists in predicting whether or not a customer churns in the next 3 months, given a 6-month history of transactions performed by that customer. The history sequences provided to the models are of variable length, with an average of 192 transactions and up to a maximum of 500 transactions. Each sequence is associated with a binary target that represents the presence of the churn event for that customer in the following 3 months. Initially, our model is pre-trained on a random sample of 1M untargeted accounts, corresponding to approximately 300 million transactions. Then, we evaluate the performance of our model and of the industry standards using fine-tuning datasets of different sizes, ranging from 50K transaction sequences up to 1 million sequences. Figure <ref type="figure" target="#fig_3">4</ref> shows that our UniTTab model significantly outperforms industry standards for every training dataset size. It also demonstrates the scalability of our model with respect to the number of fine-tuning samples: increasing the number of training accounts yields a considerably improved AUC on the churn prediction task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>The UniTTab project presented in this paper is a step towards the creation of general-purpose architectures for bank transactions. The empirical results show that our model drastically outperforms both deep-learning-based and standard machine-learning-based predictive models on different benchmarks. We believe that our work and our results can stimulate this research field and the adoption of self-supervised deep learning on banking data.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Some samples of transactions from the Transaction dataset used for fraud detection.</figDesc><graphic coords="2,162.21,84.19,270.85,74.19" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: A schematic illustration of the UniTTab architecture for financial data.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Loan default prediction task: impact of different portions of the pre-training dataset.</figDesc><graphic coords="5,183.05,282.12,229.17,133.69" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Customer churn rate prediction task: comparison with industry standards on different portions of the fine-tuning dataset.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Loan default prediction task: average and standard deviation results obtained with 5 random seeds.</figDesc><table><row><cell>𝑡_max</cell><cell>Model</cell><cell>F1 score</cell><cell>Average Precision</cell><cell>ROC AUC</cell><cell>Accuracy</cell></row><row><cell></cell><cell>TabBERT [13]</cell><cell>0.611(±0.032)</cell><cell>0.594(±0.031)</cell><cell>0.827(±0.048)</cell><cell>90.7(±1.6)</cell></row><row><cell>50</cell><cell>LUNA [15]</cell><cell>0.604(±0.048)</cell><cell>0.613(±0.048)</cell><cell>0.869(±0.030)</cell><cell>92.5(±1.7)</cell></row><row><cell></cell><cell>UniTTab (ours)</cell><cell>0.619(±0.011)</cell><cell>0.574(±0.017)</cell><cell>0.882(±0.021)</cell><cell>90.2(±1.5)</cell></row><row><cell></cell><cell>TabBERT [13]</cell><cell>0.636(±0.024)</cell><cell>0.625(±0.036)</cell><cell>0.874(±0.019)</cell><cell>91.6(±0.9)</cell></row><row><cell>100</cell><cell>LUNA [15]</cell><cell>0.624(±0.075)</cell><cell>0.601(±0.018)</cell><cell>0.846(±0.025)</cell><cell>92.5(±1.7)</cell></row><row><cell></cell><cell>UniTTab (ours)</cell><cell>0.654(±0.032)</cell><cell>0.653(±0.033)</cell><cell>0.903(±0.006)</cell><cell>91.4(±1.2)</cell></row><row><cell></cell><cell>TabBERT [13]</cell><cell>0.620(±0.024)</cell><cell>0.603(±0.016)</cell><cell>0.857(±0.026)</cell><cell>91.6(±1.1)</cell></row><row><cell>150</cell><cell>LUNA [15]</cell><cell>0.637(±0.043)</cell><cell>0.589(±0.017)</cell><cell>0.851(±0.030)</cell><cell>92.6(±1.2)</cell></row><row><cell></cell><cell>UniTTab (ours)</cell><cell>0.673(±0.038)</cell><cell>0.690(±0.030)</cell><cell>0.912(±0.018)</cell><cell>92.3(±1.1)</cell></row><row><cell></cell><cell>Random Forest [23]</cell><cell>0.2667</cell><cell>-</cell><cell>0.6957</cell><cell>89.27</cell></row><row><cell>-</cell><cell>XGBoost</cell><cell>0.608(±0.079)</cell><cell>0.700(±0.040)</cell><cell>0.894(±0.019)</cell><cell>92.8(±1.8)</cell></row><row><cell></cell><cell>CatBoost</cell><cell>0.527(±0.065)</cell><cell>0.617(±0.079)</cell><cell>0.866(±0.043)</cell><cell>92.0(±1.1)</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Random forests</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine learning</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="5" to="32" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Lightgbm: A highly efficient gradient boosting decision tree</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">M</forename><surname>Ke</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Advances in neural information processing systems</title>
				<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Xgboost: A scalable tree boosting system</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">.</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining</title>
				<meeting>the 22nd acm sigkdd international conference on knowledge discovery and data mining</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="785" to="794" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><forename type="middle">G</forename><surname>Prokhorenkova</surname></persName>
		</author>
		<title level="m">Advances in neural Figure 4: Customer churn rate prediction task: comparison with industry standard on different portions of the fine-tuning dataset</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>information processing systems</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Deep neural networks and tabular data: A survey</title>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">E</forename><surname>Borisov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE transactions on neural networks and learning systems</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Why do tree-based models still outperform deep learning on tabular data?</title>
		<author>
			<persName><forename type="first">L</forename><surname>Grinsztajn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Oyallon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2207.08815</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">OptEmbed: learning optimal embedding table for click-through rate prediction</title>
		<author>
			<persName><forename type="first">F</forename><surname>Lyu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Guo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 31st ACM International Conference on Information &amp; Knowledge Management</title>
				<editor>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Hasan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Xiong</surname></persName>
		</editor>
		<meeting>the 31st ACM International Conference on Information &amp; Knowledge Management</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">DeepTLF: robust deep neural networks for heterogeneous tabular data</title>
		<author>
			<persName><forename type="first">V</forename><surname>Borisov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Broelemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kasneci</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kasneci</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Int. J. Data Sci. Anal</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="page" from="85" to="100" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cvitkovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Karnin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2012.06678</idno>
		<title level="m">TabTransformer: Tabular Data Modeling Using Contextual Embeddings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Khetan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Cvitkovic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Karnin</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2012.06678</idno>
		<title level="m">Tabtransformer: Tabular data modeling using contextual embeddings</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">N</forename><surname>Rabe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hutchins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Szegedy</surname></persName>
		</author>
		<title level="m">Memorizing transformers</title>
				<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Vaswani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shazeer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parmar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Uszkoreit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">N</forename><surname>Gomez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Polosukhin</surname></persName>
		</author>
		<title level="m">Attention is All you Need</title>
				<imprint>
			<publisher>NeurIPS</publisher>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Tabular Transformers for Modeling Multivariate Time Series</title>
		<author>
			<persName><forename type="first">I</forename><surname>Padhi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Schiff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Melnyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Rigotti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Mroueh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">L</forename><surname>Dognin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Nair</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Altman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE International Conference on Acoustics, Speech and Signal Processing</title>
				<imprint>
			<publisher>ICASSP</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Attention Augmented Convolutional Transformer for Tabular Time-series</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Shankaranarayana</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Runje</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2021 International Conference on Data Mining, ICDM 2021 -Workshops</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.02691</idno>
		<title level="m">LUNA: Language Understanding with Number Augmentations on Transformers via Number Plugins and Pre-training</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Herbert-Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Henighan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Child</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Ziegler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Winter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hesse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Sigler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Litwin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chess</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Berner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mccandlish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Amodei</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2005.14165</idno>
		<title level="m">Language Models are Few-Shot Learners</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">NAACL</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hallacy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mishkin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Krueger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2103.00020</idno>
		<title level="m">Learning Transferable Visual Models From Natural Language Supervision</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<title level="m" type="main">Zero-shot textto-image generation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ramesh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Pavlov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Goh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Voss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Radford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2021">2021</date>
			<publisher>ICML</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wei</surname></persName>
		</author>
		<title level="m">BEiT: BERT pre-training of image transformers</title>
				<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Berka</surname></persName>
		</author>
		<ptr target="https://sorry.vse.cz/~berka/challenge/pkdd1999/berka.htm" />
		<title level="m">Workshop notes on Discovery Challenge PKDD&apos;99</title>
				<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">NeRF: representing scenes as neural radiance fields for view synthesis</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mildenhall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">P</forename><surname>Srinivasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tancik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Barron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ramamoorthi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ECCV</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<ptr target="https://towardsdatascience.com/loan-default-prediction-an-end-to-end-ml-project-\with-real-bank-data-part-1-1405f7aecb9e" />
		<title level="m">Loan default prediction with Berka dataset</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
