Deep Learning Models for Estimating Number of Lambda- Term Reduction Steps Oleksandr Deineha, Volodymyr Donets and Grygoriy Zholtkevych V.N. Karazin Kharkiv National University, 4 Svoboda Sqr, Kharkiv, 61022, Ukraine Abstract This research delves into the utilization of execution strategy-dependent program state information for the enhancement of compilers and interpreters across functional and object-oriented programming languages. Focusing on Lambda Calculus as a versatile representation, we employ Machine Learning models to ascertain reduction step counts for both the Leftmost Outermost and Rightmost Innermost strategies. Leveraging Convolution-based, LSTM-based, and Transformer-based models, we estimate reduction steps within a defined range. Employing a one-hot-encoded simplified term representation streamlines input dimensions, though at the expense of variable information. Our experiments reveal that this simplified representation suffices for estimating reduction counts within a significant range. Remarkably, predictions for the Rightmost Innermost strategy outperform those for the Leftmost Innermost strategy by nearly 25%. In conclusion, employing trained models to determine reduction step counts for specific terms shows great promise in automating the search for optimal reduction strategies tailored to individual terms. Our findings underscore the viability of simplified term representations for predicting reduction steps within a specified strategy, albeit with considerations for depth elimination. Experiments highlight the challenge of handling lengthy sequences and complex terms in models. Keywords 1 Pure Lambda Calculus, Deep Learning, Large Language Models, Convolutional NN, LSTM, Transformer, computational optimization. 1. Introduction Information about program execution state depending on execution strategy helps in choosing reasonable, effective, in terms of computational resource amount, and robust execution strategies that would be implemented in improvements for compilers and interpreters, not only for functional but also for object-oriented programming languages. This work is a part of our series of works, where we focus on researching average optimal reduction procedures for Lambda Calculus terms. We used Lambda Calculus as a simple representation for any programming language, which in fact could implement any of them [1]. In this study, we consider using Machine Learning models to determine the number of reduction steps for the Leftmost Outermost (LO) and Rightmost Innermost (RI) strategies. We chose three Machine Learning models (Convolution-based ANN, LSTM-based ANN, and Transformer-based ANN) for the estimation count reduction steps from 0 to 30. As input, we use one-hot-encoded simplified term representation, which allows us to significantly shrink input dimensions into the model while saving terms tree structure, but we lose information about term variables. Losing this type of information introduces some sort of uncertainty to Lambda terms, but we are going to determine how much impact it has. Results of our experiments show that simplified term representation is enough for estimating the number of reductions with sufficient diapason. Interestingly, predicting count reduction steps 1ProfIT AI 2023: 3rd International Workshop of IT-professionals on Artificial Intelligence (ProfIT AI 2023), November 20–22, 2023, Waterloo, Canada oleksandr.deineha@karazin.ua (O. Deineha); v.donets@karazin.ua (V. Donets); g.zholtkevych@karazin.ua (G. Zholkevych) 0000-0001-8024-8812 (O. Deineha); 0000-0002-5963-9998 (V. Donets); 0000-0002-7515-2143 (G. Zholkevych) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings for the RI strategy was almost 25% more precise than predicting count reduction steps for the LO strategy. In addition, we must admit the computational contradiction: is it more effective to just keep running Lambda term reduction in parallel for different strategies, and choose more preferable in some defined sense or convert a Lambda term to the simplified representation and feed to ANNs, and choose the most preferable strategy. 2. Background and related work In previous works, we showed our Pure Lambda Calculus environment [2, 3], which we used for the Lambda Calculus normalization process research. In [3], we focused on investing in a new approach for estimating the reduction step complexity by estimating the computational resources required for this step using the basic parameters that characterize the term tree and its redex. The investigation of Lambda Calculus term reduction has been the main topic of many studies. Some studies have analyzed the influence of reduction strategies on count term reductions and detailed analysis of some standard strategies [4, 5, 6]. In the research [7], the author considers the computational aspects of reduction strategies. However, none of them try to extract some term features that signal the preference for some strategies over others. In addition, promising results were achieved in the case of solving mathematical problems [8], compile optimization, code execution [9], and code compilation optimization [10] with Large Language Models based on Transformer models. Type Inference in the Simply Typed Lambda Calculus with Transformer model was considered [11], although this article was about predicting a term type it still required a deep term understanding by the model, which is a key point in our work. The authors of [11] considered full-term representation via token sequences by encoding special symbols for Applications, Abstractions, and 32 possible term variables. 3. Problem Statement In modern compilers, interpreters implement specific methods for optimizing the code execution process [10]. The problem with this approach is choosing the right optimization method for an infinite number of program variants. We investigated the automatization of the process by choosing the appropriate optimizer. As a popular method of automatization, we stopped on Artificial Neural Networks (ANN) methods, which are de facto automatic statistical collectors [12]. The program is a sequence of keywords, operators, and variables, so we consider state-of-art methods [13, 14, 15] for processing sequences: Convolutional Neural networks, Long-Short Term Memory (LSTM) networks, and Transformer networks. Another problem with testing exactly on any modern programming language is the large number of possible keywords, and as a result, it is difficult to collect or generate training data. Thus, the obvious solution is to use a programming language with simple syntax, such as Lambda Calculus, where only one operator Application exists; however, Lambda Calculus is Turing sufficient [1], which means that it can imitate any computational process available for other programming languages. In Lambda Terms, a choosing rule of a specific application called redex defines a reduction strategy (or, in other words, execution sequence), and as shown in articles [2, 4, 5], can have an impact on count reduction steps. The main advantages of synthetic training data are that it is cheaper to obtain, covers as many data combinations as possible, and does not require special tuning for exact feature extraction. This allows the training of robust and widely tasked models. However, it might not cover real task combinations and introduce unnecessary complexity into models. Combining these, we aim to solve the following problems: 1. Configure the hyperparameters of Machine Learning models and train them for as low as possible error rates for prediction count reduction steps for a specific strategy. 2. Generate a sufficient amount of Lambda terms for training robust models, which can operate with real-term data with the same error rate. 4. Hypothesis We deal with Lambda terms data, and the main issue with them is an infinite number of variables, which can be presented in Lambda terms. A possible solution is to limit the maximum number of variables, such as 64, 128, and 512, to save variable information. On the other hand, using variable information significantly increases the model input size, and as a result, increases the total weight requirements for ANNs, it also leads to the necessity to carefully configure training data in order of proper model coverage. Considering all this we decided to simplify term representation for models in the following way: mark all Abstraction and Atom variables as ‘x’, so we save a term as a tree without any additional information about inner relationships, losing information about detailed redexes representation. Therefore, we hypothesize that simplified term representation is sufficient for estimating count reduction steps for the LO and RI strategies using modern Machine Learning models for sequence processing. 5. Methodology We assume that a simplified term representation is sufficient to estimate the count reduction steps. Our problem here is to show the relationship between the model's inputs and outputs and real values, and it has three solutions: 1. Solving this as a regression problem, we cannot make the model as precise as possible, and we cannot collect terms with too large count reduction steps to save the normal distribution of expected and output data. 2. Solving this as a classification problem allows us to make the model more precise. However, we should set a strict limit on count reduction steps for selected strategies, and it also imposes restrictions on the nature of the terms entering the training data. 3. Solving this as a bin problem allows us to combine a regression problem and a classification problem and obtain some bins for defined term distributions such as 0–4, 5–10, up to infinity. This solution does not guarantee a precise solution, but it gives us the possibility to analyze any kind of term. Taking into account our findings (as illustrated in Figure 5) that the majority of the generated terms are reduced within 0 to 30 steps, we opted to approach the problem as a classification task. In addition, we must admit that the accuracy metric is not suitable for the evaluation number of reduction steps prediction quality, because low accuracy does not mean that the count step error should be high; therefore, we are also considering the following regression metrics [16]: Mean Absolute Error (MAE), which measures the average absolute difference between the actual and predicted reduction steps, and Root Mean Squared Error (RMSE), which is the root of the squared difference between the actual and predicted reduction steps. Building on the hypothesis that terms requiring more reduction steps would exhibit a greater error value, we analyzed the MAE concerning the actual count of reduction steps. 6. Models for sequence processing To process sequential data, we can consider the following methods: 1. Hidden Markov models (HMMs) are a type of probabilistic model that can be used to model sequential data [17]. HMMs are often used for tasks, such as speech recognition and machine translation. Pros: relatively simple to train and deploy; able to learn long-range dependencies in sequential data. Cons: difficult to design for specific tasks; computationally expensive on large datasets. 2. Support vector machines (SVMs): SVMs are a type of machine learning algorithm that can be used for various tasks, including classification and regression [18]. SVMs with Sequence Kernels are suitable for sequence-processing tasks, such as text classification and sentiment analysis. Pros: effective for the variance of sequence processing tasks, including text classification and efficient training and deployment. Cons: difficult to tune for specific tasks; prone to overfitting; computationally expensive to train. 3. Rule-based systems are a type of expert systems that use a set of rules to make decisions [19]. Rule-based systems can be used for a variety of tasks, including sequence processing tasks such as text filtering and machine translation. Pros: easy to interpret and can be very accurate for tasks where the rules can be determined. Cons: difficult and time-consuming to develop and maintain, and scale to large datasets, and difficult to develop systems with good generalization capabilities. 4. Artificial neural networks (ANNs) are powerful tools for processing sequential data such as text, audio, and video. They show a state-of-the-art level of productivity for many sequence-processing tasks [13, 14, 15]. ANNs can learn to extract patterns from sequential data and use these patterns to make predictions or generate new sequences. Pros: very flexible to variety sequences; can learn long-term dependencies in data; able to generalize unseen data. Cons: computationally expensive to train; difficult to interpret; hard to tune hyperparameters for proper generalizing and prevent overfitting. Because many ANN models show state-of-the-art productivity for many sequence processing tasks (especially tasks related to text processing), we decided to consider the three most advanced models for this purpose: 1. Convolutional neural networks (CNNs) are a type of ANN that is particularly well suited for processing sequential data. CNNs use a series of convolutional layers to extract features from data. Convolutional layers work by sliding a filter over the data and calculating the dot product between the filter and the data. This process is repeated multiple times using different filters to extract different types of features from the data [13]. Pros: It is good at extracting spatial features from data, relatively efficient to train and deploy, and highly parallelizable. Cons: difficult to design and tune for specific tasks; prone to overfitting; can lose parallelizability due to the nature of sequence tasks. 2. Long short-term memory (LSTM) networks are another type of ANN that is well suited for processing sequential data. LSTMs have a special architecture that allows them to learn long-term dependencies in data. LSTMs are often used in such a kind of task as machine translation and speech recognition [14]. Pros: very good at learning long-range dependencies in sequential data; relatively robust to noise in the data; shown state-of-the-art effectiveness on a wide variety of text processing tasks. Cons: computationally expensive to train, especially on long sequences; difficult to tune for specific tasks; and cannot be processed in parallel because of the nature of the LSTM node. 3. Transformers are a newer type of ANN that has recently achieved state-of-the-art results on a variety of sequence-processing tasks, including machine translation and text summarization. Transformers use a self-attention mechanism to learn long-term dependencies in data [15]. Pros: achieve state-of-the-art results on a variety of sequence processing tasks, including those related to programming code and logical tasks; can learn long-range dependencies at once owing to the self-attention mechanism; relatively efficient to train and deploy, especially on large datasets. Cons: difficult to tune for specific tasks; hard restrictions on input sequence length. 7. Selected Deep Learning models We decided to compare completely different ANN models and determine whether our hypothesis could be proven using these models. Let us consider the tuning of the three ANN models for this task, its overall architecture, and its hyperparameter configurations. The base principle for building Deep Learning ANN models involves segmenting the ANN architecture into distinct layers: the input layer, which receives the initial data; the feature extraction layers, which are some specific neuron layers that identify patterns and features within the data; the fully-connected layers, which synthesize the extracted features into a form suitable for making predictions or decisions; and finally, the output layer, which presents the final results of the model's computations [12]. Using this principle, all our models contain the required parts and mostly differ in the feature extraction part, which is based on the selected model architecture. By configuring the feature extraction part, one turns the complexity and volume of the feature vector. Similarly, by configuring the fully connected part, we can set both the output type and problem type to solve the model, also this part is important because of its capability to analyze extracted features. Consider our tuned convolutional-based model shown in Fig 1. There are two options for defining the entire model for processing sequences: the first is to use trunking and paddings to set the input sequence length, as shown in our practice, this option struggles due to a variety of term lengths, but trains much faster because data can be split in batches; the second is to use the entire sequence, but in this case, the feature extraction part always gives different output shapes to solve this problem, we use GlobalAveragePooling1D, which is shown in Fig 1. Our convolutional-based model has four Conv1D layers, each of which has a stride value of two, imitating the pooling layer. Thus, the basic idea of this solution is to extract features from N 1-d vectors representing terms, distributing them to a 128-dim feature vector with increasing feature complexity. Subsequently, we applied a 256-dim dense layer for collecting and transforming as much feature information as possible to classify the Softmax layer for 31 units, which represent count reduction steps from 0 to 30. Figure 1: Architecture of the convolutional-based model used to estimate the count term reduction steps for the selected strategy Fig 2 shows the LSTM-based model used in our experiments. As the convolution-based model LSTM-based models for sequence processing have two variants of implementation, we choose the second with allows different sequence lengths, which also decreases the required weights because only the last LSTM layer output is in the fully connected layer. Experiments show that the LSTM layer with 256 nodes is sufficient for extracting and saving required features. Applying the same principles as the convolutional-based model led to the use of the same fully connected part. Figure 2: Architecture of the LSTM-based model used to estimate the count term reduction steps for the selected strategy In Fig 3 you can find a Transformer-based model based on the Bert layer [15], which is an encoder-only or autoencoding model that can process limited sequences only, based on our sequence length distribution and classical Bert model [15], we set the input sequence length limit to 512 tokens. Consider the transformer architecture in Fig 3 in more detail. It has two input vectors: the input vector of IDs that are our trunked or padded input vector and the attention mask, which is simply a boolean vector, which indicates how the input IDs are filled. Our Bert layer consists of 2 hidden layers with 84 units each, the intermediate layer contains 64 units, and this module contains 6 self-attention heads which are kind of independent feature extractors. The output of the Bert layer is a sparse matrix 512 x 84 to prevent overfitting and minimize required computational resources we apply AveragePooling1D with pooling size and strides 50 which gave us a matrix 10 x 84 or after applying Flatten layer gives 840-dim feature vector, and Dropout with rate 0.1. As the fully connected we leave the Dense layer with 31 units. Figure 3: Architecture of the Transformer-based model used to estimate the count term reduction steps for the selected strategy The final model complexities and trainable weights are listed in Table 1. The LSTM-based model has the largest number of weights. The Convolution-based model has evenly distributed weights between the feature extractor and the fully connected parts, which is explained by the necessity of analyzing the collected features and the specificity of the convolutional feature extractor part, which does not require as many weights as fully connected layers. The smallest fully connected part has a transformer-based model owing to the high feature complexity of the Transformer-based feature extractor. Table 1 Comparison of model weights Convolution-based LSTM-based Transformer-based Feature extractor weights 43504 43504 43504 Fully connected weights 267264 267264 267264 All trainable weights 130832 130832 130832 8. Experiments and data collection 8.1. Data generation For the term generation procedure, we use a recursive random-based algorithm, which allows us to evenly distribute Variables, Applications, and Abstraction across the term body, which allows us to collect as many possible term variants for the defined distribution of variables. The algorithm consists of the following steps: 1. Set probabilities for generation Abstraction, Application, and Atom terms. 2. Set a list of variables used for the generation of terms. 3. Set the maximum limit of vertices in the generated term. 4. Considering probabilities, perform one of the three actions: a. Generate an Atom term based on a random variable from the list of variables. b. Generate an Abstraction term based on a head, which is a random variable from the list of variables, and a body, which is a randomly generated term with options 1 and 2 and a decreased number of vertices. c. Generate an Application term based on the subject and object, which are randomly generated with terms with options 1 and 2, and a decreased number of vertices. 5. Decrease the limit of vertices on one and return to 4 until the limit of the vertices reaches 0. After generating a batch of terms, we filter them out on actual count vertices, similar terms, and reduction filtering; for example, in the LO strategy, we can filter out terms represented by quite long sequences appearing while reducing. 8.2. Datasets and metrics To increase the collection of terms and obtain terms with different counts of steps for the selected strategies, we normalize the collected terms with the LO strategy. For each term, we obtain its simplified form and count steps for the LO and the RI strategies. After filtering out similar terms (in simplified terms) and formulating datasets for each strategy, we cut out terms that have 31 or more steps on the reduction left. To increase the complexity of the test/validation set, we introduced more filters into the term generation procedure, which gave us an independent dataset. Table 2 Generated dataset sizes with simplified terms and its LO and RI reduction steps LO datasets RI datasets Train set Samples 38272 (~91%) 34851 (~91%) Tokens ~4.5M ~3.8M Test / Samples 3691 (~9%) 3431 (~9%) Validation set Tokens ~405k ~340k Total 41963 38282 It should be noted that generating and collecting current count training samples (shown in Table 2) was not random. We focused on the Chinchilla low recommendation for training the transformer model, which states that for training a transformer model, we should have the number of training tokens at least 20x from count trainable weights [21]. So as shown in Table 1 our current Transformer-based model has 156k trainable weights, so we need at least 3M training tokens, and as shown in Table 2 we have a minimum of 3.8M training tokens which should be sufficient for training current models. Figure 4: Term sequences length distribution for (a) LO dataset, (b) RI dataset Consider analyzing the sequence distribution of collected terms, although datasets for LO and RI steps are built on the same terms, they can still have different reduction lengths, which impact on filtered out records in both datasets. The length sequences for LO terms and RI terms are shown in Fig 4, as you can see these plots are approximately the same, but LO terms have longer terms (see Fig 4(a)). In addition, we must admit that choosing 31 classes as model outputs stipulates count reduction step distributions, as shown in Fig 5. For distribution LO steps, as you can see in Fig 5(a) most terms are in boundaries 30 steps, and the same is true for distribution RI steps shown in Fig 5(b). Figure 5: Reduction steps number distribution for (a) LO dataset and (b) RI dataset 9. Model Training We decided to solve this problem as a classification problem. The classical strategy for training a Machine Learning model for this problem is to select the categorical cross entropy as the loss function. We also determine a training procedure by training the model with checkpoints to save the best by the validation accuracy model to a file, so that we can train the model to an overfitting state of still having the most robust solution. It is also necessary to emphasize that we use the Adam optimizer, which allows automatic adjustment of the learning rate and takes into account the gradient momentum [21]; practically, it is the best for training models for sequential processing, especially transformers [11, 15]. Figure 6: Transformer-based model training curves on the LO dataset: (a) loss curve, (b) accuracy curve In total we trained 6 models: two models of each selected type were trained on LO and RI datasets. Typical curves for the training and validation sets representing the learning process are shown in Fig 6(a) for loss value and accuracy in Fig 6(b). The curves shown in Fig 6 represent the training of the Transformer-based model for the prediction of count LO steps. In these plots (both loss and accuracy), we should consider moments when the training and validation curves begin to diverge, which means that the model tends to overfit and cannot learn any more common features for the train and validation datasets; therefore, we should stop model training from here or save current model weights. Therefore, as a training strategy, we choose to set more count training epochs than required and validation accuracy tracking: saving model weights that provide the best validation accuracy. The saved models were used for further testing. 10. Model performance and generalization As mentioned earlier, we evaluated model performance not only using accuracy metric, which is typical for classification problems but also using Mean Absolute Error and Root Mean Squared error. The results of the training Convolution-based, LSTM-based, and Transformer-based models on the LO-specific data are shown in Table 3. The overall performance did not seem high, and despite the high accuracy of the LSTM-based model, it still had the same MAE and RMSE levels, indicating that the model had many overestimations. In addition, we must admit that the high difference between MAE and RMSE values for all models indicates that for some cases, models made big mistakes like a term had 30 reduction steps, but the models predicted 0. In addition, the inherent performance of the Transformer-base model can be explained by the fact that it has limited input sequence lengths and cannot provide a full sequence view for some samples. Table 3: Result of training selected models on the LO dataset Accuracy MAE RMSE train test train test train test Convolution- 33.6% 31.7% 2.86 2.75 4.84 4.7 based LSTM-based 40.0% 37.8% 2.91 2.74 5.28 5.16 Transformer- 29.7% 27.2% 3.1 2.36 5.31 5.55 based The results of training the same models on RI-specific data are shown in Table 4. Comparing the results achieved on identical models on the LO and RI datasets shows that models trained on the RI dataset have better performance, which can be explained by the fact that, on average, LO term data have more tokens than RI term data (shown in Fig 4). The LSTM-based model trained on RI data exhibited the best performance and lowest error rate. In addition, we should emphasize that the models trained on RI had the lowest overall MAE and RMSE values, which indicates a low step prediction error. Table 4: Result of training selected models on the RI dataset Accuracy MAE RMSE train test train test train test Convolution- 48.7% 47% 1.33 1.44 2.59 2.76 based LSTM-based 68.8% 54.3% 0.509 1.29 1.25 2.7 Transformer- 33.4% 28.6% 1.77 2.25 2.98 3.63 based A detailed performance comparison is presented in Fig 7, here are plots with comparison MAE values depending on the actual reduction step count (term complexity, in other words). These plots are achieved in the following way: all data is split into 31 sets by actual reduction count and for each set tested on a corresponding model and using real and predicted values calculated MAE, that gives us one dot on the plot with coordinates: (step, MAE). This approach allows the analysis of model performance on different term complexities. As shown in Fig 7 a simpler term means a lower error, and the maximum error value is higher for LO-data trained models. For all plots except Fig 7(e), the training error is close to the test error, which indicates that our strategy for preventing overfitting works well on these datasets. Figure 7: MAE depending on term complexity for: (a) Convolution-model and LO-data, (b) LSTM-model and LO-data, (c) Transformer-model and LO-data, (d) Convolution-model and RI- data, (e) LSTM-model and RI -data, (f) Transformer-model and RI -data 11. Conclusions and Discussion Using trained models to define the number of reduction steps for a given term is a very promising approach for automating the search for the optimal reduction strategy for each term independently. In combination with our approach for estimating the computational resources required for one reduction step [3], it may decrease the time for the entire term normalization process. With these results, we can expand the scope of programming language optimization. Our findings show that a simplified term representation is sufficient for predicting count reduction steps for a given strategy, but for eliminated depth. The experiments demonstrate that models struggle to process long sequences, and terms that are more complex in terms of the number of reduction steps tend to yield less reliable predictions. Furthermore, it is crucial to consider the removal of information regarding term variables, as this has a substantial effect on the behavior of term reduction. 12. Future work The high error rate for models trained on the LO dataset with longer token sequences and error growth due to increasing term complexity indicate that using these models for determining the optimal choice between the Leftmost Outermost and the Rightmost Innermost strategies is not sufficiently robust. Therefore, in future research, we can consider term representation with variables or term combinatorial representation [22], which eliminates the use of variables in some cases. However, training independent models for each reduction strategy and choosing an optimal strategy from a given strategy batch is not an effective solution. We can consider training a single model that provides the answer to which strategy is the most efficient. Furthermore, it would be beneficial to create generative models that identify the optimal redex for each term reduction step, which can help develop a balanced strategy in terms of reducing complexity and count reduction steps. References [1] Alan M. Turing. Computability and λ-Definability. The Journal of Symbolic Logic. 2 (December 1937): 153–163. DOI:10.2307/2268280. [2] Oleksandr Deineha, Volodymyr Donets, and Grygoriy Zholtkevych. On Randomization of Reduction Strategies for Typeless Lambda Calculus. Proceedings of conference ICTERY 2023, in press. URL: https://icteri.org/icteri-2023/proceedings/preview/01000021.pdf. [3] Oleksandr Deineha, Volodymyr Donets, and Grygoriy Zholtkevych. Estimating Lambda-Term Reduction Complexity with Regression Methods. CEUR Workshop Proceedings 2023, in press. [4] Clemens Grabmayer. Linear Depth Increase of Lambda Terms along Leftmost-Outermost Beta- Reduction. November 2019. URL: https://doi.org/10.48550/arXiv.1604.07030. [5] Kazuyuki Asada, Naoki Kobayashi, Ryoma Sin'ya, and Takeshi Tsukada. Almost Every Simply Typed Lambda-Term Has a Long Beta-Reduction Sequence. Logical Methods in Computer Science, February 2019, Volume 15. DOI:10.23638/LMCS-15(1:16)2019. [6] Ugo Dal Lago, Gabriele Vanoni. On randomised strategies in the λ-calculus. Theoretical Computer Science, Volume 813, Pages 100-116 (2020). DOI:10.1016/j.tcs.2019.09.033. [7] Xiaochu Qi. Reduction Strategies in Lambda Term Normalization and their Effects on Heap Usage. 2004. URL: https://arxiv.org/abs/cs/0405075. [8] Zhen Yang, Ming Ding, Qingsong Lv, Zhihuan Jiang, Zehai He, Yuyi Guo, Jinfeng Bai, Jie Tang. GPT Can Solve Mathematical Problems Without a Calculator. Machine Learning (September 2023). URL: https://arxiv.org/abs/2309.03241. [9] Chenxiao Liu1, Shuai Lu, Weizhu Chen, Daxin Jiang, Alexey Svyatkovskiy, Shengyu Fu, Neel Sundaresan, Nan Duan. Code Execution with Pre-trained Language Models. Accepted to the Findings of ACL 2023. URL: https://arxiv.org/abs/2305.05383. [10] Chris Cummins, Volker Seeker, Dejan Grubisic, Mostafa Elhoushi, Youwei Liang, Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Kim Hazelwood, Gabriel Synnaeve, Hugh Leather. Large Language Models for Compiler Optimization. (September 2023). URL: https://arxiv.org/abs/2309.07062. [11] Brando Miranda, Avi Shinnar, Vasily Pestun, Barry Trager. Transformer Models for Type Inference in the Simply Typed Lambda Calculus: A Case Study in Deep Learning for Code. Computer Science (March 2023). URL: https://arxiv.org/abs/2304.10500. [12] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep Learning. Nature 521: 436-444 (2015). DOI:10.1038/nature14539. [13] Liang Yao, Chengsheng Mao and Yuan Luo. Graph Convolutional Networks for Text Classification. AAAI Conference on Artificial Intelligence (September 2018). DOI:10.1609/aaai.v33i01.33017370. [14] Beakcheol Jang, Myeonghwi Kim, Gaspard Harerimana, SangUk Kang, and Jong Wook Kim. Bi- LSTM Model to Increase Accuracy in Text Classification: Combining Word2vec CNN and Attention Mechanism. Applied Sciences (2020). DOI:10.3390/app10175841. [15] Jimmy J. Lin, Rodrigo Nogueira, and Andrew Yates. Pretrained Transformers for Text Ranking: BERT and Beyond. Proceedings of the 14th ACM International Conference on Web Search and Data Mining (October 2020). DOI:10.1145/3437963.3441667. [16] M. M. Krell, Bilal Wehbe. A First Step Towards Distribution Invariant Regression Metrics. 2020. URL: https://arxiv.org/abs/2009.05176. [17] Yoshua Bengio, Paolo Frasconi. Input-output HMMs for sequence processing. IEEE transactions on neural networks 7 5: 1231-49 (1996). DOI:10.1109/72.536317. [18] William M. Campbell, Elliot Singer, Pedro A. Torres-Carrasquillo, and Douglas A. Reynolds. Language recognition with support vector machines. The Speaker and Language Recognition Workshop (2004). [19] Sin-Wai Chan. Routledge Encyclopedia of Translation Technology. Routledge. p. 454 (November 2014). ISBN 978-1-317-60815-8. [20] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals and L. Sifre. Training Compute-Optimal Large Language Models. (2022). URL: https://arxiv.org/abs/2203.15556. [21] Sebastian Ruder. An overview of gradient descent optimization algorithms. Vestnik kompiuternykh i informatsionnykh tekhnologii (2016). URL: https://arxiv.org/abs/1609.04747. [22] Hendrik Pieter Barendregt. The Lambda Calculus, Its Syntax and Semantics. Studies in Logic and the Foundations of Mathematics. Vol. 103 (1984). ISBN 0-444-87508-5.