Performance evaluation and analysis with code benchmarking and generative AI

Andrii Berko1,†, Vladyslav Alieksieiev1,† and Artur Dovbysh1,∗,†
1 Lviv Polytechnic National University, 12, S. Bandery str., Lviv, 79014, Ukraine

Abstract
This paper explores code benchmarking techniques and their integration with advanced generative Artificial Intelligence (AI) models, emphasizing the need for continuous performance optimization in data-driven industries. Benchmarking is essential for evaluating and comparing hardware and software performance, identifying bottlenecks, and developing improvement strategies. The study reviews benchmarking theory and practice, detailing key parameters, steps, challenges, and solutions for researchers and practitioners. It examines benchmarks for various sorting algorithms, highlighting the implications for algorithm selection and implementation. Innovatively, the study uses AI models such as GPT-3.5-turbo and Gemini 1.5 Pro to analyze algorithmic benchmarks, categorizing efficiency and redefining performance evaluation. The effectiveness of this approach is assessed using the F-score metric, providing insights into AI model performance. The research demonstrates the potential of integrating benchmarking techniques with generative AI, marking a significant advancement in automated code analysis and offering valuable implications for software development and AI applications.

Keywords
Benchmarking, performance analysis, generative AI models, Machine Learning, testing, statistics

MoMLeT-2024: 6th International Workshop on Modern Machine Learning Technologies, May 31 - June 1, 2024, Lviv-Shatsk, Ukraine
∗ Corresponding author.
† These authors contributed equally.
artur.v.dovbysh@lpnu.ua (A. Dovbysh); vladyslav.i.alieksieiev@lpnu.ua (V. Alieksieiev); andrii.y.berko@lpnu.ua (A. Berko)
0009-0004-1912-4887 (A. Dovbysh); 0000-0003-0712-0120 (V. Alieksieiev); 0000-0001-6756-5661 (A. Berko)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Brief overview of performance benchmarking terms

Performance benchmarking is a critical exercise that provides key insights into the operating efficiency and overall competitiveness of a business within an industry. It is the practice of comparing the performance of processes, operations, or strategies against that of relevant and comparable counterparts, often referred to as "leaders", to identify strengths, weaknesses, and opportunities for improvement.

This section offers a brief overview of the most commonly used terms in the realm of performance benchmarking. For those who are new to the concept, understanding these terms can help in deciphering the complex language of benchmarking, making it easier to implement within a business scenario. The goal here is to broaden your knowledge and equip you with the vocabulary needed for an effective and efficient benchmarking process.

1.1. Basic terms and benchmark-related statistical metrics

Starting from the very basics: in computer science, a benchmark is the act of running a computer program, a set of programs, or other operations in order to assess the relative performance of an object, normally by running a number of standard tests and trials against it. Accordingly, computer performance is the amount of useful work accomplished by a computer system.
Outside of specific contexts, computer performance is estimated in terms of the accuracy, efficiency, and speed of executing computer program instructions.

Now let's recall the primary statistical metrics used in benchmarking: mean, standard deviation, and median [1]. To explain why we focus so heavily on these specific metrics in a benchmarking context, consider the following arguments. Firstly, the mean offers a straightforward measure of central tendency, giving a clear indication of the average value within a dataset. This makes it a fundamental metric for understanding typical performance or characteristics. Secondly, the standard deviation complements the mean by providing a measure of the spread or dispersion of data points around the mean. It offers insight into the variability within the dataset, which is crucial for assessing consistency and identifying outliers. Lastly, the median provides a robust alternative to the mean, particularly in datasets with skewed distributions or outliers. It represents the middle value when data points are sorted, making it less sensitive to extreme values than the mean. Together, these metrics offer a comprehensive view of data distribution, making them essential tools for benchmark analysis, where accurate comparisons and reliable insights are paramount [1].

Mean is the average value of all data points in a dataset. In the realm of code benchmarking, the mean is often used as a representative value to indicate average performance. A lower mean generally signifies better performance. However, it is crucial to interpret the mean alongside other measures, as one abnormally high or low value in the dataset (an outlier) can significantly skew the mean, presenting a misleading picture of performance.

Standard Deviation (StdDev) measures the dispersion of a dataset relative to its mean. If the standard deviation of a code's performance is low, the results are close to the average, signifying consistent performance. If the standard deviation is high, the results vary considerably, indicating inconsistent performance. Therefore, in code benchmarking, a lower standard deviation is typically more desirable, as it implies reliability and consistency in the code's performance.

Median is the middle value in a dataset when the values are arranged in ascending or descending order. In code benchmarking, median values can be particularly relevant because, unlike the mean, they are not affected by outliers. Hence, the median can provide a more accurate representation of the "typical" performance of a piece of code. This metric is very useful in situations with significant variability, providing a more stable measure of central tendency. Therefore, when we talk about improving performance, we often look at reducing the median time [1-3].

In sum, the mean, standard deviation, and median are statistical measures that provide complementary insight into the structure and tendencies of your data when benchmarking the performance of code. By understanding all three, you can gain a much more comprehensive understanding of your code's efficiency and consistency.
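To make the relationship between these metrics and the raw measurements concrete, the following minimal C# sketch (illustrative only, not part of the paper's tooling) computes the mean, sample standard deviation, and median of a set of benchmark timings:

```csharp
using System;
using System.Linq;

static class BenchmarkStats
{
    // Summarizes raw benchmark timings (in microseconds) with the three
    // metrics discussed above: mean, sample standard deviation, and median.
    public static (double Mean, double StdDev, double Median) Summarize(double[] samplesUs)
    {
        double mean = samplesUs.Average();

        // Sample standard deviation: dispersion of the timings around the mean.
        double variance = samplesUs.Sum(x => (x - mean) * (x - mean)) / (samplesUs.Length - 1);
        double stdDev = Math.Sqrt(variance);

        // Median: middle element of the sorted timings (average of the two middle
        // elements when the count is even); robust to outliers.
        double[] sorted = samplesUs.OrderBy(x => x).ToArray();
        int mid = sorted.Length / 2;
        double median = sorted.Length % 2 == 1
            ? sorted[mid]
            : (sorted[mid - 1] + sorted[mid]) / 2.0;

        return (mean, stdDev, median);
    }
}
```

A single outlier timing inflates both the mean and the standard deviation of such a sample while leaving the median almost untouched, which is exactly why the three values are reported together.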
1.2. Importance of performance testing and benchmarking and related problems

Performance testing and benchmarking are critical aspects of the development process for systems, applications, and networks. They play fundamental roles in ensuring that these digital assets meet their intended functionality and service quality.

Performance testing is a methodical practice that helps developers examine how a system behaves or responds under different circumstances, including workload, operational speed, and stability thresholds. It can help identify bottlenecks, understand system limitations, and confirm whether a system can meet its expected performance criteria. Benchmarking, on the other hand, is a way to compare the efficiency, speed, and quality of a system, software, or hardware against industry standards or competitor products. The tools and metrics used in benchmarking enable developers and stakeholders to gain valuable comparisons and insights into where their product stands in the competitive landscape. It provides an objective way to identify areas for improvement and explore opportunities for enhancement. Ultimately, both performance testing and benchmarking contribute to delivering a high-quality, efficient, and user-friendly product to end users.

Performance testing and benchmarking are not just about finding faults and addressing them; they are also about preemptively creating a better user experience. In today's digital age, users have little tolerance for slow and unresponsive applications. Systems that have been thoroughly performance tested and benchmarked tend to have fast load times and better functionality, and can seamlessly handle high traffic. These characteristics help boost user engagement, which in turn positively impacts customer loyalty and overall business success. Moreover, these processes are also vital for risk mitigation. System crashes and failures resulting from previously undetected issues can lead to considerable financial and reputational damage. Performance testing helps uncover such potential problems before deployment, decreasing the chance of unforeseen system downtime. Benchmarking, meanwhile, ensures the product matches or outperforms market standards, reinforcing its reliability and credibility among users [4, 5].

Furthermore, consistent performance testing and benchmarking allow developers to stay informed about the innovative techniques, tools, and methods that leading competitors use to maintain their exceptional performance levels. This drives continuous system improvement in a competitive environment. Ultimately, the combination of performance testing and benchmarking forms the basis for delivering superior-quality products that meet user needs and stand out in the competitive digital market. These practices offer a clear pathway for product enhancement, as they not only reveal opportunities for improvement but also facilitate a data-driven approach toward seizing them. In sum, the importance of performance testing and benchmarking cannot be overstated, given that they are essential contributors to the development of an efficient, competitive, and user-friendly digital product.

2. Benchmark analysis with AI models

One of the main problems we have faced, and continue to face, in everyday work with benchmarks is that many people simply do not understand how to work with these metrics. That is, programmers can easily write and run appropriate performance measurements in the form of benchmarks, but they usually do not know how to work with the results. One of the goals of this paper is to check whether modern artificial intelligence models can help with this task. To that end, we conducted an experiment with the clear goal of verifying AI models' capabilities in this area [6].
The experiment aims to evaluate ten well-known sorting algorithms, ranging from inefficient ones such as Bubble Sort to more efficient ones like Quick Sort and Tree Sort, among others [7]. This selection is grounded on several factors:

• Each algorithm has a clearly understood efficiency profile
• Applicability for benchmarking in various contexts
• Ability to support the required level of abstraction to represent both efficient and inefficient code

Here is the list of selected algorithms with their time complexities [8]:

Table 1
Sorting algorithms with their time complexity

Algorithm        Time Complexity   Algorithm Code
Bubble Sort      O(n²)             T1
Selection Sort   O(n²)             T2
Quick Sort       O(n log n)        T3
Merge Sort       O(n log n)        T4
Insertion Sort   O(n²)             T5
Heap Sort        O(n log n)        T6
Bucket Sort      O(n + k)          T7
Radix Sort       O(nk)             T8
Cube Sort        O(n log n)        T9
Tree Sort        O(n log n)        T10

Next, we collected benchmarks for each algorithm using several datasets, each comprising arrays of different sizes denoted as 'n'. These scenarios included:

1. 100 elements
2. 10,000 elements
3. 200,000 elements

This approach ensured accurate benchmark measurements across a range of dataset sizes. Subsequently, we proceeded with the selection of AI models. While numerous open-source and freely available models exist, accessible through platforms like "HuggingFace" [9], our selection was based on the following criteria:

• Availability of documentation
• API accessibility
• Support
• Model size

Thus, we opted for two competing models from OpenAI and Google: GPT-3.5-turbo and Gemini 1.5 Pro, respectively. It is worth mentioning that we utilized "raw" models, without any fine-tuning, for the benchmark analysis scenario. Following this, benchmark analyses were conducted with the assistance of both models, wherein each AI assistant classified algorithms as either 'E' (efficient) or 'I' (inefficient). Additionally, it should be mentioned that the models were given solely "raw" benchmark outputs, without specific algorithm names; consequently, all tested methods were renamed using neutral identifiers such as T1, T2, ..., T10. Finally, to evaluate the efficiency of each model in this analysis, we calculated the F-score value and provided feedback [20, 21]. Let's delve into each step in more detail.

2.1. Benchmark collection

For the task of amassing algorithm benchmarks, we opted to use the Benchmark.NET library [10], in conjunction with the corresponding algorithms implemented in C#. This preference was guided by the authors' prior experience in .NET development and proficiency in C#. We set up three iterations of tests, each encompassing datasets consisting of arrays of varying sizes, represented by the variable 'n'. The three iterations comprise 100 elements, 10,000 elements, and 200,000 elements, respectively.
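The following minimal, hypothetical sketch shows how such a suite can be wired up with BenchmarkDotNet; it is not the authors' actual code, and for brevity it contains only one O(n²) representative (a Bubble Sort analogous to T1) plus the built-in Array.Sort as a stand-in for the O(n log n) group:

```csharp
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class SortingBenchmarks
{
    private int[] _source;

    [Params(100, 10_000, 200_000)]   // the three dataset sizes used in this study
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);  // fixed seed so every method sorts the same data
        _source = new int[N];
        for (int i = 0; i < N; i++)
            _source[i] = random.Next(int.MinValue, int.MaxValue); // upper bound exclusive
    }

    [Benchmark]
    public int[] BubbleSort()         // O(n^2) representative, analogous to T1
    {
        int[] a = (int[])_source.Clone();
        for (int i = 0; i < a.Length - 1; i++)
            for (int j = 0; j < a.Length - i - 1; j++)
                if (a[j] > a[j + 1])
                    (a[j], a[j + 1]) = (a[j + 1], a[j]);
        return a;
    }

    [Benchmark]
    public int[] LibrarySort()        // O(n log n) stand-in for the efficient group
    {
        int[] a = (int[])_source.Clone();
        Array.Sort(a);
        return a;
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<SortingBenchmarks>();
}
```

BenchmarkDotNet executes each [Benchmark] method repeatedly for every [Params] value and reports summary statistics such as Mean and StdDev (a Median column can be added as well), which are the values presented in the tables below.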
2.1.1. Benchmark collection on 100 randomly generated elements

In the initial stage of our experimentation, we deliberately chose to randomly generate a compact dataset. Such a dataset, characterized by its small size and inherent randomness, presents a testing ground where a wide array of outcomes can manifest. Surprisingly, inefficient algorithms, which would typically be dismissed on larger and more structured datasets due to their suboptimal performance, can show respectable efficacy within the confines of this modest dataset.

The interplay between algorithm efficiency and dataset size reveals an intriguing phenomenon: under certain circumstances, asymptotic inefficiency can translate into competitive performance on smaller inputs. A key reason is that when 'n' is small, low per-operation overhead and simple control flow dominate, while the asymptotic advantages of more sophisticated algorithms only pay off at scale. This underscores the importance of contextualizing algorithmic efficiency within the specific characteristics and constraints of the dataset under consideration, and highlights the unexpected advantages that nominally inefficient algorithms can offer in certain scenarios [11].

Following the execution of the benchmarking procedure, the resulting outputs were assembled in Table 2. To visualize the relationship between the benchmarks, the corresponding bar chart was built (Figure 1).

Table 2
Sorting algorithms benchmarking results on a 100-element array (us = 0.000001 sec)

Method   Array        Mean        StdDev        Median
T1       Int32[100]   108.34 us   180.52 us     24.400 us
T2       Int32[100]   114.01 us   189.24 us     28.050 us
T3       Int32[100]   35.78 us    106.47 us     2.100 us
T4       Int32[100]   62.91 us    141.53 us     18.650 us
T5       Int32[100]   132.14 us   224.84 us     60.300 us
T6       Int32[100]   42.85 us    127.07 us     2.600 us
T7       Int32[100]   89.07 us    225.09 us     18.250 us
T8       Int32[100]   35.88 us    106.51 us     2.000 us
T9       Int32[100]   428.81 us   1,081.19 us   81.950 us
T10      Int32[100]   80.43 us    239.06 us     4.500 us

Figure 1: Benchmarks Mean outputs for sorting algorithms on the 100-element array

The mean execution times range from 35.78 us to 428.81 us. T3 (Quick Sort) has the lowest mean execution time, while T9 (Cube Sort) has the highest. This is expected, since Cube Sort carries overhead of its own that is very visible on a small dataset; it should become more efficient on larger ones. The standard deviation of the execution times ranges from 106.47 us to 1,081.19 us; T3 has the lowest standard deviation, while T9 has the highest.

Interestingly, even algorithms labeled as 'inefficient' demonstrate satisfactory performance on small datasets, while some algorithms expected to be efficient yielded contrasting results. Even Bubble Sort (T1) and Selection Sort (T2) were reasonably quick over such a short distance: by mean execution time they sit in the middle of the field, ahead of T5 and far ahead of T9.

2.1.2. Benchmark collection on 10000 randomly generated elements

Essentially, we generated 10,000 random integer values distributed over the interval [-2147483648, 2147483647] (int.MinValue and int.MaxValue in .NET, respectively) [12]. This scenario aims to simulate a common situation characterized by a substantial volume of data, albeit without exceeding the usual workload thresholds for the algorithms involved.
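A small, assumed sketch of how such an array can be produced in C# is given below; note that Random.Next(int.MinValue, int.MaxValue) treats its upper bound as exclusive, so drawing raw bytes is one simple way to cover the full Int32 range:

```csharp
using System;

static class DataGenerator
{
    // Generates 'count' integers spanning the whole Int32 range,
    // [-2147483648, 2147483647]. Random.Next(min, max) excludes 'max',
    // so four random bytes are reinterpreted as a signed 32-bit value instead.
    public static int[] GenerateFullRange(int count, int seed = 42)
    {
        var random = new Random(seed);
        var buffer = new byte[4];
        var data = new int[count];
        for (int i = 0; i < count; i++)
        {
            random.NextBytes(buffer);
            data[i] = BitConverter.ToInt32(buffer, 0);
        }
        return data;
    }
}
```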
After collecting benchmarks on the 10K array of random integer values, we obtained the following outputs:

Table 3
Sorting algorithms benchmarking results on a 10000-element array (us = 0.000001 sec)

Method   Array          Mean          StdDev        Median
T1       Int32[10000]   57,869.7 us   1,994.9 us    57,162.45 us
T2       Int32[10000]   45,603.0 us   2,984.7 us    44,987.60 us
T3       Int32[10000]   142.4 us      338.0 us      29.95 us
T4       Int32[10000]   2,126.8 us    2,189.3 us    1,238.85 us
T5       Int32[10000]   48,049.3 us   31,207.5 us   31,704.05 us
T6       Int32[10000]   548.0 us      920.3 us      132.85 us
T7       Int32[10000]   974.7 us      1,517.3 us    540.00 us
T8       Int32[10000]   182.6 us      460.2 us      38.35 us
T9       Int32[10000]   32,819.3 us   7,868.0 us    29,060.65 us
T10      Int32[10000]   315.0 us      646.0 us      106.15 us

The visual representation is given via the bar chart in Figure 2.

Figure 2: Benchmarks Mean outputs for sorting algorithms on the 10000-element array

Starting with the Mean, the average time taken, the most efficient method was T3 (Quick Sort), taking an average of only 142.4 microseconds (us), followed by T8 (Radix Sort) and T10 (Tree Sort), taking 182.6 us and 315.0 us, respectively. At the other end of the spectrum, T1 (Bubble Sort) had the worst performance, with a mean of 57,869.7 us, trailing behind T2 (Selection Sort) and T5 (Insertion Sort), whose means were 45,603.0 us and 48,049.3 us, respectively.

In addition, let's consider the Standard Deviation (StdDev), an important metric that indicates variance in performance: a higher StdDev means the method's performance varies greatly. T5 (Insertion Sort) showed the most inconsistency, with a StdDev of 31,207.5 us, significantly higher than the next highest, T9 (Cube Sort), at 7,868.0 us. On the other hand, relative to its mean, T1 (Bubble Sort) proved to be remarkably consistent, with a StdDev of only 1,994.9 us against a mean of 57,869.7 us.

Some discrepancies between expected and observed behavior remain discernible even with an adequately large dataset. Notably, the shortcomings inherent in the inefficient algorithms, exemplified by Bubble Sort (T1), Selection Sort (T2), and Insertion Sort (T5), now manifest conspicuously. Conversely, the proficient algorithms begin to diverge noticeably from one another in mean execution time. The performance of Cube Sort (T9) presents an intriguing case: it shows a noteworthy relative improvement, albeit only in comparison with the slowest competitors. Further investigation into the underlying factors behind these observations could provide valuable insight into algorithmic optimization and computational efficiency. Finally, let's take a look at the results for the biggest dataset.

2.1.3. Benchmark collection on 200000 randomly generated elements

In this scenario, a substantial workload is expected for each algorithm under examination. To provide context, benchmarking on 200,000 randomly generated integers imposes a significant computational burden: each benchmark execution took approximately 30 minutes on a laptop equipped with 64 GB of RAM and a highly efficient CPU architecture. It is crucial to note that this overall duration is distinct from the sorting algorithms' reported execution times. The results are anticipated with great interest, as they shed light on the performance characteristics of the algorithms under such demanding computational conditions.

After running the benchmarks on this large dataset comprising 200K randomly generated integers, we acquired the outputs presented in Table 4 and the corresponding bar chart (Figure 3).
From the benchmark results table, we can see that the mean execution times for the different methods vary significantly. T3 (Quick Sort) is the most efficient method, with a mean execution time of only 291.7 us. It is followed by T8 (Radix Sort) at 409.9 us, T10 (Tree Sort) at 1,754.2 us, and T6 (Heap Sort) at 3,450.2 us. On the other hand, T1 (Bubble Sort) has the highest mean execution time of 17,480,568.3 us, followed by T2 (Selection Sort) and T5 (Insertion Sort) with mean execution times of 11,477,353.2 us and 8,694,258.2 us, respectively. This was an expected result, because these algorithms are well known to be inefficient sorting algorithms.

Looking at the standard deviation (StdDev) values, T3 (Quick Sort) has the lowest standard deviation of 219.3 us, indicating that its execution times are consistent. T8 (Radix Sort) and T10 (Tree Sort) also have relatively low standard deviations of 507.2 us and 801.1 us, respectively. These results confirm that these algorithms are not merely considered the most efficient ones; they actually are.

Table 4
Sorting algorithms benchmarking results on a 200000-element array (us = 0.000001 sec)

Method   Array           Mean              StdDev           Median
T1       Int32[200000]   17,480,568.3 us   1,178,671.9 us   17,234,825.9 us
T2       Int32[200000]   11,477,353.2 us   567,577.3 us     11,348,932.8 us
T3       Int32[200000]   291.7 us          219.3 us         214.6 us
T4       Int32[200000]   29,027.5 us       9,875.0 us       28,415.8 us
T5       Int32[200000]   8,694,258.2 us    1,337,944.7 us   8,410,619.9 us
T6       Int32[200000]   3,450.2 us        784.9 us         3,237.9 us
T7       Int32[200000]   30,471.5 us       3,352.4 us       30,000.7 us
T8       Int32[200000]   409.9 us          507.2 us         244.0 us
T9       Int32[200000]   780,770.4 us      124,539.2 us     819,716.8 us
T10      Int32[200000]   1,754.2 us        801.1 us         1,375.3 us

Figure 3: Benchmarks Mean outputs for sorting algorithms on the 200000-element array

In contrast, T1 (Bubble Sort) has a very high standard deviation of 1,178,671.9 us, suggesting that its execution times vary significantly. Overall, based on the benchmark results, T3, T8, T10, and T6 appear to be the most efficient methods in terms of mean, standard deviation, and median execution times; they consistently perform well in both speed and variability. In addition, T9 (Cube Sort) finally showed more of its capabilities: it pulled well clear of the inefficient group, running orders of magnitude faster than T1, T2, and T5, although it still trails the leading methods. On the other hand, methods like T1 and T5 appear to be far less efficient, with higher mean execution times and standard deviations, suggesting that they are not as reliable or consistent in their performance as the more efficient methods.

Now let's check whether modern AI models are able to process "raw" benchmark data. Since not everyone knows statistics, and even fewer people know how to analyze the data collected when measuring benchmarks, the help of AI models would be very welcome. The task of the next section is to test the capabilities of AI models in this direction.

2.2. AI-based analysis of performance statistical data

First, let us briefly describe the AI models chosen for this experiment. OpenAI's GPT-3.5-turbo (GPT stands for Generative Pre-trained Transformer) belongs to the third generation of large transformer-based language models, with 175 billion parameters. It is an autoregressive language model that processes and generates text based on the preceding context. GPT-3.5-turbo is often referred to as a "chat model", as it is designed to generate responses in a conversational context.
The model is capable of composing emails, writing code, answering questions, creating written content, tutoring in a variety of subjects, translating languages, simulating characters for video games, and much more. To interact with GPT-3.5-turbo, one sends a series of messages, and the model returns a model-generated message. A conversation typically begins with a system message, which sets the behavior of the assistant, followed by alternating user and assistant messages. The model completes its predictive task based on the entire conversation history [13, 14, 15].

GPT's contender in this battle is Google's Gemini 1.5 Pro model. The Gemini model, formerly offered as Bard, is a conversational AI developed by Google. Gemini is designed to provide users with high-quality, contextually relevant information and insights. It draws from Google's vast data resources, enhancing user interactions by delivering relevant and precise answers. Gemini builds on Google's earlier conversational technology such as LaMDA (Language Model for Dialogue Applications), featuring capabilities in understanding and generating human-like text-based responses. This positions it as a competitor to other advanced AI models like OpenAI's ChatGPT. Gemini is part of Google's broader effort to integrate advanced AI into everyday user interactions, helping to simplify information retrieval, enhance learning, and provide a more interactive and intuitive user experience through natural language processing. The development and refinement of Gemini emphasize Google's commitment to leading in AI-driven technologies and their applications across various sectors [16].

In short, these are among the most modern generative AI models available in public access, usable for a wide variety of purposes. Additionally, based on the models' documentation [13, 14, 15, 16, 17], let's review a head-to-head comparison of these models (see Table 5).

The comparison between the OpenAI and Google models reveals distinct characteristics and functionalities that distinguish these two prominent AI providers in the realm of natural language processing. OpenAI's model, with an input context window of 4096 tokens and a matching maximum output capacity of 4096 tokens, positions itself as a contender with a balanced focus on contextual understanding and output generation. In contrast, Google's offering provides a significantly larger input context window of 32.8K tokens, enhancing its capacity to process extensive textual data, coupled with a maximum output allowance of 8192 tokens, thereby catering to the requirements of complex text generation tasks.

Table 5
GPT 3.5-turbo versus Gemini 1.5 Pro models comparison

Characteristic          GPT 3.5 Turbo            Gemini 1.5 Pro
Model provider          OpenAI                   Google
Input context window    4096 tokens              32.8K tokens
Maximum output tokens   4096 tokens              8192 tokens
Release date            November 28th, 2022      December 13th, 2023
Input pricing           $0.5 per 1M tokens       $0.13 per 1M characters
Output pricing          $1.5 per 1M tokens       $0.38 per 1M characters
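Since the two context windows differ by almost an order of magnitude, it can be useful to estimate up front whether a benchmark CSV will fit. The sketch below is a rough heuristic of our own (roughly four characters per token for English text), not an exact tokenizer:

```csharp
using System;
using System.IO;

static class ContextBudget
{
    // Rough heuristic (~4 characters per token) for checking whether a CSV
    // attachment is likely to fit into a model's input context window,
    // leaving some room for the model's answer.
    public static bool FitsInContext(string csvPath, int contextWindowTokens, int reservedForAnswer = 512)
    {
        long chars = new FileInfo(csvPath).Length;
        long approxTokens = chars / 4;   // crude estimate, not an exact tokenizer
        return approxTokens + reservedForAnswer <= contextWindowTokens;
    }
}
```

For instance, one could call FitsInContext(path, 4096) for GPT-3.5-turbo and FitsInContext(path, 32_800) for Gemini 1.5 Pro, mirroring the context windows listed in Table 5.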
Furthermore, the release dates of these models shed light on their respective timelines of entry into the market. OpenAI introduced its model earlier, on November 28th, 2022, gaining an earlier foothold in the evolving landscape of AI-driven solutions for natural language processing. Google, on the other hand, unveiled its model later, on December 13th, 2023, indicating a subsequent entrance into the competitive arena of AI model deployment. This temporal discrepancy may influence user preferences based on the models' maturity, refinement, and adaptability to evolving industry standards and demands.

In terms of pricing structure, OpenAI and Google take divergent approaches to charging for input and output processing. OpenAI offers input pricing at $0.5 per 1 million tokens, a cost-effective model for users relying on token-based input accounting. Google, conversely, adopts a pricing scheme of $0.13 per 1 million characters for input data, catering to users inclined towards character-based accounting. Similarly, the output pricing contrasts between the providers, with OpenAI charging $1.5 per 1 million tokens and Google offering output processing at a rate of $0.38 per 1 million characters. These pricing differentials underscore the nuanced considerations users must weigh when selecting a model provider in light of their specific budgetary constraints and processing requirements [19].

2.2.1. GPT 3.5 Turbo

We attached the benchmark outputs to the model's context as CSV files and outlined our target scenarios. To enhance the accuracy and intrigue of the experiment, we decided to task the AI model with classification based on:

1. Benchmark outputs for 100 elements - Algorithms_Benchmarks_100.csv
2. Benchmark outputs for 10,000 elements - Algorithms_Benchmarks_10000.csv
3. Benchmark outputs for 200,000 elements - Algorithms_Benchmarks_200000.csv
4. Utilizing all available outputs

Note: for those who want to repeat the experiment, all of the above-mentioned files can be found at the following link: https://github.com/d0vbysh/MoMLeTDS2024_ModelsContext

The directive given to the model was: "Having attached benchmark files for ten methods representing different algorithms to this chat context, could you please classify these methods as 'I-inefficient' or 'E-efficient'? Please provide the answer in a simple table format like T1 - I, T2 - E, etc. Additionally, I kindly ask you to perform this evaluation four times: once using only the 'Algorithms_Benchmarks_100.csv', once using 'Algorithms_Benchmarks_10000.csv', once using 'Algorithms_Benchmarks_200000.csv', and once using all available files in the context."
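In our runs the CSV files were attached through the chat interface; the plain GPT-3.5-turbo chat completions API does not itself accept file attachments, so a rough programmatic equivalent (an assumption for illustration, not the tooling used in the experiment) is to inline the CSV text into the user message:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public class BenchmarkClassifier
{
    // Sends the directive plus the raw CSV content to the chat completions endpoint
    // and returns the model's reply. The system message wording is our own assumption.
    public static async Task<string> ClassifyAsync(string apiKey, string csvContent, string directive)
    {
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", apiKey);

        var payload = new
        {
            model = "gpt-3.5-turbo",
            messages = new object[]
            {
                new { role = "system", content = "You are an assistant that analyzes raw code benchmark statistics." },
                new { role = "user", content = directive + "\n\n" + csvContent }
            }
        };

        var response = await http.PostAsync(
            "https://api.openai.com/v1/chat/completions",
            new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json"));
        response.EnsureSuccessStatusCode();

        using var doc = JsonDocument.Parse(await response.Content.ReadAsStringAsync());
        return doc.RootElement.GetProperty("choices")[0]
                  .GetProperty("message").GetProperty("content").GetString();
    }
}
```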
The output from the model was as follows: "Based on the provided data, we can classify the algorithms as:"

Table 6
GPT 3.5-turbo benchmark analysis outputs

Method   Answer – 100   Answer – 10K   Answer – 200K   Answer with full context
T1       I              I              I               I
T2       I              E              E               I
T3       E              I              I               I
T4       I              E              E               E
T5       I              E              E               E
T6       E              E              E               E
T7       I              E              E               I
T8       E              E              E               E
T9       I              I              I               I
T10      I              E              E               E

As you can see, GPT 3.5 Turbo produced quite accurate classification results, with only a few errors in the answers. Let's calculate the F-score for the model's classification results. Firstly, the F-score, also known as the F1-score or F-measure, is a measure of a model's accuracy that considers both precision (P) and recall (R). It expresses the balance between precision (the proportion of true positive results among the positive results predicted by a model) and recall (the ability of a model to find all the relevant cases in a dataset). The F-score is particularly useful when you want to compare two classifiers directly, as it combines these two metrics into a single number [20, 21]. The formula for the F-score is:

F1 = 2 · (P · R) / (P + R), where:

• Precision is the number of true positive results divided by the number of all positive results, including those identified incorrectly. Its formula is P = TP / (TP + FP).
• Recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Its formula is R = TP / (TP + FN).

In these formulas:
- TP is the number of true positives (items correctly identified as positive),
- FP is the number of false positives (items incorrectly identified as positive),
- FN is the number of false negatives (items incorrectly identified as negative).

Let's proceed with a simple example of the F-score calculation for the GPT model's classification results based on the 100-element array benchmarks. According to the model's outputs, we have two correctly identified inefficient algorithms (TN) and the same number of correctly classified efficient algorithms (TP). On the other hand, we have five algorithms wrongly recognized as inefficient (FN) and one inaccurate classification of an algorithm as efficient (FP). With these numbers, we get P and R:

P = TP / (TP + FP) = 2 / (2 + 1) = 0.6667;  R = TP / (TP + FN) = 2 / (2 + 5) = 0.2857

And having P and R, we can calculate the F-score value: F1 = 2 · P · R / (P + R) = 0.4.

Having all these data in place, we obtained the F-scores for the GPT 3.5 Turbo model (see Table 7).

Table 7
GPT 3.5-turbo F score

Metric    Answer – 100   Answer – 10K   Answer – 200K   Answer with full context
P         0.6667         0.7143         0.8571          1
R         0.2857         0.7143         0.8571          0.6667
F score   0.4            0.7143         0.8571          0.8

The model performs very well in correctly identifying both positive and negative instances, with few false positives and false negatives. The high F-score in the full-context scenario suggests strong overall performance and reliability of the model. Only in the case of the 100-element array was the model less accurate; we assume the reason is that this dataset is quite tiny. In addition, we must recall that the models process only raw benchmarking data, without any additional context about it. We consider such results very encouraging, and this AI model should definitely be considered a solid option for performance analysis scenarios.
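For completeness, the precision, recall, and F-score arithmetic above can be reproduced with a few lines of C#; the sketch below simply re-derives the worked example (TP = 2, FP = 1, FN = 5) and can be reused for the other columns of Table 7 and for Table 9 below:

```csharp
using System;

static class FScore
{
    // Precision, recall, and F1 from a binary confusion matrix.
    public static (double P, double R, double F1) Compute(int tp, int fp, int fn)
    {
        double p = (double)tp / (tp + fp);   // precision
        double r = (double)tp / (tp + fn);   // recall
        double f1 = 2 * p * r / (p + r);     // harmonic mean of precision and recall
        return (p, r, f1);
    }

    public static void Main()
    {
        // Worked example from the text: GPT-3.5 Turbo on the 100-element benchmarks
        // gives P ≈ 0.6667, R ≈ 0.2857, F1 = 0.4.
        var (p, r, f1) = Compute(tp: 2, fp: 1, fn: 5);
        Console.WriteLine($"P = {p:F4}, R = {r:F4}, F1 = {f1:F4}");
    }
}
```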
Let's now evaluate the Gemini 1.5 Pro model to obtain data for an efficiency comparison.

2.2.2. Gemini 1.5 Pro

Repeating the whole experiment with GPT's opponent, we obtained the results shown in Table 8 and the corresponding F-scores in Table 9. The Gemini model demonstrates commendable performance, and it is most accurate on the dataset containing 10,000 elements. However, as the dataset size increases to 200,000 elements, and when all execution results are taken together, the accuracy diminishes noticeably. Despite this drawback, Gemini endeavors to explain its classification decisions. Through an analysis of the available data, the model provides insightful arguments:

"Algorithms characterized by low standard deviation, such as T3 and T8, are deemed preferable when consistency in performance is paramount. On the other hand, algorithms like T4 and T7 may find applicability in niche scenarios where the input size remains relatively modest. The variable efficiency of T9 renders it a specialized algorithm, potentially well-suited for specific data types or distributions. Conversely, algorithms such as T1, T2, and T5 are advised against for large datasets or time-critical applications due to their persistent inefficiency."

Table 8
Gemini 1.5 Pro benchmark analysis outputs

Method   Answer – 100   Answer – 10K   Answer – 200K   Answer with full context
T1       E              I              I               I
T2       I              I              I               I
T3       E              I              I               I
T4       E              E              E               E
T5       I              E              E               E
T6       E              E              E               E
T7       I              E              E               E
T8       E              E              E               E
T9       I              E              I               I
T10      I              E              E               E

Table 9
Gemini 1.5 Pro F score

Metric    Answer – 100   Answer – 10K   Answer – 200K   Answer with full context
P         0.6            0.8571         0.8333          0.8333
R         0.4286         0.8571         0.7143          0.7143
F score   0.5            0.8571         0.7692          0.7692

This detailed examination of the various benchmarks exemplifies the model's capacity to generate insights without any fine-tuning. Such capabilities bode well for the continued evolution and enhancement of the Gemini model within the scientific community. Further work could delve into the implications of these findings for the field of machine learning, the significance of accurate classification algorithms in various industries, and potential avenues for future research to optimize the performance of models like Gemini in diverse real-world applications.

3. Summary

Drawing on rigorous experiments, we conclude that artificial intelligence models can be considered viable tools to assist with performance testing routines. Both models deliver impressive and accurate results without any fine-tuning and without additional context about the algorithms' characteristics; they work solely on raw benchmark data. This is a field well worth exploring and developing further. Introducing fine-tuning for the models could be a game changer; however, it is out of scope for this paper. In our professional experience, not all developers fully grasp how to apply this method, though a notable majority certainly employ it. Gemini 1.5 Pro is particularly proficient on the smaller datasets, whereas GPT-3.5-turbo demonstrates greater precision as the dataset size and the amount of context grow. No matter which model you opt for, our investigation strongly suggests that with thorough and patient preparation we can achieve even greater results in code performance analysis. Potentially, we can achieve automatic detection of problematic code based on a trained model's experience. This points to a new possibility in how we approach and understand the efficiency of code.

The future of performance testing and statistics with AI models holds significant promise and potential for innovation in the field of technology. As artificial intelligence continues to advance, the integration of AI models into performance testing is poised to revolutionize the analysis and optimization of code performance. By harnessing the power of AI algorithms, developers can expect a paradigm shift in how performance testing is conducted, moving towards more efficient and automated processes. One key aspect the future holds for AI models in performance testing is the use of machine learning techniques to enhance the accuracy and reliability of testing results. Through the application of sophisticated algorithms, AI models can autonomously detect performance bottlenecks, identify areas of inefficiency, and provide actionable insights for optimization. This capability not only streamlines the testing process but also augments the overall quality of code performance analysis.
Furthermore, the future trajectory of AI models in performance testing envisions the potential for automatic detection of problematic code segments based on the accumulated experience and training of the models. By leveraging this predictive capacity, developers can proactively address coding inefficiencies, enhance code quality, and ultimately improve the overall performance of software applications. This transformative approach heralds a new era in software development, in which AI-driven insights play a pivotal role in optimizing code efficiency and performance. As this field continues to evolve, the collaborative efforts of researchers, developers, and data scientists will be crucial in unlocking the full potential of AI models in performance testing and statistics. By fostering interdisciplinary collaboration and embracing innovative technologies, the future landscape of performance testing with AI models holds a wealth of possibilities for advancing the efficiency, reliability, and scalability of software systems.

The objectives set for this paper have been effectively achieved, delivering clear and promising results. As we conclude this project, we find ourselves with a deeper understanding and a vision for further investigation. We eagerly anticipate what the next phase of our investigation holds as we endeavor to uncover more insights in this captivating and complex field. We are confident that our ongoing work will extend the knowledge we have established within this fascinating topic.

References

[1] Andrey Akinshin, Pro .NET Benchmarking: The Art of Performance Measurement, 1st ed., 2019, ISBN-10 1484249402: 97-345.
[2] Thomas Nield, Essential Math for Data Science. Take Control of Your Data with Fundamental Linear Algebra, Probability, and Statistics, 1st ed., 2019, ISBN-13 9781098102937: 42-271.
[3] Samuel Kounev, Klaus-Dieter Lange, Jóakim von Kistowski, Systems Benchmarking: For Scientists and Engineers, 1st ed., 2020, ISBN-13 978-3030417048.
[4] Brendan Gregg, Systems Performance (Addison-Wesley Professional Computing Series), 2nd ed., 2020, ISBN-13 978-0-13-682015-4.
[5] K. Singh, K. Singh Dhindsa, B. Bhushan, Performance Analysis of Agent Based Distributed Defense Mechanisms Against DDOS Attacks, Intern. J. of Computing, 17(1), 15-24 (2018). URL: https://doi.org/10.47839/ijc.17.1.945
[6] Olivier Caelen, Marie-Alice Blete, Developing Apps with GPT-4 and ChatGPT: Build Intelligent Chatbots, Content Generators, and More, 1st ed., 2023, ISBN-13 978-1098152482.
[7] Aditya Bhargava, Grokking Algorithms, 2nd ed., 2024, ISBN-13 978-1633438538.
[8] Jay Wengrow, A Common-Sense Guide to Data Structures and Algorithms, 2020, ISBN-13 9781680507225.
[9] HuggingFace forum. URL: https://huggingface.co/posts
[10] Benchmark.NET documentation. URL: https://benchmarkdotnet.org
[11] George Heineman, Learning Algorithms: A Programmer's Guide to Writing Better Code, 1st ed., ISBN-13 9781492091066.
[12] .NET docs. URL: https://github.com/dotnet/docs
[13] Ali Aminian, Alex Xu, Machine Learning System Design Interview, 2023, ISBN-13 978-1736049129.
[14] OpenAI platform documentation. URL: https://platform.openai.com/docs/overview
[15] OpenAI API documentation. URL: https://platform.openai.com/docs/api-reference
[16] OpenAI ChatGPT documentation. URL: https://openai.com/chatgpt
[17] Gemini API documentation. URL: https://ai.google.dev/gemini-api/docs
[18] Gemini documentation. URL: https://deepmind.google/technologies/gemini/#gemini-1.5
[19] AI models analysis. URL: https://artificialanalysis.ai/models
[20] Nathalia Nascimento, Paulo Alencar, Donald Cowan, "Comparing Software Developers with ChatGPT: An Empirical Investigation", arXiv, May 25, 2023. doi: https://doi.org/10.48550/arXiv.2305.11837
[21] I. Gorbenko, A. Kuznetsov, Y. Gorbenko, S. Vdovenko, V. Tymchenko, M. Lutsenko, Studies On Statistical Analysis and Performance Evaluation for Some Stream Ciphers, Intern. J. of Computing, 18(1), 82-88 (2019). URL: https://doi.org/10.47839/ijc.18.1.1277