Classification of Mobile Price Using Machine Learning

Nisha Sunariya1, Avinash Singh1, Mehtab Alam1,∗ and Vibha Gaur1

1 Department of Computer Science, Acharya Narender Dev College, University of Delhi, New Delhi-110019, India

Abstract
Understanding predicted and forecasted prices is critical to developing a successful consumer strategy. The market performance of a product depends on proper pricing. The goal of this research is to determine a price range for mobile phones based on specifications including storage, display, battery life, RAM, camera, and more. This would help consumers make informed decisions when buying a phone that suits their needs and budget. Making the best choice can be difficult with so many options at hand. A model that offers guidance based on the important features of mobile phones was developed to address this problem. To classify and estimate the price range of a mobile phone, this study employs five machine learning (ML) techniques: Support Vector Machine (SVM), Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), and K-nearest neighbors (KNN). The models are trained to produce outcomes that fall into the low, medium, high, or very high categories. The data for this paper was obtained from Kaggle.com. The findings are assessed to achieve the highest level of precision while choosing the most desired features of mobile phones. The findings of this research have practical implications for both consumers and manufacturers. Consumers can make informed decisions based on the identified influential features, considering their preferences and budget constraints. Manufacturers can use the insights to optimize product offerings, emphasizing features that contribute significantly to higher price ranges. This strategic alignment can enhance market competitiveness and consumer satisfaction. This paper also identifies the best option with the most features of mobiles at the lowest price.
Keywords
Support Vector Machine (SVM), Mobile Price, Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), K-nearest neighbors (KNN)

Symposium on Computing & Intelligent Systems (SCI), May 10, 2024, New Delhi, India
∗ Corresponding author.
† These authors contributed equally.
raonisha0908@gmail.com (N. Sunariya); ac-1255@andc.edu.du.ac.in (A. Singh); mahiealam@gmail.com (M. Alam); vibhagaur@andc.du.ac.in (V. Gaur)
0000-0001-7554-2160 (M. Alam)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Pricing is one of the most important factors in business and marketing. A decision regarding pricing regulations has significant effects on management. Pricing establishes the profit margin on products and is one of the first assessments made by many purchasers. Before making a purchase, consumers want to verify the price and are concerned about whether they can afford the item. The success of a product can be affected by a variety of elements, including pricing, product appropriateness, return rates, and profitability [1]. This study takes the first step towards achieving this objective. The main aim of this paper is to identify the most reliable and appropriate ML classification model for the classification of mobile phone prices.

The motivation for this paper stems from the challenges faced by individuals unfamiliar with machine learning while purchasing a mobile phone. It is often difficult for them to discern the crucial features influencing the phone's price. Instead of predicting the exact price, the focus is on establishing a price range that reflects the overall pricing level and on identifying the dominant features affecting mobile phone prices.
This research adds value to discussions on pricing strategies, consumer decision-making, and the application of machine learning in predicting product prices, offering insights applicable across diverse industries.

Machine learning is a pathway to Artificial Intelligence (AI) [2]. Recent AI techniques, such as classification, regression, and supervised and unsupervised learning, are accessible through machine learning [3]. Data analysis and visualization can be aided by a variety of ML tools, such as MATLAB, Python, Cygwin, and others. The categorization of data using ML algorithms can yield accurate results [4].

A mobile phone, sometimes called a cell phone, portable phone, or phone, uses radio frequency links to place and receive calls [5]. People's daily lives have become dependent on mobile devices, which keep them connected and even come in handy in crises. With smartphones currently outpacing older mobile devices in usage, mobile phones have grown to be one of the most widely used consumer products ever created [6]. New mobile phone models with improved features are released every year. A mobile price class prediction model is essential for making the optimum product choice. Additionally, this model can be used to examine pre-owned cars, generators, gold, food, medicines, residences, and many other items.

Several aspects are important to consider when estimating a mobile device's price. These include the device's CPU type, battery life, and capability to set reminders for significant occasions. The device's dimensions and weight are frequently crucial factors for users. Several criteria, such as the amount of internal memory, the quality of the touch screen, the pixel size, and the amount of RAM, affect a mobile device's pricing [7]. This study divides mobile devices into four price ranges based on a variety of features and specifications: low, medium, high, and very high.
These price ranges help in consumer decision-making, competitive pricing, and budget planning. Price ranges can also be used as indicators of economic conditions.

This paper is divided into multiple sections. Section 2 reviews the background work related to this study, while Section 3 briefly defines the prediction models used. The methodology and the experimental findings are presented in Section 4. The conclusion and future directions are provided in Section 5.

2. Background Work

This section describes prior findings on projecting and estimating the cost of various goods. Sameer Chand-Pudaruth predicted the price of used cars in Mauritius. The study found that Naive Bayes and DT were ineffective at handling, categorizing, and forecasting numerical values; because the dataset contained few instances, very low prediction accuracy was reported [8]. M. Asim and Z. Khan [9] forecasted the price of mobile phones. They strove for the most accurate predictions while identifying the model with the lowest cost and the most features. Using the DT, 78% accuracy was obtained. Due to the limited number of features and algorithms, low prediction accuracy was observed. Menghan Chen [10] predicted the prices of smartphones with fewer features. Principal Component Analysis (PCA) and Pearson's correlation were used as two feature reduction techniques. Without any feature reduction, a Multi-Layer Perceptron (MLP) reached 92.84% accuracy. With feature reduction, accuracy was 93.22% for the top 15 features but fell to 34.06% for the top 5. K. Noor and S. Jan [11] predicted automobile prices using multiple linear regression. They projected prices from independent variables such as the vehicle's model, make, city, version, color, mileage, alloy rims, and power steering. Kuo-Kun Tseng et al. [12] worked on predicting e-commerce goods prices using online sentiment analysis.
They developed a price prediction algorithm after analyzing news that had an impact on product prices. Aidin Zehtab-Salmasi et al. [13] developed a multimodal price prediction approach for mobile phone pricing based on specifications. Neural networks are more accurate in estimating a house's price, according to Limsombunchai's research [14]. This study offered strong support for prediction superiority without comparing the forecasting abilities of the hedonic price model and neural networks. A smartphone app for stock prediction was created by Abidatul Izzah et al. using improved multiple linear regression [15]. Their mobile app's forecast accuracy outperformed the traditional method. Al-Dhuraibi et al. predicted whether the price of gold would rise or fall in the future with the help of various ML models. They found that only the K-NN algorithm had an acceptable performance, with an accuracy of 60.26% [16]. Mohapatra et al. predicted the possibility of breast cancer in women using various ML algorithms. They achieved the highest accuracy of 98.7% using the XGBoost model [17].

While existing studies have explored mobile price prediction using machine learning, a notable gap persists in addressing the needs of non-expert consumers struggling to discern the crucial features influencing mobile phone prices. Most research has focused on predicting exact prices, neglecting the practical challenges faced by consumers unfamiliar with machine learning intricacies. Our study addresses this gap by concentrating on establishing a price range rather than exact figures, providing consumers with a more accessible understanding of pricing levels. Additionally, we aim to identify and highlight the key features influencing mobile prices, offering a user-friendly perspective. By doing so, our research contributes to making mobile price prediction more transparent and consumer-centric.

3.
Prediction Models

This section provides an overview of the prediction models used in this study to predict the price range of mobile phones from their features. A basic explanation of each model is given below.

3.1. Decision Tree

A decision tree is a tree-like model in which an input is processed through a series of decisions based on features, leading to a predicted output. Decision trees are easy to understand and interpret, making them particularly valuable in various applications [18]. The decision tree makes predictions by asking a series of questions about the input features and eventually reaching a leaf node that provides the predicted outcome. Decision trees find applications in various fields, aiding in classification and regression tasks [19].

3.2. Logistic Regression

Logistic regression, also referred to as the logit model, is a statistical technique used to determine the probability of an event occurring based on a group of independent variables. This method is particularly helpful for determining the relationship between the target variable and one or more other variables. Logistic regression is often employed when dealing with categorical dependent variables. However, the model may be vulnerable to overfitting when numerous predictor variables are present [20].

3.3. K-Nearest Neighbour (KNN)

The KNN algorithm is considered non-parametric because it makes no assumptions about the underlying data. Instead, it relies on the similarity between existing and new data to categorize new cases. During the training phase, the algorithm simply stores the available data; it then classifies new data or cases based on a similarity measure. The classification of a data point is based on how its neighbours are classified [21].

3.4.
Random Forest

This classifier uses multiple decision trees built on different subsets of the data to improve prediction accuracy. It uses bagging, which involves training many models on distinct subsets of data, as opposed to depending on a single decision tree [22]. The outcome is determined by combining the results of all the models using the majority vote approach [23].

3.5. Support Vector Machine

The SVM model is useful for solving both classification and regression problems [24]. It is widely used in machine learning classification tasks. The major goal of this approach is to locate the decision boundary that best separates points belonging to different classes [25].

4. Experiment

This section outlines the procedure used in the experiment carried out for the investigation. Figure 1 illustrates the essential phases of the procedure.

Figure 1. Outline of the Experiment.

4.1. Data Collection

The data set for this paper was obtained from Kaggle.com [26], and it includes details such as battery life, CPU speed, weight, RAM, and other characteristics of mobile phones. The dataset comprised 2000 instances and 24 attributes of mobile phones. A sample of the distinct values from the data set is shown in Figure 2. The characteristics used in the proposed study are explained below.

Figure 2. Overview of data set with well-defined values

i. Battery Power: The battery's capacity directly affects how long the phone can be used. A battery with a higher capacity holds more energy and functions for a longer period of time. It is expressed in mAh.
ii. Clock Speed: The number of cycles completed by a CPU in a second defines the clock speed.
iii. Dual Sim: It permits the use of two distinct SIM cards in the same device.
iv. Four_g: It indicates the generation of mobile network connectivity.
v. Internal Memory: The amount of data storage that is available on the phone's drive. It is measured in gigabytes.
vi.
Front Camera (FC): It indicates whether or not the smartphone has a front camera. The resolution of the FC is measured in megapixels.
vii. Bluetooth (blue): It indicates whether or not the mobile phone has the Bluetooth feature.
viii. Mobile weight: It stands for the weight of the mobile phone. Nowadays, consumers prefer lighter phones.
ix. Mobile depth: It reflects the thickness of a mobile phone in millimeters.

4.2. Dimensionality Reduction

It gets harder to construct a training set and use it efficiently as the number of features/attributes rises. In the initial stage of dimensionality reduction, attributes with missing values were examined, and missing entries were either replaced with the attribute's mean or handled by removing the affected rows. The dataset initially contained 2000 rows across 21 columns. During pre-processing, it was discovered that the attribute "mobile depth" contained values lower than 0.6 mm, which is not plausible for a phone, so the attribute was removed. Additionally, two entries had "pixel height" values of 0, which was unacceptable, so these rows were removed. A feature selection approach was then used to choose the most relevant features from the initial collection. This involved removing any unimportant, irrelevant, or distracting information [27].

In addition to model accuracy, understanding the importance of individual features in predicting mobile phone prices is crucial. Feature importance analysis provides insights into which characteristics contribute significantly to the pricing model. This analysis can guide manufacturers and consumers in recognizing key attributes that influence the cost of a mobile device. The final dataset was therefore reduced to 1998 rows and 20 columns and was divided into two parts: training data and test data.

Figure 3. Correlation between attributes and price range

Furthermore, correlation was chosen because it makes clear how variables relate to one another, making it straightforward to forecast one variable using data from another.
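The pre-processing steps described in this section can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' code: the column names (`m_dep`, `px_height`, `ram`, `price_range`) are assumed to match the Kaggle file, the 80/20 split ratio is an assumption (the paper does not state one), and the toy records below are made-up values rather than rows from the dataset.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def clean(rows):
    """Drop the 'm_dep' column and every row whose 'px_height' is 0."""
    return [
        {key: value for key, value in row.items() if key != "m_dep"}
        for row in rows
        if row["px_height"] > 0
    ]


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and return (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records with hypothetical values (not taken from the dataset).
rows = [
    {"ram": 512, "px_height": 20, "m_dep": 0.6, "price_range": 0},
    {"ram": 1500, "px_height": 905, "m_dep": 0.7, "price_range": 1},
    {"ram": 2600, "px_height": 0, "m_dep": 0.1, "price_range": 2},
    {"ram": 2048, "px_height": 1263, "m_dep": 0.9, "price_range": 1},
    {"ram": 3000, "px_height": 1216, "m_dep": 0.5, "price_range": 2},
    {"ram": 3900, "px_height": 1208, "m_dep": 0.8, "price_range": 3},
]

cleaned = clean(rows)
ram_corr = pearson(
    [row["ram"] for row in cleaned],
    [row["price_range"] for row in cleaned],
)
train, test = split(cleaned)
print(f"rows kept: {len(cleaned)}, corr(ram, price_range) = {ram_corr:.3f}")
```

On the real data, the same correlation would be computed for every attribute against the price range to produce a ranking like the one summarized in Figure 3.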
In mobile phones, a strong correlation is often observed between features. By using these features as input data, it is possible to draw inferences regarding the target variable. When two variables are strongly correlated, it is easy to forecast one using data from the other [28]. The correlation between camera specifications and price range is shown in Figure 3 and emphasizes the importance consumers place on mobile phone cameras. Higher megapixel counts and advanced features contribute to higher pricing, reflecting the growing significance of photography in consumer choices. In this paper, the highest correlation was found between the following pairs of attributes:
1. pc and fc, which represent the primary and front camera resolution in megapixels, respectively.
2. 3G and 4G, which indicate the network generations supported by the mobile phone.
3. px width and px height, which represent the pixel width and height, respectively.

4.3. Classification

Classification is a machine learning technique that is frequently applied in the context of supervised learning. It uses labeled training data to learn how to assign new observations to specific categories. It classifies unknown items according to what it has learned from the dataset and assigns them to particular classes. A labeled dataset is necessary for classification: the labeled training set is used to fit the model, and the corresponding test set is used to evaluate it. To establish the relationship between actual and predicted values, accuracy score, precision, recall, and F1-score were employed in the study and are explained below.

Precision, recall, and F1-score provide a more detailed evaluation of model performance, especially in multi-class classification scenarios. Precision measures the accuracy of positive class predictions, recall measures the ability to capture positive instances, and F1-score balances both precision and recall.
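To make these metrics concrete for a four-class problem, the sketch below computes per-class precision, recall, and F1-score from one-vs-rest counts and then averages them across classes. Macro averaging is an assumption here, since the paper does not state which averaging scheme was used, and the label vectors are made-up toy data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of (precision, recall, f1) over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


# Toy labels: 0 = low, 1 = medium, 2 = high, 3 = very high.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 3, 2]

acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_average(
    per_class_metrics(y_true, y_pred, labels=[0, 1, 2, 3])
)
print(f"accuracy={acc:.3f} precision={macro_p:.3f} "
      f"recall={macro_r:.3f} f1={macro_f1:.3f}")
```

In this toy example two of the eight phones are misclassified, so accuracy is 0.75, while the macro-averaged precision is higher (about 0.833) because classes 0 and 3 never receive a false positive.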
Accuracy is a metric that measures the correctness of the model's predictions compared to the actual outcomes. It is often used in classification problems, where the goal is to assign a label to each instance from a set of predefined labels.

\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]

where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives.

\[
\text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
\]

These metrics are used to compare the performance of the machine learning models employed in the study. The SVM model achieves the highest precision, recall, and F1-score among the ML algorithms studied in this work, reinforcing its effectiveness for prediction.

4.4. Data Analysis

The analysis of the findings is presented in this section.

4.4.1. Distribution of battery power by price range

Figure 4 illustrates how battery life affects a mobile phone's price range. The price of the mobile phone is represented by the Y-axis, while battery life is displayed on the X-axis. Using the distribution shown, mobile phones can be divided into low, medium, high, and very high categories. Higher battery capacity often leads to higher pricing, aligning with consumer expectations for longer-lasting devices.

4.4.2. Distribution of clock speed by price range

The relationship between a mobile device's price and its clock speed is depicted in Figure 4. The X-axis shows the price, while clock speed is displayed on the Y-axis. It facilitates the classification of the test data set into the low, medium, high, and very high categories. Faster processors generally contribute to higher-priced smartphones, catering to consumers seeking high-performance devices. Similarly, the price range by internal memory (measured in gigabytes) is also computed, as it plays a significant role in pricing. Devices with larger storage capacities are positioned in higher price ranges, addressing the demand for increased data storage.

Figure 4.
Matrix for battery power by price range and clock speed by price range

4.5. Results

After thoroughly examining several machine learning models, this work obtained substantial accuracy rates and produced the corresponding confusion matrices for reference. The outcomes have been compared and evaluated carefully. The attributes in this article are divided into two types, depending on whether their values are categorical or numerical. The labels 0, 1, 2, and 3 denote the price ranges of mobile phones: 0 denotes low-range, 1 denotes medium-range, 2 denotes high-range, and 3 denotes very high-range mobile phones. The price range for mobile devices is shown in Figure 5. Price prediction falls into the low, medium, high, and very high categories when considering all of the features of mobile devices.

Figure 5. Price range of Mobile Phones

The SVM model performed best on this classification problem. SVM may incorrectly classify some examples in the training set, but it aims to create a model that is sufficiently general to provide accurate predictions for new data. Its accuracy was determined from how well the model performed on the test dataset. To construct the RF, the model was first fitted on the training data, with more of the data allocated to training than to testing. The testing data was then used to compare the trained model's predictions with the actual labels. The RF model constructs numerous trees on various sub-samples, choosing the best feature from a random group of features. It combines the trees' outputs to increase prediction accuracy and reduce overfitting. The DT produces a set of rules from the given set of labeled data, which are then used to classify the data. The accuracy of the SVM model was 98 percent, whereas the accuracy of the RF model was 88.8 percent. The decision tree's accuracy was 80.5 percent, compared to 82.6 percent for K-NN and 85.5 percent for LR.
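A comparison of this kind could be set up roughly as follows with scikit-learn. This is a hedged sketch rather than the authors' pipeline: it uses a synthetic four-class dataset from `make_classification` in place of the Kaggle file, and the hyperparameters (linear kernel, 100 trees, 5 neighbours, 80/20 stratified split, feature standardization) are all assumptions, so the accuracies it prints will not match the figures reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the mobile dataset: 20 features, 4 price classes.
X, y = make_classification(
    n_samples=600, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Distance- and margin-based models are wrapped with a scaler so that
# features measured on large scales do not dominate the others.
models = {
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="linear")),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbour": make_pipeline(
        StandardScaler(), KNeighborsClassifier(n_neighbors=5)
    ),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Fit each model on the training split and score it on the held-out split.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in sorted(accuracies.items(), key=lambda kv: -kv[1]):
    print(f"{name:20s} {score:.3f}")
```

Keeping every model behind the same `fit`/`predict` interface makes it straightforward to add precision, recall, and F1-score to the loop alongside accuracy.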
The accuracy, precision, recall, and F1-score for each of the five techniques are presented in Table 1. The accuracy scores of the five techniques examined in the paper are compared in Figure 6. The graph used to display learning progress is referred to as the learning curve. Learning curves illustrate the performance of models as a function of the size of the training dataset. Examining learning curves helps identify underfitting or overfitting issues and provides insights into the model's stability and generalization capabilities.

[Figure 6: bar chart of accuracy score (%) per model: SVM 98, Decision Tree 80.5, Random Forest 88.8, K-Nearest Neighbour 82.6, Logistic Regression 85.5]

Figure 6. Accuracy graph representation of models

Table 1: Comparison of accuracy, precision, recall, and F1-score of the five ML techniques

ML Technique          Accuracy (%)  Precision  Recall  F1-Score
SVM                   98            0.981      0.98    0.982
Decision Tree         80.5          0.804      0.805   0.801
Random Forest         88.8          0.899      0.896   0.89
K-Nearest Neighbour   82.6          0.827      0.821   0.828
Logistic Regression   85.5          0.854      0.859   0.852

The learning curves for all five ML techniques are shown in Figure 7. They show consistent improvement with increasing data size, indicating that the models benefit from larger datasets. SVM maintains a consistently high level of performance throughout, indicating its robustness.

[Figure 7: learning curves for SVM, Decision Tree, Random Forest, K-Nearest Neighbour, and Logistic Regression; horizontal axis ticks from 128 to 1280, vertical axis from 0 to 1]

Figure 7. Learning curve for the five ML techniques

5. Conclusion

The primary component of any marketing strategy is cost forecasting. Finding the right solution with the best specifications at the lowest price is the best marketing strategy. Products can be evaluated based on needs, brand, and other aspects. Data mining and analysis are the most effective ways to specify the price range recommendations of premium goods to a customer.
The comprehensive analysis of ML models for mobile price classification, along with feature importance and performance metrics, provides valuable insights into the dynamics of mobile phone pricing. In this work, a variety of models were trained using mobile features, and the range of mobile prices was predicted with considerable accuracy. With a 98 percent accuracy rate, the SVM model was shown to be the most accurate. The proposed work may be used to anticipate costs for many things, including vintage automobiles, healthcare products, homes, etc. A premium product might be suggested by specifying the price range that the customer can afford.

Future work could explore additional features or refine existing ones to improve model performance further. The dataset's size and diversity play a crucial role in model training. Future research may benefit from larger datasets encompassing a wider range of mobile devices, manufacturers, and geographic regions. Additionally, more sophisticated AI algorithms may be used to forecast a product's actual price. This study can also be extended by implementing a decision matrix and calculating performance scores to rank the mobile devices. A list of mobile devices within the specified price range and with the desired features will assist consumers in making decisions.

References

[1] M. Alloghani, D. Al-Jumeily, J. Mustafina, A. Hussain and A. J. Aljaaf, "A Systematic Review on Supervised and Unsupervised Machine Learning Algorithms for Data Science," in Supervised and Unsupervised Learning for Data Science, Springer, Cham, 2020.
[2] S. K. Das, "Introduction to Mobile Terminals," Mobile Terminal Receiver Design: LTE and LTE-Advanced, Wiley Telecom, 2017, pp. 1-8.
[3] I. Sim, "Mobile Devices and Health," New England Journal of Medicine, vol. 381, no. 10, pp. 959-968, 2019.
[4] K. Mayuri, "Importance of Pricing," [Online].
Available: https://www.economicsdiscussion.net/marketing-management/pricing/importance-of-pricing/31838.
[5] M. Alam and I. R. Khan, "Application of AI in smart cities," in Industrial Transformation, Taylor & Francis Group, 2022, pp. 61-86.
[6] C. Janiesch, P. Zschech and K. Heinrich, "Machine learning and deep learning," Electron Markets, vol. 31, pp. 685-695, 2021.
[7] M. Alam, E. R. Khan, A. Alam, F. Siddiqui and S. Tanweer, "The DIABACARE CLOUD: predicting diabetes using machine learning," Acta Scientiarum Technology, vol. 46, no. 1, 2023.
[8] S. Pudaruth, "Predicting the Price of Used Cars using Machine Learning Techniques," International Journal of Information & Computation Technology, vol. 4, no. 7, 2014.
[9] M. Asim and Z. Khan, "Mobile Price Class prediction using Machine Learning Techniques," International Journal of Computer Applications, vol. 179, no. 29, pp. 6-11, 2018.
[10] M. Chen, "Mobile Phone Price Prediction with Feature Reduction," Highlights in Science, Engineering and Technology, vol. 34, pp. 155-162, 2022.
[11] K. Noor and S. Jan, "Vehicle price prediction system using machine learning techniques," International Journal of Computer Applications, vol. 167, no. 9, pp. 27-31, 2017.
[12] K.-K. Tseng, R. F.-Y. Lin, H. Zhou, K. J. Kurniajaya and Q. Li, "Price prediction of e-commerce products through internet sentiment analysis," Electronic Commerce Research, vol. 18, no. 1, pp. 65-88, 2017.
[13] A. Zehtab-Salmasi, A.-R. Feizi-Derakhshi, N. Nikzad-Khasmakhi, M. Asgari-Chenaghlu and S. Nabipour, "Multimodal Price prediction," Annals of Data Science, vol. 10, no. 3, pp. 619-635, 2021.
[14] V. Limsombunchai, C. Gan and M. Lee, "House price prediction: Hedonic price model vs. Artificial Neural Network," American Journal of Applied Sciences, vol. 1, no. 3, pp. 193-201, 2004.
[15] A. Izzah, Y. A. Sari, R. Widyastuti and T. A.
Cinderatama, "Mobile app for stock prediction using Improved Multiple Linear Regression," International Conference on Sustainable Information Engineering and Technology (SIET), Malang, Indonesia, 2017.
[16] W. A. Al-Dhuraibi and J. Ali, "Using classification techniques to predict gold price movement," 4th International Conference on Computer & Technology Applications, Istanbul, Turkey, 2018.
[17] S. K. Mohapatra, A. Jain, Anshika and P. Sahu, "Comparative Approaches by using Machine Learning Algorithms in Breast Cancer Prediction," in 2nd International Conference on Advance Computing and Innovative Technologies in Engineering (ICACITE), Greater Noida, 2022.
[18] B. Charbuty and A. Abdulazeez, "Classification based on Decision Tree Algorithm for Machine Learning," Journal of Applied Science and Technology Trends, vol. 2, no. 1, pp. 20-28, 2021.
[19] F.-J. Yang, "An Extended Idea about Decision Trees," in International Conference on Computational Science and Computational Intelligence (CSCI), Las Vegas, NV, USA, 2019.
[20] C. G. Raju, V. Amudha and S. G, "Comparison of Linear Regression and Logistic Regression Algorithms for Ground Water Level Detection with Improved Accuracy," in Eighth International Conference on Science Technology Engineering and Mathematics, Chennai, India, 2023.
[21] M. Zong, X. Zhu and D. Cheng, "Learning k for kNN Classification," ACM Transactions on Intelligent Systems and Technology, vol. 8, no. 3, pp. 1-19, 2017.
[22] M. Schonlau and R. Y. Zou, "The random forest algorithm for statistical learning," The Stata Journal, vol. 20, no. 1, pp. 3-29, 2020.
[23] A. Sekulic, M. Kilibarda, G. B. M. Heuvelink, M. Nikolic and B. Bajat, "Random Forest Spatial Interpolation," Remote Sensing, vol. 12, no. 10, p. 1687, 2020.
[24] S. Y. Chaganti, I. Nanda, K. R. Oandi, T. Prudvith and N. Kumar, "Image Classification using SVM and CNN," International Conference on Computer Science, Engineering and Applications, Gunupur, India, 2020.
[25] J.
Cervantes, F. Garcia-Lamont, L. Rodriguez-Mazahua and A. Lopez, "A comprehensive survey on support vector machine classification: Applications, challenges and trends," Neurocomputing, vol. 408, pp. 189-215, 2020.
[26] A. Sharma, "Mobile Price Classification," Kaggle, [Online]. Available: https://www.kaggle.com/datasets/iabhishekofficial/mobile-price-classification.
[27] A. Yaicharoen, K. Hashikura, M. A. S. Kamal and I. Murakami, "Effects of Dimensionality Reduction on Classifier Training Time and Quality," 3rd International Symposium on Instrumentation, Control, Artificial Intelligence, and Robotics (ICA-SYMP), Bangkok, Thailand, 2023.
[28] R. Han, Rodriguez-Mayorga and S. Luber, "A Machine Learning Approach for MP2 Correlation Energies and Its Application to Organic Compounds," Journal of Chemical Theory and Computation, vol. 17, no. 2, pp. 777-790, 2021.