Entity Embedding in Artificial Neural Networks: A Novel Approach to Sales Data Analysis and Forecasting

Christina Jayakumaran1, Sengole Merlin2, Vaishali R Kulkarni3,*, Thompson Stephan4 and Punitha S5

1Department of Computer Science and Engineering, Loyola-ICAM College of Engineering and Technology, Chennai, Tamil Nadu, India - 600034
2Strategic Education Inc., Texas, USA.
3Department of Computer Science and Engineering, Graphic Era (Deemed to be University), Dehradun, Uttarakhand, India - 248002
4Department of Computer Science and Engineering, Graphic Era (Deemed to be University), Dehradun, Uttarakhand, India - 248002
5Department of Computer Science and Engineering, Graphic Era (Deemed to be University), Dehradun, Uttarakhand, India - 248002

Abstract
This study addresses sales forecasting, a critical component of strategic business planning that encompasses staff scheduling, inventory management, and supply chain optimization. The research investigates the efficacy of advanced predictive analytics in sales forecasting, leveraging historical and current data to uncover patterns that guide business decision-making. The focus is primarily on the application and comparative analysis of two algorithms: eXtreme gradient boost (XGBoost) and deep neural networks (DNNs). These methods are explored for their potential to enhance forecasting accuracy using sales data. A notable aspect of this research is the exploration of entity embedding within the artificial neural network (ANN), highlighting its relevance and application in the context of sales data analysis. This approach aims to offer insights into the most effective predictive models for sales forecasting, contributing to the broader field of predictive analytics in business.

Keywords
ANN, Entity Embedding, Sales Forecasting, Time Series Analysis, XGBoost

Symposium on Computing & Intelligent Systems (SCI), May 10, 2024, New Delhi, INDIA
* Corresponding author.
christina@licet.ac.in (C. Jayakumaran); sengolemerlin@gmail.com (S. Merlin); vaishali@ieee.org (V. R. Kulkarni); thompsoncse@gmail.com (T. Stephan); punitharesearch@gmail.com (P. S)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Introduction

In the rapidly evolving business landscape, the ability to forecast sales accurately has become a cornerstone of strategic planning and operational efficiency. Sales forecasting, a critical component of supply chain management, significantly impacts areas such as inventory management, staff scheduling, and supply chain optimization. The accuracy of sales forecasts directly influences a company's ability to make informed decisions, manage resources effectively, and maintain a competitive edge in the market [1].

Despite its importance, traditional sales forecasting methods often fall short in today's dynamic and complex market environments. These methods, typically grounded in statistical analysis of historical data, struggle to adapt to the nonlinear and evolving patterns of consumer behavior and market trends. Their limitations in handling large and varied datasets underscore the need for more advanced forecasting techniques.

In response to these challenges, this paper introduces machine learning as a means of enhancing the accuracy and dynamism of sales forecasts. Machine learning, with its capability to process and learn from vast amounts of data, presents an opportunity to develop more robust and adaptable forecasting models. Among the various machine learning algorithms, XGBoost and DNNs are notable for their efficacy in predictive modeling tasks. This research aims to explore and evaluate the application of these two algorithms in the context of sales data analysis, offering a comparative analysis of their performance and suitability.

A novel aspect of this study is the incorporation of entity embedding within the ANN for sales data analysis. Entity embedding, a technique for transforming categorical variables into numerical form, is particularly relevant in sales forecasting, where data often comprises a mix of categorical and numerical information. By integrating entity embedding, the research aims to enhance the predictive capability of ANNs, enabling them to handle the intricacies of sales data more effectively.

The primary objective of this paper is to present a comprehensive analysis of XGBoost and DNNs in sales forecasting, highlighting the benefits and limitations of each approach. Additionally, the paper explores the application of entity embedding within ANNs, aiming to contribute to the broader field of predictive analytics in business. The paper comprises an examination of existing sales forecasting techniques, the methodology for implementing the machine learning models, and a comparative analysis of the performance of XGBoost and DNNs, culminating in a discussion of the implications and potential applications of entity embedding in sales data analysis [2]. Through this research, we aim to provide insights and methodologies for businesses seeking to improve their sales forecasting capabilities in the face of rapidly changing market dynamics.

2. Literature Review

The use of machine learning (ML) techniques in predictive analytics for businesses has been widely studied. For instance, a machine learning model and rule engine can be used to predict sales from historical data, and the forecasts can be fed to a front-end application for processing [3]. This application shows the predicted data for the following week.
The XGBoost sales prediction model created for the Walmart dataset was used as a reference point for the model presented in this paper [4]. That study outlines the advantage of using an XGBoost model: its error scores are 16.3% and 15.4% lower than those of linear regression and ridge regression models, respectively. In [5], the XGBoost algorithm is applied to sales forecasting; its insights on feature engineering through XGBoost feature importance ranking have been adopted in the proposed model. XGBoost is frequently cited as a highly effective ML model that can be applied to many problems. However, the key to developing an efficient model is to choose the right features beforehand, through data analysis and feature engineering [6]. In recent years, XGBoost has been widely used in forecasting because it performs comparatively better than classic regression models. In other words, XGBoost outperforms the existing models, achieving better results in a shorter time [7].

ANNs also perform well in comparison with traditional models such as linear regression and support vector machines. When applied to Walmart sales data, ANNs were found to have the lowest RMSE scores among these models [8]. In fast-paced industries, quick and reliable sales forecasting models are an invaluable resource. One such example is the fashion and retail industry, where sales predictions using NN models have been shown to perform well, with low root mean square percentage error (RMSPE) and mean square error (MSE) scores [9]. Another essential practice in the retail industry is promotional sales, which can be cold-start forecasted using gradient boosting algorithms [10].

3. Existing Systems

Most companies use two types of sales forecasting: quantitative and qualitative.
Quantitative sales forecasting requires numerical data; commonly used data includes consumer spending and economic trends. In its simplest form, quantitative sales forecasting uses linear equations to calculate predicted values, as given in (1).

$Y = a_0 + a_1 X_1 + a_2 X_2 + \dots + a_n X_n$    (1)

Figure 1: Proposed Architecture

Table 1: Distribution of Stores over Assortments

Store Type      Level 'a'   Level 'b'   Level 'c'   Total Stores
Store Type A    381         0           221         602
Store Type B    7           9           1           17
Store Type C    77          0           71          148
Store Type D    128         0           220         348
Total Stores    593         9           513         1115

As more data is collected, forecasting models become more complex, so generic linear models no longer produce reliable forecast values. Quantitative models are also not useful for businesses with no or limited historical data. These models perform much better with data from established businesses.

Qualitative sales forecasting is used in a wider range of applications than quantitative sales forecasting. Its methods are classified as the jury of executive opinion, the Delphi technique, the sales force composite, and the survey of buyer intentions.

4. Proposed Architecture

For the proposed system, sales data was collected from 1115 Rossmann stores in Germany. This data was wrangled and analyzed. The system predicts the sales of each store in the company for the next 6 weeks. Economic features cover the activities involved in promoting, distributing, and consuming products and services. The forecast is based on these economic features and also covers several visual and display options, referred to as temporal features. The other parameters include promotion policies and the steps taken in promoting a product, the season of the sale (including school holidays), store locations and competing stores, the different times of the year, etc.
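The calendar-related parameters listed above (time of year, school holidays, promotions) are typically turned into numeric columns before modeling. The following is a minimal sketch of that step in plain Python. The function names and the feature names (`week_of_year`, `promo`, `school_holiday`, and so on) are hypothetical and chosen for illustration; they are not taken from the paper's implementation.

```python
from datetime import date


def temporal_features(day: date) -> dict:
    """Derive simple calendar features from a sale date (illustrative names)."""
    _iso_year, iso_week, iso_weekday = day.isocalendar()
    return {
        "year": day.year,
        "month": day.month,
        "day_of_month": day.day,
        "week_of_year": iso_week,
        "day_of_week": iso_weekday,      # ISO convention: 1 = Monday, 7 = Sunday
        "is_weekend": iso_weekday >= 6,
    }


def build_row(day: date, promo: bool, school_holiday: bool) -> dict:
    """Combine calendar features with promotion and holiday flags."""
    row = temporal_features(day)
    row["promo"] = int(promo)                    # assumed 0/1 encoding
    row["school_holiday"] = int(school_holiday)  # assumed 0/1 encoding
    return row


# Hypothetical example record, not a row from the dataset.
example = build_row(date(2015, 7, 31), promo=True, school_holiday=True)
```

Keeping the derivation in one function makes it straightforward to apply the same transformation to both training and test records.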
The overall proposed architecture for sales data forecasting and analysis is depicted in Figure 1. Commonsense reasoning and analytic knowledge methods are used for building the model. These two methods are applied during the analysis of the data and help in reaching a definitive conclusion. Two machine learning models are developed: an XGBoost model specific to the Rossmann company, and a neural network that is generic to all companies for their sales forecasting. The distribution of stores over assortments is given in Table 1.

Figure 2: Distribution of Store Models and Average Sales

Figure 3: Day of the Week Versus Sales in each store

5. Implementation

5.1. Data

The data used here was collected from Kaggle, a popular website for data science projects. The Rossmann drugstore dataset, which was published for a sales prediction competition, is used. Google Trends data, weather data, and state (store location) data for each Rossmann store on each day were used as external factors to predict the sales. The training data contains 1,017,210 records and the testing data contains 41,089 records.

Exploratory Data Analysis (EDA) revealed some significant correlations. Figure 2 shows the distribution of the store types and its effect on the sales value. Figure 3 depicts the day of the week and its correlation with the sales value for each store type. The number of sales and the number of customers have a strong positive correlation, which means that more customers lead to better sales. This, however, was an expected trend. If the store offers a promotion (Promo 1), the number of customers increases, which leads to more sales. Conversely, if the store offers a consecutive promotion (Promo 2), the number of customers remains the same or even decreases slightly. The Pearson's correlation matrix of the attributes is depicted in Figure 4.
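To make the correlation analysis concrete, the sketch below computes a Pearson correlation matrix in plain Python. The column names mirror attributes mentioned in the text, but the values are synthetic and chosen only for illustration; they are not taken from the Rossmann data, and the helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of named columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Synthetic toy data (assumption: not real store records).
toy = {
    "Customers": [10, 20, 30, 40, 50],
    "Sales": [100, 210, 290, 405, 500],
    "Promo": [0, 1, 0, 1, 1],
}
matrix = correlation_matrix(toy)
```

In this toy data the `Sales` and `Customers` columns move together almost linearly, so their coefficient is close to 1, which is the kind of strong positive relationship the EDA describes.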
The weekly and yearly sales trends are depicted in Figure 5 and Figure 6, respectively. The following points give a general overview of the data as a whole:

• The most selling and crowded Store Type is A.
• Store Type D has the best Sale_Per_Customer, which indicates larger buyer carts. We could also assume that stores of this type are in rural areas, so customers prefer buying more but less often.
• The low Sale_Per_Customer amount for Store Type B suggests that people shop there mainly for small things. This could also indicate that this store type is "urban": it is more accessible to the public, and customers don't mind shopping there from time to time during the week.
• Customers tend to buy more on Mondays when there is one promotion running (Promo) and on Sundays when there is no promotion at all.
• Promo2 alone does not seem to be correlated with any significant change in the Sales amount.

Figure 4: Pearson's Correlation Matrix

Figure 5: Weekly Status

Figure 6: Yearly Status

Figure 7: Autocorrelation and Partial Correlation of Each Store

5.2. Time Series Analysis of Data

The time series analysis is carried out per store type rather than per individual store. Overall sales seem to increase, but not for Store Type C (third from the top). Even though Store Type A is the most selling store type in the dataset, it could follow the same decreasing trajectory as Store Type C. Non-randomness of the time series and a high lag-1 autocorrelation are common to each pair of plots. These series are therefore likely to need a higher order of differencing d/D.

Type A and type B: Both types show seasonality at certain lags. For type A, it occurs at every 12th observation, with positive spikes at the 12 (s) and 24 (2s) lags and so on. For type B, it is a weekly trend with positive spikes at the 7 (s), 14 (2s), 21 (3s) and 28 (4s) lags.

Type C and type D: Plots of these two types are more complex.
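The lag-based seasonality described above can be checked with a sample autocorrelation function. The sketch below is a plain-Python illustration on a synthetic series with a weekly pattern; the series values and the function name are assumptions made for this example and do not come from the paper's data or code.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given non-negative lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    denom = sum((value - mean) ** 2 for value in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Synthetic daily series: the same 7-day pattern repeated for 8 weeks,
# with the store closed (zero sales) on the last day of each week.
weekly = [5, 6, 6, 7, 8, 9, 0] * 8

# Autocorrelation at lags 0..14; a weekly pattern shows spikes at 7 and 14.
acf = [autocorrelation(weekly, k) for k in range(15)]
```

For a series with a period of s observations, the positive spikes appear at lags s, 2s, 3s and so on, and they shrink gradually because fewer overlapping pairs remain at larger lags.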
The autocorrelation and partial autocorrelation of each store type are shown in Figure 7. Each observation appears to be correlated with its adjacent observations.

5.3. XGBoost

Boosting can be viewed as an optimization method. Gradient boosting is a machine learning algorithm that is frequently used in regression and classification. It produces prediction models, mostly in the form of decision trees, in which weak learners are combined to create one strong learner. XGBoost, in particular, is used for supervised learning tasks. XGBoost uses a regularization term to manage the complexity of the model, which helps prevent overfitting.

XGBoost is a popular tree ensemble model that consists of a set of classification and regression trees (CART). Each tree assigns its inputs to distinct leaves, and each leaf is allotted a score value. A CART therefore carries a real-valued score in addition to the decision value, which allows a richer interpretation than a classification alone. XGBoost is optimized for boosted tree algorithms. Figure 8 shows (a) the order of importance of each feature and (b) the RMSPE values over the training iterations.

XGBoost provides a better framework than plain gradient boosting and is faster than existing gradient boosting algorithms. It also includes a linear model solver and supports various objective functions for regression, classification, and ranking. The objective function used is given in (2).

$Ob(\theta) = L(\theta) + \Omega(\theta)$    (2)

where $L(\theta)$ is the training loss and $\Omega(\theta)$ is the regularization term.

Figure 8: XGBoost Details. (a) Feature Importance based on XGBoost. (b) RMSPE - During the Training Phase of the XGBoost.

5.4. Artificial Neural Network

In a neural network, the inputs are multiplied by their corresponding weights and then summed together. This is depicted in Figure 9.
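The weighted sum at the heart of a single neuron can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the input values, and the weights are assumptions, and the activation is passed in as a parameter, with sigmoid and ReLU included as two common choices.

```python
import math


def weighted_sum(inputs, weights, bias=0.0):
    """Multiply each input by its weight, sum the products, and add the bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, inputs))


def sigmoid(z):
    """Squash z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Return z for positive z, otherwise 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """Output of one neuron: activation applied to the weighted sum."""
    return activation(weighted_sum(inputs, weights, bias))


# Hypothetical inputs and weights, chosen only to show the arithmetic:
# 0.1 + (0.5 * 1.0) + (-1.0 * 2.0) + (2.0 * 3.0) = 4.6
z = weighted_sum([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=0.1)
out_relu = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.1, relu)
out_sigmoid = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], 0.1, sigmoid)
```

A full network stacks many such neurons into layers, and training adjusts the weights and biases rather than fixing them by hand as done here.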
The sum is passed through an activation function, which determines which neurons are activated within the network. The corresponding output is a single value. The output function $Y$ is represented in (3).

$Y = F(w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n)$    (3)

where $x_i$ are the inputs, $w_i$ are the corresponding weights, $w_0$ is the bias term, and $F$ is the activation function.

The activation functions considered are the sigmoid and ReLU functions. The sigmoid function maps values to the range between 0 and 1 and is frequently used in machine learning models that work with probabilities. This function is both monotonic and differentiable. The Rectified Linear Unit, or ReLU, is a widely used activation function. In ReLU, the gradient is positive for any positive input value, but there is no gradient (gradient = 0) for negative values. ReLU ranges from 0 to infinity. Both the function and its derivative are monotonic.

Figure 9: ANN

Figure 10: Total Sales for Store ID = 1

5.5. Entity Embedding

Entity embedding maps categorical variables into Euclidean spaces. The neural network learns this mapping during the training phase. Compared with one-hot encoding, embedding reduces memory usage and increases the speed of neural networks. Similar values are mapped close to each other in the embedding space, which reveals the intrinsic properties of the categorical variables. When a dataset has many high-cardinality features, other methods tend to overfit and cannot be used. In this study, the data sparsity problem is overcome by representing the discrete category features in a continuous space. Category similarity is reflected in the distance between category points. The idea is to use nearby data points to approximate missing data points.

In Figure 10 and Figure 11, the focus is on the overall sales performance of a single store (Store ID = 1). It can be seen that Saturdays are the most profitable days of the week.
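The mapping of categorical variables into a continuous space, as described in this section, can be sketched as a lookup table with one vector per category. The plain-Python example below shows only the lookup and how embedded categories are concatenated with numeric inputs; the class name, the embedding sizes, and the feature choices are hypothetical. In an actual network the table entries are weights updated by backpropagation, whereas here they are only randomly initialized.

```python
import random


class EntityEmbedding:
    """Lookup table that maps each category to a dense vector."""

    def __init__(self, categories, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.index = {category: i for i, category in enumerate(categories)}
        # Small random initial values; training would adjust these weights.
        self.table = [
            [rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in categories
        ]

    def lookup(self, category):
        """Return the embedding vector for one category."""
        return self.table[self.index[category]]


def encode(store_type, day_of_week, numeric_features, embeddings):
    """Concatenate embedded categorical inputs with numeric inputs."""
    vector = []
    vector.extend(embeddings["store_type"].lookup(store_type))
    vector.extend(embeddings["day_of_week"].lookup(day_of_week))
    vector.extend(numeric_features)
    return vector


# Assumed embedding sizes: 2 dimensions for store type, 3 for day of week.
embeddings = {
    "store_type": EntityEmbedding(["a", "b", "c", "d"], dim=2, seed=1),
    "day_of_week": EntityEmbedding(list(range(1, 8)), dim=3, seed=2),
}

# One input row: store type 'a', Saturday (6), and one numeric flag (promo = 1).
row = encode("a", 6, [1.0], embeddings)
```

With one-hot encoding the same two variables would need 4 + 7 = 11 input columns, while the embedded representation here uses 2 + 3 = 5, and after training the distances between vectors can reflect how similar the categories are.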
Figure 10 highlights sales optimization at the granular level (in a single store, on a particular day). Attention is also given to how each store type performs (Figure 11): Store Type A has the highest share of overall sales, with an overwhelming 53.9%. The performance measured by the RMSPE metric and by running time is given in Table 2.

Table 2: Performance Evaluation

Algorithm               RMSPE       Time (seconds)
XGBoost                 0.094663    541.53 (9 mins)
Deep Neural Networks    0.1015      1839.65 (30 mins)

Figure 11: Contribution to Overall Sales By Store Type

6. Conclusion

Two machine learning models, namely XGBoost and an entity embedding neural network, are used for sales forecasting in stores. The task involved predicting the sales on any given day at any store. Previous work in the domain, including time series algorithms in machine learning, was studied and adapted. Patterns and outliers were identified through analysis of the data, and this analysis improved the prediction models. XGBoost performed best at prediction, slightly better than the neural network. However, a neural network is suggested for forecasting the sales of companies whose sales trend deviates from that of Rossmann, whose data served as the experimental dataset in this project. The major parameter used for fitting is the overall prediction error rather than the specific decomposition of error into bias and variance. The uncorrelated sales responses of the various stores are reported using RMSPE.

Acknowledgement

Authors acknowledge the support received from Graphic Era Deemed to be University, Dehradun, India.

References

[1] A. Ahlemeyer-Stubbe, S. Coleman, A Practical Guide to Data Mining for Business and Industry, 1st ed., Wiley Publishing, 2014.
[2] M. Seyedan, F. Mafakheri, Predictive big data analytics for supply chain demand forecasting: Methods, applications, and research opportunities, Journal of Big Data (2020).
[3] M. A. Khan, S. Saqib, T. Alyas, A. Ur Rehman, Y. Saeed, A. Zeb, M. Zareei, E. M. Mohamed, Effective demand forecasting model using business intelligence empowered with machine learning, IEEE Access 8 (2020) 116013–116023.
[4] X. Dairu, Z. Shilong, Machine learning model for sales forecasting by using xgboost, in: 2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE), 2021, pp. 480–483.
[5] Y. Niu, Walmart sales forecasting using xgboost algorithm and feature engineering, in: 2020 International Conference on Big Data & Artificial Intelligence & Software Engineering (ICBASE), 2020, pp. 458–461.
[6] S. Ghosh, C. Banerjee, A predictive analysis model of customer purchase behavior using modified random forest algorithm in cloud environment, in: 2020 IEEE 1st International Conference for Convergence in Engineering (ICCE), 2020, pp. 239–244.
[7] R. P, S. M, Predictive analysis for big mart sales using machine learning algorithms, in: 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), 2021, pp. 1416–1421.
[8] J. Chen, W. Koju, S. Xu, Z. Liu, Sales forecasting using deep neural network and shap techniques, in: 2021 IEEE 2nd International Conference on Big Data, Artificial Intelligence and Internet of Things Engineering (ICBAIE), 2021, pp. 135–138.
[9] C. Giri, Y. Chen, Deep learning for demand forecasting in the fashion and apparel retail industry, Forecasting 4 (2022) 565–581.
[10] C. Aguilar-Palacios, S. Muñoz-Romero, J. L. Rojo-Álvarez, Cold-start promotional sales forecasting through gradient boosted-based contrastive explanations, IEEE Access 8 (2020) 137574–137586.