Random Forest and XGBoost Based Fingerprinting Using MMSE: An Approach to Data-Centric AI to Enhance Indoor Wi-Fi Localization Systems. Mariame Niang 1, Philippe Canalda2, François Spies2, Massa Ndong 3, Ibra Dioum 4, Idy Diop4, and Mohamed A.El Ghany 5 1 University Cheikh Anta Diop of Dakar, 5005, Dakar, Senegal 2 Department of FEMTO-ST Institute/UMR CNRS 6174 Montbeliard, France 3 University Virtual of Senegal, Dakar, Senegal 4 Higher Polytechnic School Cheikh Anta Diop University of Dakar, 5005, Dakar, Senegal 5 German University in Cairo, 3611, Cairo, Egypt Abstract The indoor localization problem consists in identifying the Cartesian coordinates of an object or a personal asset in the buildings, malls, hospitals, campuses, factories, etc. To solve this problem, we consider a Wi-Fi-based localization method called fingerprinting, a two-step process, where a radio map of the monitored area is constructed by collecting signal strength from known locations. An unknown location is then predicted using this radio map as a reference. In this paper, we first propose an adapted Random Forest (RF) and Extreme Gradient Boosting (XGB) algorithms. This adaptation, combined with Minimum Mean Square Error (MMSE), improves the accuracy problem caused by the change of environment and extends the concept by adding a signal processing functionality as an edge cloud feature to address a dynamic cooperation clustering. By embedding the Wi-Fi Access Point (WAP) with multiple antennas, the signals sent by the Mobile User Equipment (MUE) can be processed to improve the accuracy of the bootstrap. Adding Minimum Mean Square Error (MMSE) is a kind of data- centric approach because it yields high-quality data as input. The noise inherent in the location data is reduced and thus the performance of the MMSE-aided RF and XGB improved. This enhancement is further extended by sharing data between WAPS. Thus, the MMSE processing and the sharing of such processed data between WAPS enhance the positioning model performance. The performance of these methods is evaluated through robust and extensive experiments in real-time indoor areas, with regular and reproducible scenarios. We found an interesting outcome that the proposed approach can offer better time-2-market compared to the traditional, non-Machine-Learning-based indoor positioning system approach. Keywords 1 Indoor Positioning, Wi-Fi signals, Fingerprinting approach, Machine Learning, Extreme Gradient Boosting (XGB), Random Forest (RF), Received Signal Strength Indicator (RSSI), Data-Centric Artificial Intelligence, Minimum Mean Square Error (MMSE). 1. Introduction The rapid growth of the Internet of Things (IoT), resulted in a wide range of services, including Location Based Services (LBS). Generally, localization refers to the process of obtaining the same region or the geographical location of a user or a device. Enabling accurate location-based services depends on the availability of location information. Localization systems can be categorized into IPIN 2022 WiP Proceedings, September 5 - 7, Beijing, China EMAIL: mariame.niang@gmail.com (M. Niang); philippe.canalda@femto-st.fr (P. Canalda); francois.spies@univ-fcomte.fr (F. Spies); massandong@mail.com (M. Ndong); ibra.dioum@esp.sn (I. Dioum); idy.diop@esp.sn (I. Diop); moh_salim@hotmail.com (M. A. El Ghany) ORCID:0000-0003-2577-1437 (M. Niang); 0000-0002-6477-3673 (P. Canalda); 0000-0002-9964-2745 (F. Spies); 0000-0001-5773-7589 (M. Ndong); 0000-0002-2586-3908(I. Dioum); 0000-0002-9143-196X (I. Diop); 0000-0002-6282-773 (M. A. El Ghany). ©️ 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) outdoor localization and indoor localization. The Global Positioning System (GPS) is the main technology used to determine the position in outdoor localization. However, its accuracy deteriorates in the indoor environment due to the poor penetration of GPS signals inside buildings, a lot of power consumption, and the multipath effects on the propagating signals [1]. There is an urgent need to address precise indoor localization. Nowadays, indoor localization is highly used in our daily life. It is used in tracking the location inside a building, malls, hospitals, campuses, factories, etc. Several techniques are employed for localization parameter measurements, including Time of Arrival (ToA) [2], Time Difference of Arrival (TDoA) [3], Received Signal Strength Indicator RSSI [4], Angle of Arrival (AoA), and Time of Flight (ToF) [5]. These approaches suffer from many challenges, including poor accuracy, high computational complexity, multipath effect, shadowing, fading, and delay distortion. The fingerprints method achieves great attention recently due to its promising results with various ways of making predictions. In fingerprinting, a database is first built with data collected from a thorough measurement of the field in the offline stage. Then, the position of a mobile user can be estimated by comparing the newly received test data with that in the database, the online phase. Besides, Wi-Fi fingerprinting localization is one of the methods based on RSSI [6,7,8,9,10], Euclidean distance [11], based on RSSI ranging [12], trilateration [13], etc. Compared to other indoor localization methods, Wi-Fi fingerprinting localization technology has some advantages including low hardware requirements and wide scope of application. At the same time, the technology needs to cooperate with more advanced algorithms to ensure higher positioning precision [14]. However, indoor localization using Wireless Local Area Network (WLAN) fingerprinting faces several challenges including propagation effects, which degrades the localization accuracy [15]. The rest of the paper is organized as follows. Section 2 gives a brief about the state of the art. Section 3 presents our proposed localization methods. Section 4 presents the localization performance of the algorithm in different ways, and section 5 the conclusion of the work. 2. State of the art of the previous works With the rapid growth in Machine Learning (ML) systems, similar approaches need to be developed in the context of ML engineering, which handles the unique complexities of the practical applications of ML. This is the domain of MLOps. It is a set of standardized processes and technology capabilities for building, deploying, and operationalizing ML systems rapidly and reliably. In recent years, ML algorithms such as K-Nearest Neighbor (KNN) [16], Random Forest (RF) [17], XGB [18,19], Support Vector Machine (SVM) [ 20,21], KNN, a rules-based classifier (JRip), Decision Tree (DT), RF, and SVM [22], KNN, WKNN [23], RF, and XGB [24] have been applied to the RSSI fingerprinting positioning technique and have achieved better location results. When the structure and layout of the indoor environment change, the indoor wireless communication environment also changes, which leads to a large gap between the new environment and the established positioning fingerprint. However, the establishment process of the fingerprint is very time-consuming and laborious. It is not economical or realistic to update all positioning fingerprints regularly and frequently, which will greatly improve the maintenance cost of the RSSI location fingerprinting system. Several methods to reduce the inaccuracies in location measurements are proposed in the literature [25]. There is no regular test in the work we have seen. In our previous work [24], reproducing these tests can bias the experiments. To assess the bias of machine learning methods, carrying out more regular and reproducible tests will make it possible to resolve these questions. It is possible to improve the position system performance by using fingerprint techniques that employ multipath information in an ML framework, which operates a dataset generated in real-time using MMSE. In this work, we consider the RSSI between the transmitter and the receiver as the localization attribute. This is because the RSSI-based approach poses minimum requirements on the Wi-Fi technology of the requisite modules. RF and XGB algorithms combined with MMSE are proposed to minimize both the measurement noise and resolve the accuracy problem caused by the change of environment for indoor localization tasks. The method first uses RF and XGB algorithms to establish an indoor positioning model, which can achieve indoor positioning. When the environment changes, a further MMSE method is used to improve the initial positioning. However, Data-centric approaches to solving AI problems have been dominant in applications where large and high-quality datasets are available. Such approaches aim to improve model performance through the development of more complex architectures. 3. Experiments The fingerprint map is built where it contains the data points covering the whole area to be used by the algorithms to predict the position. Each data point has the RSSI values from four fixed APs and their position. The whole area is 9.5 m x 9.25 m, as shown in Figure 1. A point was taken every 0.2 m from the x-axis and every 0.5 m from the y-axis starting from the origin unless there were obstacles like walls or furniture that prevented taking the point. This approach for the fingerprint map resulted in having 700 data points covering the whole area. Our approach was to increase the number of data points and decrease the spacing between them to increase the accuracy in predicting the location. We have as input a list of 700 points. For each measurement point, we have 20 RSSI values then we calculate the mean of the 20 points as RSSI. (m). However, the RSSI values are very fluctuating so the mean is not enough to characterize the precision. To improve accuracy, the mean (m) and the MMSE are combined. We performed a point density analysis for the different scenarios. For this, we carried out different scenarios depending on the size of the training and testing. First, we divided our data at 10 %, we have 70 for training and 630 testing points evenly distributed along with x coordinates at 0.2 m doing 1 of 2 along x and by doing 1 out of 5 according to the y coordinates at 0.5 m to respect the pitch homogeneously, that is to say, take the diagonal. At 33 %, we divided our database by 3 by doing 1 out of 3 along x and 1 out of 1 along y which gives 233 for training and 467 testing points respecting the step between the coordinates x and y. At 66 %, we used 2/3 of our database, i.e.467 for training and 233 testing points. At 80 %, we divided our base by 4/5 using the fourth points for training and fifth points for testing resulting in 560 for training and 140 testing points. Then, we added a random positioning algorithm as a reference algorithm to compare the quality of our proposal compared to the random one. For this, we took a random point among the 700 and we calculate the distance of this point from real coordinates which gives us a distance of 7.5 m. We also used the midpoint algorithm, another benchmark algorithm. The midpoint is the central point which corresponds to the 350 points of our database and we calculate the distance from this point to the 699 remaining points then we calculate the average. We found a distance of 3.5 m for the midpoint. Finally, we calculated the Confidence Interval (IC) for each test point, a statistical result by calculating the mean and the standard deviation. For this, we give a confidence interval on these values. We used the following formula to calculate the IC. If X is a random variable defined on Ω of unknown expectation m and standard deviation б and if 𝑥̅ is the mean of the values observed on a sample of size n, IC at the confidence threshold α for the parameter m is: 𝜎 𝜎 𝛼+1 𝐼𝛼 = [𝑥̅ - t , 𝑥̅ + t ] where 𝜋(𝑡) = , (1) √𝑛 √𝑛 2 In MLOps, the model training lets efficiently and cost-effectively run powerful algorithms for training RF and XGB with MMSE models. Model training should be able to scale with the size of both the models and the datasets that are used for training. The testing model capability lets us understand how newly trained models perform. It enhances the reliability of our ML releases by helping to decide whether to reject poorly performing models and promote well performing ones. In the process of serving predictions, once our model is deployed to an indoor environment, the model service starts accepting prediction requests and providing responses with predictions.The testing data is used to evaluate the predictions generated by the ML model. The predicted locations will be compared to the actual positions of the test points able to evaluate the performance of different algorithms. Figure 1: Area of indoor localization test:700 Point of Reference (PoR)/ Point of Test (PoT), real indoor evaluation room based on ML with various ratios (for example here 75 % training and 25 % testing with fairly regular topology). 3.1. Hardware The offline phase is divided into different parts. Firstly, the RSSI reading was taken using an android app called Wi-Fi Fingerprint installed on HTC One X9. This RSSI value can be fluctuated due to the shadowing effect. Adding MMSE an approach of data-centric AI at each WAP mitigate the effect of environmental variation by reducing the noise in the data. This new fingerprint map was saved in an excel sheet CSV file to be used by the algorithm and sent to Python. Secondly, in the online phase, a Wi-Fi module ESP can read the values from APs and send this value to Firebase. Firebase database is specifically used because it is easy to be integrated with the Wi-Fi module and has also a library defined in Python making it easy to deal with the Firebase [26] database. Finally, Python IDE ‘Spyder’was used to access the data in the excel sheet. The dataset is divided into training and testing. The training data is used to train the machine learning model to predict the position and the testing data is used to evaluate the predictions generated by the machine learning model, as shown in Figure 2. Figure 2: Steps of Fingerprint positioning using machine learning. 3.2. MMSE Estimation A variety of speech enhancement approaches have been proposed. They differ in the statistical model, distortion measure, and in the manner in which the signal estimators are being implemented. Perhaps the simplest scenario is obtained when the signal and noise are assumed statistically independent Gaussian processes, and the MSE distortion measure is used. For this case, the optimal estimator of the clean signal is obtained by the Wiener filter. Since speech signals are not strictly stationary, a sequence of Wiener filters is designed and applied to vectors of the noisy signal. MMSE estimation under Gaussian assumptions leads to linear estimation in the form of Wiener filtering. Noise Reduction using MMSE can be used where the enhancement of noisy speech signals is essentially an estimation problem in which the clean signal is estimated from a given sample function of the noisy signal. The goal is to minimize the expected value of some distortion measure between the clean and estimated signals. For this approach to be successful, a perceptually meaningful distortion measure must be used, and a reliable statistical model for the signal and noise must be specified. At present, the best statistical model for the signal and noise, and the most perceptually meaningful distortion measure, are not known. Due to the shadowing effect which deteriorates the MSE of localization. The MMSE estimation of Wireless Sensor Networks (WSN) is investigated. This MMSE algorithm can be used to locate the coordinates of unknown node values and also minimize location errors. Their simulation results show that the distance variance of distances between reference nodes and unknown nodes increases the MSE of localization [27]. In this paper, to calculate the MMSE, we use the method proposed in [28] by using for APs with their coordinates such as 𝐴𝑃1 (𝑥1 , 𝑦1 ), 𝐴𝑃2 (𝑥2 , 𝑦2 ), 𝐴𝑃3 (𝑥3 , 𝑦3 ), 𝐴𝑃4 (𝑥4 , 𝑦4 ) and M (𝑥, 𝑦) the coordinates of the mobile user (𝑥 − 𝑥1 )2 + (𝑦 − 𝑦1 )2 = 𝑑12 (𝐴) 2 2 (𝑥 − 𝑥2 ) + (𝑦 − 𝑦2 ) = 𝑑2 2 (𝐵) 2 2 2 , (2) (𝑥 − 𝑥3 ) + (𝑦 − 𝑦3 ) = 𝑑3 (𝐶) (𝑥 − 𝑥4 )2 + (𝑦 − 𝑦4 )2 = 𝑑42 (𝐷) After subtraction of the equations (𝐴) 𝑒𝑡 (𝐵) then (𝐶) 𝑒𝑡 (𝐷), we will have the following systems: {𝑥12 − 𝑥22 − 2𝑥 (𝑥1 −𝑥2 ) + 𝑦12 − 𝑦22 − 2𝑦 (𝑦1 − 𝑦2 )= 𝑑12 − 𝑑22 {𝑥22 − 𝑥32 − 2𝑥 (𝑥2 −𝑥3 ) + 𝑦22 − 𝑦32 − 2𝑦 (𝑦2 − 𝑦3 )= 𝑑22 − 𝑑32 This can be written as a linear equation and becomes: 𝑥 𝑥 2 − 𝑥 2 + 𝑦12 − 𝑦22 − 𝑑12 − 𝑑22 2 (𝑥1 − 𝑥2 ) 2y (𝑦1 − 𝑦2 ) bX=a such as b= [𝑦]; a=[ 1 2 2 2 ];X=[ ] 𝑥2 − 𝑥3 + 𝑦22 − 𝑦32 − 𝑑22 − 𝑑32 2 (𝑥2 − 𝑥3 ) 2 (𝑦2 − 𝑦3 ) 𝑥 2 (𝑥1 − 𝑥2 ) 2 (𝑦1 − 𝑦2 ) −1 𝑥12 − 𝑥22 + 𝑦12 − 𝑦22 − 𝑑12 + 𝑑22 [𝑦]=[ ] [ 2 ] ,(3) 2 (𝑥2 − 𝑥3 ) 2 (𝑦2 − 𝑦3 ) 𝑥2 − 𝑥32 + 𝑦22 − 𝑦32 − 𝑑22 + 𝑑32 Distance measurements can be disturbed by noise or obstacles, which makes distances instead, distances are used with measurement errors and the equation becomes: 𝑑̂𝑖 = √( 𝑥𝑖 − 𝑥̂)2 + ( 𝑦𝑖 − 𝑦̂)2 , (4) for i=1,.,n. n is the number of AP. The Squaring and rearranging these terms yields the following equation for each access point measurement (𝑥̂ − 𝑥1 )2 + (𝑦̂ − 𝑦1 )2 = 𝑑̂2 (1) 1 (𝑥̂ − 𝑥2 )2 + (𝑦̂ − 𝑦2 )2 = 𝑑 ̂2 (2) 2 (𝑥̂ − 𝑥3 )2 + (𝑦̂ − 𝑦3 )2 = 𝑑̂2 (3) ,(5) 3 ̂2 (4) (𝑥̂ − 𝑥4 )2 + (𝑦̂ − 𝑦4 )2 = 𝑑 4 𝑥̂ 2 (𝑥1 − 𝑥2 ) 2 (𝑦1 − 𝑦2 ) −1 𝑥 − 𝑥2 + 𝑦1 − 𝑦22 − 𝑑 2 2 2 ̂2 + 𝑑 ̂2 [ ]=[ ] [ 1 1 2 ], (6) 𝑦̂ 2 (𝑥2 − 𝑥3 ) 2 (𝑦2 − 𝑦3 ) 2 2 2 2 𝑥 − 𝑥 + 𝑦 − 𝑦 −𝑑 + 𝑑 ̂2 ̂2 2 3 2 3 2 3 The difference between equations (6) and (3) gives: ̂2 − 𝑑2 ) + (𝑑2 − 𝑑̂ 2 (𝑥1 − 𝑥2 ) 2 (𝑦1 − 𝑦2 ) −1 ( 𝑑 2 𝑥̂ − 𝑥 1 2 2) [ ]= [ ] [ 1 ] , (7) 𝑦̂ − 𝑦 2 (𝑥2 − 𝑥3 ) 2 (𝑦2 − 𝑦3 ) ̂2 − 𝑑2 ) + ( 𝑑2 − 𝑑 (𝑑 ̂2 2 2 3 3 𝑥̂ 2 (𝑥1 − 𝑥2 ) 2y (𝑦1 − 𝑦2 ) 𝑥1 − 𝑥2 + 𝑦1 − 𝑦22 − 𝑑 2 2 2 ̂2 + 𝑑 ̂2 1 2 A=[ ], W= [ ],Z=[ ] 𝑦̂ 2 (𝑥2 − 𝑥3 ) 2 (𝑦2 − 𝑦3 ) 2 2 2 2 ̂2 ̂2 𝑥2 − 𝑥3 + 𝑦2 − 𝑦3 − 𝑑2 + 𝑑3 A is solved using the Moore-Penrose generalized matrix inverse solution for the MMSE [29], [30]. 𝐴 = (𝑊 𝑇 𝑊)−1 𝑊 𝑇 Z (8) However, Federated learning (FL) is a distributed learning framework. As described in [31], FL requires end-users’ devices with low computation power to send in their local pretrained machine learning model to a sink. The sink will concatenate the models into a global model to perform ML tasks. The models received at the sink are affected by noise, and the sink needs to mitigate the noise to effectively use the local models. Similarly, MMSE is used in our proposed approach to Data-centric AI to suppress the noise of the received measurement used in the fingerprinting. 3.2.1. Data-centric AI with MMSE Due to the training datasets which impact the performance of the ML, this paper explores the concept of data-centric explanations for ML systems that describe the training data to the end-user. Their results show that data-centric explanations have the potential to impact how users judge the trustworthiness of a system and to assist users in assessing fairness [32]. A data-centric approach to AI provides a systematic way to improve data, build data consensus, and clean up inconsistent data. This is usually overlooked and data collection is treated as a one-time task. The data-centric approach is more rewarding and calls for a move towards data centrism. To make MLOps systematic, it uses firstly a model-centric view to collect what data it can develop a model good enough to deal with the noise in the data and hold the data fixed and iteratively improve the model. Secondly, it uses a data-centric view witch the consistency of the data is paramount. However, using tools to improve the data quality will allow multiple models to do well but to hold the code fixed and iteratively improve the data. MLOps’ most important task is to make high-quality data available through all stages of the ML project lifecycle example prediction serving [33]. In wireless signal processing applications, where the RSSIs values are usually noisy, a potentially more fruitful approach is MMSE as an approach to data-centric AI one that focuses on improving the data to make simpler wireless network locations perform better. The idea is to enhance signal data by improving removing noise. This idea can be extended to include transforming signals into a wireless network where key features become more prominent and easier to use. However, with a data-centric view, there is significant room for improvement in problems with noise. 3.2.2. Random Forest MMSE RF contains several DTs on various subsets of the given dataset and takes the average to predict the location and the accuracy of the dataset compared to other algorithms in ML such as SVM, KNN, etc. During training, a set of labeled training points can be used to optimize the parameters of the tree, and for testing the same unlabeled test input data is pushed through each component tree. At each internal position, a test is applied and the data point is sent for a prediction. To extend the concept by adding a signal processing functionality as an Edge cloud feature to implement a dynamic cooperation clustering, the MMSE algorithm at each WAP to enhance the quality of the bootstrapped data and share that enhanced bootstrap with the neighboring WAPS in demand, and this MMSE is combined to the random forest. Proposed RF. (MMSE) algorithm for dynamic cooperation clustering. 1. For k=1 to B • Draw N sample points from the collected data from the MUEs and the neighboring WAPS to form a bootstrap at the designated WAP • Applied the MMSE to the data collected from the MUEs to reduce the noise • Grow a random forest tree 𝑇𝑏 to the bootstrapped data by recursively repeating the following steps for each terminal node of the tree until the minimum size 𝑛𝑚𝑖𝑛 is reached • Select m variables at random from the p variables • Pick the best variable/split-point among the m (iii) Split the node into two daughter nodes 2. Output the ensemble of trees {𝑇𝑏 }1𝐵 . The prediction of a new location from the u=input data x is given by the regression 1 𝑓̂𝑅𝐹 𝐵 (𝑥) = 𝐵 ∑𝐵𝑏=1 𝑇𝑏 (𝑥) The classification is given by the majority vote as follows: Let 𝐶𝑏 (𝑥) be the class prediction of the b-th random forest tree, then 𝐵 𝐶𝑅𝐹 (𝑥) = 𝑚𝑎𝑗𝑜𝑟𝑖𝑡𝑦 𝑣𝑜𝑡𝑒 {𝐶𝑏 (𝑥)}1𝐵 . With the proposed RF. (MMSE) the algorithm, each WAP applies the RF locally using its data and the data received from the neighboring WAPS to construct the bootstrap. The contribution to this scheme is the sharing of data by the WAPS which enables a dynamic cooperation clustering. The data shared between WAPs is already processed with MMSE to reduce the noise. It further makes the size of the bootstrap variable at each WAP. The cluster of WAPS exchanging data is of a variable size too. 3.2.3. XGBoost MMSE XGB is a software library. It split the X and Y data into a learning and testing set. The training set will be used to prepare the XGB model and the testing set will be used to make the predictions, from which it can evaluate the performance of the model. For this, it will use the train test split function from the scikit-learn library. It also specifies a seed for the random number generator so that we always get the same split of data each time. The format of the positions of the training data also needs to be modified for the fit function to work Finally. To improve the location accuracy caused by the change in environment, we propose to use XGB. (MMSE). The method first uses the XGB algorithm to establish an indoor positioning model. When the environment changes, further combine the MMSE method to improve the initial positioning. 4. Evaluation of performance The performance of our developed system is evaluated in terms of localization accuracy. In MLOps, to evaluate the performance capability let’s assess the effectiveness of our model, interactively during experimentation. For this, we need to visualize and compare performances of different models, compute pre-defined or custom evaluation metrics for our model on different slices of the data and track trained- model predictive performance across different continuous-training executions. This can help to enable model behavior interpretation using various explainable AI techniques. To evaluate the performance, the different localization algorithms are tested in simulation and compared, as shown in table 1. In all cases, the same training data was used to make the machine learning model. The MSE is used to measure the accuracy of the localization algorithms. 1 MSE= 𝑛 ∑ (Y − 𝑌̂) 2 ,(9) where Y and 𝑌̂are the actual and estimate coordinates at n-th references point. 4.1.1. Simulation description For the simulation, we took all the test points for each percentage to sweep the whole space. That is to say take 630 test points for 10 %, as shown in Figure 3, 467 test points for 33 %, as shown in Figure 4, 233 testing points for 66 %, as shown in Figure 5 and 140 test points for 80 %, as shown in Figure 6. So, for testing, we have other possibilities for each percentage. We have 9 possibilities at 10 %, 3 possibilities at 33 %, 2 possibilities at 66 %, and 2 possibilities at 80 %. These experimental results show that at 10 %, the accuracy between RF. (m) and RF. (MMSE) is improved by 66 % and 48 % between XGB. (m) and XGB. (MMSE). At 33 %, there is a 79 % improvement in accuracy between RF. (m) and RF. (MMSE) and 80 % between XGB. (m) and XGB. (MMSE). At 66 %, there is a 22 % improvement in accuracy between RF. (m) and RF. (MMSE) and 28 % between XGB. (m) and XGB. (MMSE). At 80 %, there is a 27 % improvement in accuracy between RF. (m) and RF. (MMSE) and 29 % between XGB. (m) and XGB. (MMSE). Figure 3: CDF of RF.(m), XGB. (m), RF.(MMSE), XGB. (MMSE) at 10 % Figure 4:CDF of RF.(m), XGB. (m), RF.(MMSE), XGB. (MMSE) at 33 % Figure 5: CDF of RF.(m), XGB. (m), RF.(MMSE), XGB. (MMSE) at 66 % Figure 6: CDF RF, XGB, RF-MMSE, XGB-MMSE at 80 % Table 1 Representation of positioning errors for the Egypt room from the data of a composition of elements % Scenario RF.(m) XGB. (m) RF.(MMSE) XGB.(MMSE) 10 % T=70 2.26 2.36 1.60 1.88 A=630 2.33 2.38 152 1.75 2.21 2.31 1.55 1.78 2.28 2.35 1.59 1.86 2.23 2.30 1.61 1.80 2.25 2.37 1.50 1.84 2.24 2.39 1.57 1.79 2.32 2.41 1.54 1.77 2.34 2.40 1.56 1.73 𝐼α at 95 % [2.19;2.35] [2.27;2.45] [1.50;1.62] [1.72;1.90] 33 % A=233 2.01 2.17 1.22 1.37 T=467 2.09 2.20 1.19 1.40 2.02 2.19 1.17 1.35 𝐼α at 90 % [2;2.09] [2.16;2.20] [1.12;1.26] [1.29;1.45] 66 % A=467 1.25 1.30 1.03 1.08 T=233 1.22 1.33 1.01 1.05 𝐼α at 97 % [1.20;1.25] [1.28;1.33] [1;1.04] [1.04;1.08] 80 % A=560 1 1.11 0.73 0.82 T=140 1.02 1.13 0.70 0.79 1.01 1.15 0.71 0.81 1.03 1.12 0.72 0.80 𝐼α at 97 % [1;1.04] [1.10;1.15] [0.70;0.74] [0.80;0.83] Figure 7: Percentages of different tests using density per interval. 4.1.2. Discussion of the experimental results Analysis of our experimental data revealed that most location errors occurred due to attribution of too much relevance for low RSSI values, that is to say, corresponding to a weak reception, which would present fluctuations that can be further amplified by the presence of interior obstacles, can cause the coordinates of a point of distant affect the estimation. We compared the performance of the XGB and RF algorithm by using MMSE with the state-of-the-art in terms of accuracy. The experiment is done in a real-time environment, with a regular and reproducible scenario. Different scenarios of the test are done with different training and testing with regular distribution. The accuracy of RF.(m), XGB. (m), RF. (MMSE) and XGB. (MMSE) are respectively 2.26 m, 2.36 m, 1.60 m, and 1.88 m at 10 %. At 33 %, we have 233 for training and 467 for testing, the accuracy of RF.(m), XGB. (m), RF. (MMSE) and XGB. (MMSE) are 2.01 m, 2.17 m,1.22 m, and 1.37 m respectively. At 66 %, we have 467 for training and 233 for testing. The accuracy of RF.(m), XGB. (m), RF. (MMSE) and XGB. (MMSE) are respectively 1.25 m,1.30 m,1.03 m, and 1.08 m. At 80 %, this means that we divided our data into 560 for training and 140 for testing. The accuracy of RF.(m), XGB. (m), RF. (MMSE) and XGB. (MMSE) are respectively 1 m, 1.11 m, 0.73 m and 0.82 m. These results show that RF. (MMSE) and XGB. (MMSE) give the highest accuracy than RF.(m), XGB. (m). These results confirm the interest of ML. But, the analysis of knowing which is the most efficient algorithm varies according to the training set compared to the testing set that is needed at 70 %, this is where we obtain the best result. such algorithms using RF or XGB vary, we do not have the same performance and above all the quality of the accuracy is really different. What seems more reasonable is the results we obtain today rather than in the initial test which according to the non-reproducible tests we have a bias which is very important of 2 % compared to the previous paper. 5. Conclusion In this work, we performed an implementation, evaluation, and analysis of machine learning algorithms such as Random Forest and Extreme Gradient Boosting in an indoor environment. These algorithms are combined with MMSE, a data-centric approach to AI, to reduce the noise data and improve accuracy. This indoor location approach resulted in having 700 data points by using an app called Wi-Fi Fingerprint installed on the phone. Various regular and reproducible test sets were carried out. These regular tests are useful to evaluate the ML algorithms and to have a more real and reproducible. As part of an indoor experiment, XGB and RF combined with MMSE give better results at 80 % or 560 learning data and 140 test data with an accuracy of 0.72 m and 0.80 m respectively. The experimental results show that the proposed algorithms RF. (MMSE) and XGB. (MMSE) still achieve good positioning effect even in environmental changes compared to other algorithms, which makes it a good algorithm for the indoor location. 6. Acknowledgements This work is the results of the research project funded by the International Development Research Centre (IDRC) and Swedish International Development Cooperation Agency (SIDA), Artificial Intelligence for Development (AI4D) Africa Scholarship Fund Manager- Africa Center for Technology Studies (ACTS). This work was supported by the French government’s “Eiffel excellence scholarship”, program. [grant number N° P769615J-2021] 7. References [1] A. S. Paul, and E. A. Wan, “Wi-Fi Based Indoor Localization and Tracking Using Sigma-Point Kalman Filtering Methods,” Position, Location and Navigation Symposium, 2008 IEEE/ION, pp. 646-659, United States of America, 5-8 May 2008. [2] D. Liu, Y. Wang, P. He, Y. Zhai, and H. Wang, “TOA localization for multipath and NLOS environment with virtual stations,” EURASIP Journal on Wireless Communications and Networking, 2017. [3] W. Gerok, J. Peissig, “TDOA assisted RSSD based localization using UWB and directional antennas,” Leibniz Universität Hannover, Thomas Kaiser, Universität DuisburgEssen, Germany, 2013. [4] A. Kokkinis, L.Kanaris, A.Liotta, S.Stavrou, “RSS Indoor Localization Based on a Single Access Point, ” Sensors 2019, 19, 3711. https://doi.org/10.3390/s19173711. [5] A.U.Ahmed, R.Arablouei, F.D.Hoog, B.Kusy, R.Jurdak, and N. Bergmann, “Estimating Angle- of-Arrival and Time-of-Flight for Multipath Components Using WiFi Channel State Information, ” Sensors 2018, 18, 1753. https://doi.org/10.3390/s18061753. [6] Y. Duan, K.Y. Lam, V.C. S. Lee, W. Nie, K. Liu, H. Li, and C. J. Xue, “Data Rate Fingerprinting: A WLAN-Based Indoor Positioning Technique for Passive Localization,” IEEE Sensors Journal, Aug. 2019. [7] A. Kokkinis, L. Kanaris, A. Liotta and S.Stavrou “RSS Indoor Localization Based on a Single Access Point,” Department of Electrical Engineering, Eindhoven University of Technology, 5600 Eindhoven, The Netherlands, Journal Sensors 2019. [8] A.Zhang, Y.Yuan, Q. Wu, S.Zhu and J.Deng, “Wireless Localization Based on RSSI Fingerprint Feature Vector,’’ College of Computer and Information Engineering, Xiamen University of Technology, China, Hindawi Publishing Corporation International Journal of Distributed Sensor Networks Volume 2015, Article ID 528747, 7 pages http://dx.doi.org/10.1155/2015/528747. [9] P.Bahl, V.N Padmanabhan, “ Radar: An in-building RF-based user location and tracking system ’’, In Proc. IEEE Infocom, Israel; 2000. p. 775–784. [10] M.Youssef and A. Agrawala, “The Horus WLAN locationdetermination system’’, Conference: Proceedings of the 3rd International Conference on Mobile Systems, Applications, and Services, Seattle, Washington, USA, June 2005. 7] [11] W.Xue, Q. Li, X. Hua, K. Yu, W. Qiu, and B. Zhou, “A New Algorithm for Indoor RSSI Radio Map Reconstruction,” Department of Shenzhen Key Laboratory of Spatial Smart Sensing and Services, Shenzhen University, School of Geodesy and Geomatics and Collaborative Innovation Center for Geospatial Technology, Wuhan University, School of Environmental Science and Spatial Informatics, China University of Mining and Technology, 2018 [12] Y. Huang, J. Zheng, Y. Xiao, and M. Peng, “Robust Localization Algorithm Based on the RSSI Ranging Scope,” School of Electronic Information Engineering, Suzhou Vocational University, Publishing Corporation International Journal of Distributed Sensor Networks, China, Jan.2015. [13] O.Pathak, P.Palaskar, R.Palkar, M.Tawari,“ Wi-Fi Indoor Positioning System Based on RSSI Measurements from Wi-Fi Access Points –A Tri-lateration Approach,’’ International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014. [14] Q.Yang, S.Zheng, M. Liu and Y. Z.Yang, “Wi-Fi indoor positioning in a smart exhibition hall based on received signal strength indication,’’ EURASIP Journal on Wireless Communications and Networking’’ (2019) 2019:275 https://doi.org/10.1186/s13638-019-1601-3 [15] A.Khalajmehrabadi, N.Gatsis and D.Akopian, IEE, “Modern WLAN Fingerprinting Indoor Positioning Methods and Deployment Challenges,’’ IEEE Communications Surveys & Tutorials, Oct. 2016. [16] M.S. Choi, B.Jang, “An Accurate Fingerprinting based Indoor Positioning Algorithm,’’ Department of Computer Science, Sangmyung University, Seoul, South Korea, International Journal of Applied Engineering Research ISSN 0973-4562 Volume 12, Number 1 (2017) pp. 86- 90. [17] E. Jedari, Z. Wu, R. Rashidzadeh, M. Saif, “Wi-Fi Based Indoor Location Positioning Employing Random Forest Classifier,” Department of Electrical and Computer Engineering, University of Windsor 401 Sunset Ave. Windsor, Alberta, International Conference on Indoor Positioning and Indoor Navigation (IPIN), Canada, Oct.2015. [18] M.Luckner, B.Topolski, M.Mazurek, “ Application of XGBoost Algorithm in Fingerprinting Localisation Task, ” 16th IFIP International Conference on Computer Information Systems and Industrial Management (CISIM), Bialystok, Poland, Jun 2017, pp. 661- 671, ff10.1007/978-3-319- 59105-6_57ff. ffhal-01656240. [19] W.Qiao, X.Kang, M.Li, “ An Improved XGBoost Indoor Localization Algorithm,’’ International Conference on Computer Intelligent Systems and Network Remote Control (CISNRC 2020), 2020. [20] E.Schmidta, D.Akopiana, “Indoor Positioning System Using WLAN Channel Estimates as Fingerprints for Mobile Devices,’’ Department of Electrical Engineering, One UTSA Circle, San Antonio, TX 78249, Preprint of the 2015 IS&T/SPIE Electronic Imaging Conference, CA, February 8 - 12, 2015. [21] Y. Tifani, B. Lee, E. Jeong, “A Patient’s Indoor Positioning Algorithm Using Artificial Neural Network and SVM,” Department of Computer Engineering, Catholic Kwandong University, Journal of Theoretical and Applied Information Technology, South Korea, Aug.2017. [22] A.H. Salamah, M. Tamazin, M.A. Sharkas, M.Khedr, “An Enhanced Wi-Fi Indoor Localization System Based on Machine Learning,” Department of Electronics and Communications Engineering, Collège of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport, Alexandria, University of Alcalá, Madrid, Spain, International Conference on Indoor Positioning and Indoor Navigation (IPIN), Egypt,4-7 Oct.2016. [23] O.G. Coast, Z. Kai, L. Binghao, A. Dempster, “A Comparison of algorithms adopted in fingerprinting indoor positioning systems,” School of Surveying and Spatial Information Systems University of New South Wales Sydney, International Global Navigation Satellite Systems Society (IGNSS) Symposium Australia, 16-18 Jul. 2013. [24] M. Niang, M. Ndong, I. Dioum, I. Diop, M. Mashaly and M. A. A. E. Ghany, “Comparison of Random Forest and Extreme Gradient Boosting Fingerprints to Enhance an indoor Wifi Localization System’’, International Mobile, Intelligent, and Ubiquitous Computing Conference (MIUCC),2021, pp.143,148, doi:10.1109/MIUCC52538.20 21.9447676. [25] Q.Yang, S.Zheng, M. Liu, and Y.Zhang, “ Research on Wi-Fi indoor positioning in a smart exhibition hall based on received signal strength indication, ” J Wireless Com Network 2019, 275 (2019). https://doi.org/10.1186/s13638-019-1601-3. [26] https://pypi.org/project/python-firebas [27] Y.F. Huang, Y.T.Jheng and H.C.Chen, “Performance of an MMSE based indoor localization with wireless sensor networks, ’’ The 6th International Conference on Networked Computing and Advanced Information Management, 2010, pp. 671-675. [28] J. Arnold, N. Bean, M. Kraetzl and M. Roughan, “Node Localisation in Wireless Ad Hoc Networks,’’ 2007 15th IEEE International Conference on Networks, 2007, pp. 1-6, DOI: 10.1109/ICON.2007.4444052. [29] L. Hogben, “Handbook of Linear Algebra’’, N.W.: Chapman and Hall/CRC, 2007. pp 5.12-5.16. [30] E. W. Weisstein, “Moore-Penrose matrix inverse,” From Math World– A Wolfram Web Resource,2002, https://mathworld.wolfram.com. [31] Q. Lan, D. Wen1, Z. Zhang, Q. Zeng, X. Chen, P. Popovski, K. Huang, “What is Semantic Communication? A View on Conveying Meaning in the Era of Machine Intelligence’’, Department of Electrical and Electronic Engineering, University of Hong Kong, Journal of Communications and Information Networks 1 Oct 2021 [32] A.I. Anik, A. Bunt, “Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency,’’ University of Manitoba, Winnipeg, Canada, CHI ’21, Yokohama, Japan, May 08–13, 2021. [33] A. Ng, “MLOps-From-Model-centric-to-Data-centric-AI’’