LoRaWAN Fingerprinting with K-Means: the Relevance of Clusters Visual Inspection JoaquΓ­n Torres-Sospedra1 , Michiel Aernouts2 , Adriano Moreira1 and Rafael Berkvens2 1 ALGORITMI Research Centre, University of Minho, 4800-058 GuimarΓ£es, Portugal 2 IDLab – Faculty of Applied Engineering, University of Antwerp – imec, Antwerp, Belgium Abstract LoRaWAN-based positioning is emerging as an alternative positioning solution for battery-constrained IoT devices or GNSS-denied areas in urban environments. The data collected at the LoRaWAN Base Stations, such as the RSSI of received messages, can be merged to generate an RF fingerprint. Unsupervised crowdsourcing can be leveraged to build a large radio map covering a urban area at the expense of introducing noise of around tens of meters when labelling the reference data. As fingerprinting may have a low efficiency in a such a dense radio map, we propose to use 𝐾-Means clustering to make the position estimation faster. During our study, we found that clustering can also be used to detect large outliers in the radio map that can be subject to be removed. The rationale is to identify those samples within the cluster that are far from the geometric centroid of the cluster. This paper introduces the analysis of introducing 𝐾-Means clustering with outlier detection and the benefits it might bring. Although removing outliers have not had an outstanding increase in the positioning accuracy, the performed analysis has enabled a new metric that is moderately correlated with the positioning error. This correlation may be useful to detect unreliable position estimates and discard them. The results presented in this work, based on two LoRaWAN datasets, show that the average and median positioning error can be improved by 5 % to 10 % by discarding 4 % to 6 % of operational samples. Keywords Fingerprinting, Clustering, Scalability, LoRaWAN 1. Introduction The Internet of Things (IoT) aims to interconnect a wide variety of objects, ranging from temperature sensors on mobile cooling containers to garbage bins in a city. In order to correctly interpret the measurements of such sensors, it is important to correlate them with location information. In many cases, IoT devices include a GNSS receiver for this purpose. However, this receiver only provides the device itself with location data, an Low Power Wide Area Network (LPWAN) such as LoRaWAN is often used to get sensor measurements and GNSS data to the user. This workflow is illustrated in Fig.1. ICL-GNSS 2022 WiP, June 07–09, 2022, Tampere, Finland $ info@jtorr.es ( JoaquΓ­n Torres-Sospedra); Michiel.Aernouts@uantwerpen.be ( Michiel Aernouts); adriano.moreira@algoritmi.uminho.pt ( Adriano Moreira); rafael.berkvens@uantwerpen.be ( Rafael Berkvens)  0000-0003-4338-4334 ( JoaquΓ­n Torres-Sospedra); 0000-0002-0527-3871 ( Michiel Aernouts); 0000-0002-8967-118X ( Adriano Moreira); 0000-0003-0064-5020 ( Rafael Berkvens) Β© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) Figure 1: LPWANS are used to get sensor and location data to the user. Additionally, the user can access network metadata from the LPWAN. An important constraint on IoT communication and localization technologies is that they must be as energy-efficient as possible, because IoT devices generally operate for multiple years using small batteries. This, and the fact that GNSS can normally only be used in outdoor environments, has motivated researchers to omit power-hungry GNSS receivers and instead leverage the existing LPWAN link and sensor data for localization purposes. For example, metadata such as the Received Signal Strength Indicator (RSSI), phase or timing information from multiple LPWAN receivers can be translated to distance estimations between each receiver and a transmitting IoT device. However, these methods strongly depend on the LPWAN network deployment and generally lead to high location estimation errors. A previous analysis on the choice between GNSS and LPWAN localization shows that the latter should only be favored over GNSS when a large location error is justifiable and when the energy budget of an IoT device is extremely limited [1]. In practice, this means that implementing GNSS receivers on low-power IoT devices is often feasible. That being said, LPWAN localization can certainly still prove its use, because not all applications require location data with GNSS-like accuracy. For example, a construction company might only want to know at which of its building sites its assets are located, which implies that an error of hundreds of meters can be accepted. LPWAN localization can also play an important role in multimodal localization, for example as a fallback solution when a tracking device is moving into GNSS-denied areas such as tunnels or indoor environments [2]. Moreover, it may act as a verification mechanism to detect GNSS spoofing. In 2019, Aernouts et al. published an extended version of the LoRaWAN dataset described in [3]. Over a course of three months, 20 postal services cars carried LoRaWAN devices that periodically transmitted their latest GNSS location. As a result, the collected dataset contains 130430 entries with a ground truth location, the LoRa Spreading Factor (SF) used by the transmitter, timing data and Received Signal Strength (RSS) data for each receiving LoRaWAN gateway. It should be noted that the ground truth information was collected from GNSS receivers and, therefore, with potential errors of tens of meters. First, urban canyoning can decrease the GNSS accuracy, since the dataset is collected in a dense urban area. Second, the received GNSS coordinates of the transmitting device could differ from the actual device coordinates at receiving time because the total transmission time of a LoRa signal can take up to a few seconds, depending on the payload size and the SF. This effect becomes even more prominent when the transmitter travels at higher speeds. RSS data enables positioning with trilateration and fingerprinting. While the former requires knowing the location of the LoRaWAN Base Stations (BSs), the propagation model and the environment obstructions; the latter only requires a set of reference data at known positions, also known as the radio map. In this paper, we focus on passive fingerprinting, where a fingerprint is the set of RSSI measurements of a particular LoRaWAN message transmitted by a device and measured in the available LoRaWAN BSs in the operational area. This technique requires two phases: the offline phase focuses on geo-referenced RSSI data collection (see radio map collection in [3]), whereas the online phase estimates the position of new fingerprints at unknown positions with, for instance, a π‘˜-Nearest Neighbour (π‘˜-NN)-based algorithm and the radio map. However, fingerprinting is computationally demanding if the dataset contains thousands of samples, e.g. LoRaWAN datasets in [3]. In those datasets, every single operational fingerprint has to be compared with all the reference samples in the radio map, even if they significantly differ, to obtain the most similar ones and compute the final position estimate. Thus, clustering techniques have been applied to split the radio map into several smaller versions [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]. In the operational stage, the identification of the most relevant cluster is done first (coarse search). Then, the position is estimated using the corresponding reduced radio map (fine-grained search). This two-step procedure is significantly faster that regular fingerprinting, specially in large datasets [15]. In this paper we propose a version of 𝐾-Means clustering with outlier detection where noisy fingerprints are removed. We hypothetise that the clusters generated with 𝐾-Means over the feature RSSI space can be de-noised by removing the reference samples which are significantly far for the cluster geometric centroid. It is worth noting that the proposed algorithmic solution is performed after generating the clusters with 𝐾-Means. 𝐾-Means clustering is an unsupervised model that groups similar data without, in this case, the location information (i.e., the labels). Thus, we consider that 𝐾-MEANS basic principles cannot be significantly re-formulated to make it more robust. The main contributions of this work include: β€’ Modification of 𝐾-Means to remove outliers from clusters according to the geometric information; β€’ Comprehensive comparison between applying 𝐾-Means without and with ourlier detec- tion; β€’ A new metric which is correlated with the positioning error under some cases; β€’ A procedure to discard unreliable position estimations. The remainder of this work is organised as follows. Section 2 introduces the related work on LoRaWAN, fingerprinting and clustering. Section 3 describes the materials and methods used in this work. Section 4 details the experimental setup and shows the empirical results. Section 5 provides the final discussion and conclusions about this work. 2. Related work 2.1. LoRaWAN and fingerprinting LoRaWAN’s relatively wide bandwidth of 125 kHz to 250 kHz makes it a suitable candidate for both RSS-based and time-based localization. Thanks to the widespread availability of LoRaWAN networks and datasets [3, 16, 17, 18, 19], many researchers have evaluated the performance of various localization methods. For instance, Pospisil et al. evaluated the performance of five Time Difference of Arrival (TDoA) algorithms through simulation and validated two of them with field measurements. They achieved a mean location error of 543 m in a test area of 4.58 km2 [20]. The aforementioned LoRaWAN dataset by Aernouts et al. enabled many researchers to evalu- ate fingerprinting and machine learning approaches for localization. Pandangan et al. generated a hybrid dataset containing RSS and TDoA information based on the LoRaWAN dataset. Their hybrid dataset was then used to evaluate π‘˜-NN and Random Forest algorithms which resulted in a median error of 333 m and 194 m respectively [21]. This is a slight improvement compared to related research on Neural Network localization with the LoRaWAN dataset [22, 23]. Purohit et al. also used this dataset for their research on Neural Network localization. In their investigation of three different learning models, the Long Short-Term Memory (LSTM) model with 64 neurons came out on top with a mean error of 191 m [24]. Janssen et al. compared the location accuracy, 𝑅2 score and evaluation time of ten Machine Learning algorithms using the LoRaWAN dataset. Their experiments show that the weighted π‘˜-NN and Random Forest algorithms result in the best accuracy and 𝑅2 score, but Random Forest has a significantly faster computation time [25]. In a subsequent study, the authors extended their comparison with range-based localization using eight different path loss models and six weight functions. Their best path loss model - weight function combination yielded an estimation error of 700 m, which is significantly higher than the 340 m obtained with fingerprinting. Furthermore, this work provides a comprehensive overview of the trade-offs that must be made between range-based and fingerprinting-based localization, including accuracy, complexity, cost, etc. [26]. 2.2. Clustering in fingerprinting Clustering has been widely applied in Wi-Fi and BLE fingerprinting to reduce the computa- tional cost and keep similar accuracy, being 𝐾-Means [4, 5], including 𝐾-Medoids [6, 7] and Fuzzy 𝑐-Means (FCM) [8, 9, 10] variants, the most popular. Other approaches, such as Affinity Propagation Clustering (APC) [11, 12] or Density-based spatial clustering of applications with noise (DBSCAN) [13, 14], have also been explored but their feasibility may depend on the dataset according to some preliminary experiments we performed. Therefore, this work is focusing in 𝐾-Means clustering, trying to take benefit from the position information of the reference data to remove those reference samples that may poison the radio map. To enhance the performance of 𝐾-Means, we have used the Manhattan distance for distances computations in the feature (RSSI) space and the centroid initialization proposed in [27]. 3. Materials and Methods 3.1. 𝐾-Means in fingerprinting The core of the passive fingerprinting technique requires two phases: the off-line and on-line phases as explained before. In the off-line phase, reference fingerprints (𝑠𝑑 ) are generated from a set of received LoRaWAN messages (that include their position from GPS) by the available LoRaWAN BS, generating thus a radio map (𝒯 ). In the on-line phase, the operational fingerprints (from unknown positions) are compared to the fingerprints stored in the radio map. Their position is estimated using the locations of the most similar fingerprints in the radio map, usually computing their centroid. After generating the radio map 𝒯 , similar fingerprints in the feature RSSI space are grouped by 𝐾-Means clustering algorithm. It is expected that fingerprints within a cluster would be also close in the geometrical space. The output of 𝐾-Means provides the 𝐾 cluster centroids, π’žπ‘– , βˆ€π‘– ∈ [1, . . . , 𝐾], and the reduced radio map for every cluster 𝒯𝑖 , βˆ€π‘– ∈ [1, . . . , 𝐾]. The centroids and reference fingerprints are both vectors representing the feature RSSI space, thus having as many values as LoRaWAN BSs. As an illustrative example, a few clusters over the LoRaWAN 2017/18 dataset are shown in Fig. 2. The gray dots represent the reference fingerprints in the radio map, whereas the coloured ones represent the samples in the cluster. The number of reference fingerprints and their dispersion in the geometric space depends on the cluster. 2000 3000 1800 2500 1600 1400 2000 1200 1000 1500 800 1000 600 400 500 200 4000 4500 1600 3500 4000 1400 3000 3500 1200 2500 3000 1000 2000 800 2500 1500 600 2000 1000 400 1500 200 500 1000 Figure 2: Example of six illustrative clusters generated by 𝐾-Means in LoRaWAN 2017/18 dataset. Color indicates distance [m] to geometric centroid. In the operational phase, the search of most similar reference fingerprints is done in a two-step process. First, the operational fingerprint is compared to all the cluster centroids (RSSI space) to retrieve the one reporting the lowest Euclidean distance. Second, the search of most similar reference fingerprints is done over the corresponding reduced radio map, 𝒯𝑖 . 3.2. Analysis of clustering with 𝐾-Means Previous results in the literature show that 𝐾-Means in fingerprinting reduces the computational cost at the expense of a slightly higher positioning error. This reduction on time is specially relevant in large datasets [28]. In this paper, the same clustering model has been applied to both LoRaWAN datasets, being 𝐾 the squared root of the samples in the radio map as suggested in [28]. These results, which are shown in Section 4.2, were in phase with the results reported in the literature. However, to avoid the adoption of a black box approach while using π‘˜-Means, an additional overall analysis on the clusters was performed. In particular, the location (longitude and latitude in WGS84 format) of the reference samples in the reduced radio map was visually inspected for each cluster, showing a relevant output in many clusters. Fig. 2 shows six illustrative examples of the clusters generated with 𝐾-Means. Despite their size and dispersion depend on the cluster, most of them report cases where the fingerprints are very far (reddish points in the figure) from the current geometric centroid and close to others geometric centroids. Those outliers share similar RSSI values with respect other reference fingerprints in the cluster, but thet geographically far from them. Among other factors, this effect may be caused by the positioning errors introduced by the GNSS receivers. 3.3. Removing noisy samples from clusters The idea to remove noisy samples from the radio map is simple. Given the samples (fingerprints) of a cluster, their geometric centroid (in the WGS84 space) is calculated. All samples whose distance to the geometric centroid is higher than twice the median value are removed. This is only applied to those clusters where the maximum distance is higher than 5 times the median value. i.e., it is only applied to those clusters having significant outliers. The proposed model is described in Algorithm 1, which has 3 stages: clusters generation (line 2), clusters cleaning (ln. 3–12) and position estimation (ln.15–22). First and second stage can be performed once per dataset, so their timing can be neglected when providing the computational costs of providing a position estimate in the online phase. 𝒯 is the radio map, 𝒱 is the set with the test/evaluation samples, π‘˜ is the number of nearest neighbors for π‘˜-NN. A sample (fingerprint) is represented with s and has 𝑁𝑏𝑠 elements (one for each LoRaWAN BS), whereas its position is represented and its position (longitude and latitude in WGS84) with pos. For 𝐾-Means, 𝐾 is the number of clusters (𝐾 = |𝒯 | as suggested βˆšοΈ€ in [15]), π’ž represents the clusters RSSI centroids and 𝒒 represents the clusters geometric lat/lon centroids. 𝒯˙𝑐 stands for the clean reduced radio map for cluster 𝑐. The other parameters for the outlier detection were set based on the researchers experience. e.g., the threshold used to remove the noisy samples, 2 times the median distance, has been selected as the distances to the geometric centroid usually increase gradually. Algorithm 1 π‘˜-NN for positioning with 𝐾-Means and outlier detection 1: input 𝒯 , 𝒱, π‘˜, 𝐾 2: π’žπ‘– ,𝒯𝑖 ← Apply 𝐾-Means to 𝒯 3: for 𝑖 = 1 to 𝐾 do 4: 𝒒𝑖 ← {οΈ€Compute geometric (οΈ€ centroid )οΈ€ of samples }οΈ€ in 𝒯𝑖 5: 𝐺 ← π‘”π‘’π‘œπ·π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’ pos𝑗 , 𝒒𝑖 , βˆ€π‘— ∈ 𝒯𝑖 6: if π‘šπ‘Žπ‘₯ (𝐺) > (5 Β· π‘šπ‘’π‘‘π‘–π‘Žπ‘› (𝐺)) then 7: // Remove {︁ samples far from geometric centroid }︁ 8: 𝒯𝑖 ← s𝒯𝑗 𝑖 ∈ 𝒯𝑖 : 𝐺𝑗 ≀ (2 Β· π‘šπ‘’π‘‘π‘–π‘Žπ‘› (𝐺)) Λ™ 9: else 10: 𝒯˙𝑖 ← 𝒯𝑗 // No cleaning for cluster 𝑖 11: end if 12: end for 13: for 𝑖 = 1 to |𝒱| do 14: Identify most relevant cluster, 𝑐 15: Set the reduced radio map 𝒯˙𝑐 16: for 𝑗 = 1 to |𝒯˙𝑐 | do Λ™ 17: Compute distance between s𝒱𝑖 and s𝒯𝑗 𝑐 18: end for 19: Sort distances in RSS space 20: Select the π‘˜ closest candidates 21: Estimate position lat/lon 22: end for 23: Return: Estimated positions for all samples in 𝒱 4. Experiments and Results 4.1. Experimental Setup In order to assess the proposed clustering model with outlier detection, we have compared the results between the plain π‘˜-NN, the optimization rule proposed by Moreira [29], 𝐾-Means without outlier detection and 𝐾-Means with outlier detection. To estimate the final position, we have used the simple 1-NN algorithm using the Euclidean distance. The models have been run 10 times to minimise the random initialization of 𝐾-Means. For the experiments, two datasets collected in the city of Antwerp between end of 2017 and beginning of 2019 [3, 30] have been used, namely LoRaWAN 2017/18 and LoRaWAN 2018/19. Both datasets were collected to evaluate fingerprint localization algorithms in large outdoor environments and, according to the database authors, the RSSI of the LoRaWAN messages could hold an additional GPS error. This feature makes them appropriate for assessing the proposed algorithm to remove noise from clusters. For both datasets, the samples have been sorted by timestamp and then split for training and testing, the first β‰ˆ 80% of samples have been used for training and the last β‰ˆ 20% of samples have been used for evaluation. This division has been performed to avoid having data from the same device and day on both subsets, 𝒯 and 𝒱. The evaluation metrics include the Averaged Positioning Error (APE), πœ–Β―; the Median Posi- tioning Error (MPE), πœ–Λœ; and the Averaged Operational Time (AOT), πœΒ―π‘“ 𝑝 , and consider all the 10 execution runs. The APE and MPE are included in the ISO18305 standard, whereas the AOT refers to the average time required to process an operational fingerprint and provide the position estimate. In contrast to the plain π‘˜-NN algorithm, where all fingerprints hold similar operational time, the operational time may significantly vary depending on the cluster. i.e., 𝐾-Means clustering does not guarantee that all clusters are equally distributed, so the time required to perform the fine-grained search will depend on the selected cluster. Therefore, the standard deviation is also reported for the operational time. 4.2. Results This subsection is devoted to show the empirical results. First, a comparison with traditional fingerprint models is introduced. Then, a comprehensive analysis about the consequences of removing noise from the radio map is performed. Finally, the possible benefits of the proposed model are described. 4.2.1. Comparative analysis Table 1 introduces the main results for the comparative analysis. It includes the plain π‘˜-NN algorithm (π‘˜ = 1), the optimization rule based on common strongest anchor proposed in Moreira et al. [29], and 𝐾-Means clustering without and with the outlier detection (OD). Fig. 3 introduces the Empirical Cumulative Distribution Function (ECDF) plots of the positioning error and operational time of the four methods for both datasets. Table 1 Main results: APE, MPE and AOT Lorawan 2017/18 Lorawan 2018/19 Method πœ–Β―[m] πœ–Λœ[m] 𝜏¯[ms] πœ–Β―[m] πœ–Λœ[m] 𝜏¯[ms] plain π‘˜-NN 558.3 374.1 2464.0 ( 13.7 ) 375.6 169.3 2518.2 ( 11.0 ) Moreira [29] 563.4 377.0 202.9 ( 149.7 ) 375.5 167.7 306.4 ( 223.9 ) 𝐾-Means 566.1 379.2 17.1 ( 7.3 ) 379.7 174.7 28.2 ( 16.0 ) 𝐾-Means OD 559.3 369.6 16.4 ( 6.8 ) 378.8 168.0 26.8 ( 15.5 ) In general, the four models provide similar results in terms of positioning error being the main difference their computational cost. The two solutions based on 𝐾-Means report the lowest computational cost with an averaged execution time below 20 ms and 30 ms respectively. Removing the outliers not only made 𝐾-Means slightly more accurate but also slightly more efficient in the operational stage as the proposed approach removed 8.6% and 9.5% of reference fingerprints on each dataset respectively. However, the improvements may be marginal. LoRaWAN 2017 LoRaWAN 2019 Figure 3: ECDF of positioning error and execution time for both datasets 4.2.2. A comprehensive analysis of removing noise Despite 𝐾-Means without and with outliers detection having similar performance according to the previous results, positioning can be based on two sources, namely a noisy radio map and a clean radio map. This enables to exploit some additional information at the operational stage as there are samples where both approaches based on 𝐾-Means (without and with outliers detection) do not agree in estimating the position. This happens in 12% and 32% of samples on each dataset, respectively. Thus, the evaluation set can be split into two subsets, one where both estimators agree and provide the same position estimation (β€œsame” in table and figure), and the other where they disagree (β€œdiff ”). Table 2 and Fig. 4 show the corresponding results and ECDFs. According to Table 2 and the ECDFs plots from Fig. 4, the subset β€œsame” is generally better than the subset β€œdiff ” in both metrics, positioning error and execution time, specially in the first dataset (LoRaWAN 2017/18). i.e. when the two estimators –without and with outlier detection– agree, the positioning results are better that when they disagree. If both estimators disagree, the positioning error provided with 𝐾-Means with outlier detection is better. Table 2 Results of 𝐾-Means without and with outlier detection Lorawan 2017/18 Lorawan 2018/19 𝐾-Means Subset πœ–Β―[m] πœ–Λœ[m] 𝜏¯[ms] πœ–Β―[m] πœ–Λœ[m] 𝜏¯[ms] without OD Same 545.2 363.9 16.9 ( 7.1 ) 373.4 169.0 24.3 ( 14.7 ) without OD Diff 734.5 530.8 18.1 ( 8.6 ) 393.3 190.3 36.6 ( 15.6 ) with OD Same 545.2 363.9 16.3 ( 6.7 ) 373.4 169.0 23.7 ( 14.4 ) with OD Diff 673.1 419.8 16.4 ( 7.6 ) 390.3 164.8 33.7 ( 15.4 ) LoRaWAN 2017/18 LoRaWAN 2018/19 Figure 4: ECDF of the models based on 𝐾-Means We hypothesise that a divergence between the position estimate between both 𝐾-Means models may indicate the quality of the position estimate provided by the proposed model. In particular, we explore the correlation between the distance between position estimations when they disagree and the positioning error using 𝐾-Means with outlier detection. First, that relation is shown as a scatter plot in Fig. 5 (top) and as a density heat map (color representing the amount of samples for a particular range in both dimensions) in Fig. 5 (bottom). In addition, we computed the Pearson correlation, which provided a correlation factor of 0.64 and 0.52 for the two datasets respectively. Thus, it seems that when both models based on 𝐾-Means disagree, the distance between the two estimators may indicate the positioning error. The scatter plots are dense as the number of test samples is large and the experiments have been run 10 times. Fig. 6 shows the boxplot of the positioning errors for different distances between estimates. The correlation trends between the distance between estimates and the positioning error using 𝐾-Means with the proposed outlier detection can be seen more clearly in the figure. However, it can also be seen that in distances above around 2000 m seems to be less reliable as the error and its variability are both high. i.e., the the lowest variability is provided in range [0, . . . , 500[ and it is increasing as the distance between estimates also increases and the number of cases is significant. For the ranges including the largest distances between estimations, there are only a few cases in both datasets. LoRaWAN 2017/18 LoRaWAN 2018/19 Figure 5: Scatter plots and heat maps relating distance between estimates and positioning error using the proposed method LoRaWAN 2017/18 LoRaWAN 2018/19 Figure 6: Relation between distances among estimations and positioning error 4.2.3. Possible benefits of combining noisy and cleaned data sets Positioning using 𝐾-Means has shown to be very efficient in terms of computational time. Computing the position estimate with fingerprints from the original reduced radio maps and the cleaned (without outliers) reduced radio maps is feasible. For any operational fingerprint, if the two position estimates differ, their distance could be used as an indicator of reliability (see Figs. 5-6). For instance, if this distance is higher than a predefined threshold, the position estimate could be discarded. We consider that positioning can take benefit of discarding unreliable samples. In general, these operational fingerprints may have a large positioning error attached. Therefore, the positioning error of the remaining fingerprints (the ones that are reliable) should be better. The only requirement is to set a threshold on the distance between the two position estimates. Table 3 and Fig. 7 show the results using 𝐾-Means with outlier detection and different thresholds, where π‘Ÿπ‘  stands for reliable samples. Table 3 Results of combining 𝐾-Means without and with outlier detection removing unreliable operational samples Lorawan 2017/18 Lorawan 2018/19 Threshold πœ–Β―[m] πœ–Λœ[m] 𝜏¯[ms] π‘Ÿπ‘ [%] πœ–Β―[m] πœ–Λœ[m] 𝜏¯[ms] π‘Ÿπ‘ [%] – 559.3 369.6 16.4 ( 6.8 ) 100.0 378.8 168.0 26.8 ( 15.5 ) 100.0 2000 m 547.3 365.8 16.4 ( 6.8 ) 99.1 373.9 165.9 26.9 ( 15.5 ) 99.7 1000 m 536.1 356.8 16.5 ( 6.8 ) 96.5 363.1 159.2 27.0 ( 15.5 ) 98.1 500 m 533.7 352.0 16.5 ( 6.9 ) 93.6 355.8 150.2 27.2 ( 15.6 ) 96.1 250 m 535.6 353.0 16.5 ( 6.9 ) 91.6 351.0 141.9 27.4 ( 15.6 ) 94.0 125 m 537.0 354.1 16.5 ( 6.9 ) 90.7 344.9 129.7 27.5 ( 15.7 ) 91.4 LoRaWAN 2017/18 LoRaWAN 2018/19 Figure 7: ECDF of 𝐾-Means with outlier detection and samples removal The ECDF is shown for both sets, reliable samples (solid) and unreliable samples (dashed). In general, as the threshold decreases, the more samples are considered unreliable and the lower the positioning error of the reliable samples. However, the presence of low positioning errors in the set of unreliable samples increases. i.e., the lower the threshold (e.g., 125 m), the better the results of the reliable samples, but also the higher the probability of discarding a good position estimate. According to the results presented in Table 3 and Fig. 7, the threshold depends on the dataset. For the two LoRaWAN datasets we have used, the threshold is 500 m (LoRaWAN 2017/18) and 125 m (LoRaWAN 2018/19), as they provide good results in terms of positioning error of the reliable samples in their respective datasets. On the other hand, the lower the threshold the more samples (including good estimations) are removed. 5. Discussion and Conclusions 𝐾-Means is often applied to fingerprinting as a black box to obtain a similar average positioning error with a significantly lower computational cost. In this paper, we have applied it to two large datasets, getting results in phase to what has been reported in state-of-the art works about Wi-Fi and BLE fingerprinting. Visual inspection on the generated clusters has shown that they might contain noisy fin- gerprints which are close to the cluster centroid in the RSSI space but on different locations. Thus, as the reference data provides the fingerprints (RSSI vectors) and their locations, we have proposed a simple rule to remove the noisy samples from clusters. Although the results are not outstanding, having two ways to estimate the position has enabled a new metric based on the distance between the two position estimates. For samples where both estimators diverge, this metric has shown to be moderately correlated to the positioning error provided by the proposed 𝐾-Means clustering with outlier detection. Being able to detect unreliable position estimates at the operational stage is an important step as a better accuracy can be ensured for the reliable ones. In this case, the average and median positioning error can be improved by 5 % to 10 % by discarding the 4 % to 6 % of operational samples. In this paper, we propose a model to clean the clusters. It is of utmost importance to not blindly trust on Machine Learning models if they were used as black boxes. Visual inspection allowed to detect noisy samples and get a new metric correlated to the positioning error. Further efforts will be devoted to improve noise removal with different strategies. Acknowledgments A. Moreira gratefully acknowledge funding from FCT – FundaΓ§Γ£o para a CiΓͺncia e Tecnologia within the R&D Units Project Scope: UIDB/00319/2020. References [1] M. Aernouts, T. Janssen, R. Berkvens, M. Weyn, Lora localization: With gnss or without?, IEEE IoT Magazine (Submitted) (2022). [2] M. Aernouts, F. Lemic, B. Moons, J. Famaey, J. Hoebeke, M. Weyn, R. Berkvens, A Multimodal Localization Framework Design for IoT Applications, Sensors 20 (2020) 4622. doi:10.3390/s20164622. [3] M. Aernouts, R. Berkvens, K. Van Vlaenderen, M. Weyn, Sigfox and lorawan datasets for fingerprint localization in large urban and rural areas, Data 3 (2018). URL: https: //www.mdpi.com/2306-5729/3/2/13. doi:10.3390/data3020013. [4] A. Anuwatkun, J. Sangthong, S. Sang-Ngern, A diff-based indoor positioning system using fingerprinting technique and k-means clustering algorithm, in: 16th International Joint Conference on Computer Science and Software Engineering, 2019, pp. 148–151. [5] S. G. Lee, C. Lee, Developing an improved fingerprint positioning radio map using the k-means clustering algorithm, in: Int. Conf. on Information Networking, 2020, pp. 761–765. [6] J. Cheng, Y. Cai, Q. Zhang, J. Cheng, C. Yan, A new three-dimensional indoor positioning mechanism based on wireless lan, Mathematical Problems in Engineering 2014 (2014). [7] H. Lin, L. Chen, An optimized fingerprint positioning algorithm for underground garage environment, in: Int. Conf. on Information Networking, 2016, pp. 291–296. URL: https: //doi.ieeecomputersociety.org/10.1109/ICOIN.2016.7427079. doi:10.1109/ICOIN.2016. 7427079. [8] H. Zhou, N. Van, Indoor fingerprint localization based on fuzzy c-means clustering, 2014, pp. 337–340. doi:10.1109/ICMTMA.2014.83. [9] D. J. Suroso, P. Cherntanomwong, P. Sooraksa, J. Takada, Fingerprint-based technique for indoor localization in wireless sensor networks using fuzzy c-means clustering algo- rithm, in: International Symposium on Intelligent Signal Processing and Communications Systems, 2011. doi:10.1109/ISPACS.2011.6146167. [10] C. Zhang, N. Qin, Y. Xue, L. Yang, Received signal strength-based indoor localization using hierarchical classification, Sensors 20 (2020). doi:10.3390/s20041067. [11] P. A. Karegar, Wireless fingerprinting indoor positioning using affinity propagation clustering methods, Wireless Networks 24 (2018) 2825–2833. URL: https://doi.org/10.1007/ s11276-017-1507-0. doi:10.1007/s11276-017-1507-0. [12] G. Caso, L. De Nardis, M.-G. Di Benedetto, A mixed approach to similarity metric selection in affinity propagation-based wifi fingerprinting indoor positioning, Sensors 15 (2015). doi:10.3390/s151127692. [13] M. Zhou, Y. Wei, Z. Tian, X. Yang, L. Li, Achieving cost-efficient indoor fingerprint localization on wlan platform: A hypothetical test approach, IEEE Access 5 (2017) 15865– 15874. doi:10.1109/ACCESS.2017.2737651. [14] B. Wang, X. Liu, B. Yu, R. Jia, X. Gan, An Improved WiFi Positioning Method Based on Fingerprint Clustering and Signal Weighted Euclidean Distance, Sensors 19 (2019). URL: https://pubmed.ncbi.nlm.nih.gov/31109054https://www.ncbi.nlm.nih.gov/ pmc/articles/PMC6567165/. doi:10.3390/s19102300. [15] J. Torres-Sospedra, P. Richter, A. Moreira, G. M. Mendoza-Silva, E. S. Lohan, S. Trilles, M. Matey-Sanz, J. Huerta, A comprehensive and reproducible comparison of clustering and optimization rules in wi-fi fingerprinting, IEEE Transactions on Mobile Computing 21 (2022) 769–782. doi:10.1109/TMC.2020.3017176. [16] P. Masek, M. Stusek, E. Svertoka, J. Pospisil, R. Burget, E. S. Lohan, I. Marghescu, J. Hosek, A. Ometov, Measurements of LoRaWAN Technology in Urban Scenarios: A Data De- scriptor, Data 6 (2021). URL: https://www.mdpi.com/2306-5729/6/6/62. doi:10.3390/ data6060062. [17] K. Mikhaylov, M. Stusek, P. Masek, R. Fujdiak, R. Mozny, S. Andreev, J. Hosek, On the performance of multi-gateway lorawan deployments: An experimental study, in: 2020 IEEE Wireless Communications and Networking Conference (WCNC), 2020, pp. 1–6. doi:10.1109/WCNC45663.2020.9120655. [18] L. Bhatia, M. Breza, R. Marfievici, J. A. McCann, Loed: The lorawan at the edge dataset: Dataset, in: Proceedings of the Third Workshop on Data: Acquisition To Analysis, DATA ’20, Association for Computing Machinery, New York, NY, USA, 2020, p. 7–8. URL: https://doi.org/10.1145/3419016.3431491. doi:10.1145/3419016.3431491. [19] R. Cardell-Oliver, C. HΓΌbner, M. Leopold, J. Beringer, Dataset: Lora underground farm sensor network, in: Proceedings of the 2nd Workshop on Data Acquisition To Analysis, DATA’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 26–28. URL: https://doi.org/10.1145/3359427.3361912. doi:10.1145/3359427.3361912. [20] J. Pospisil, R. Fujdiak, K. Mikhaylov, Investigation of the performance of tdoa-based localization over lorawan in theory and practice, Sensors (Switzerland) 20 (2020) 1–22. doi:10.3390/s20195464. [21] Z. A. Pandangan, M. C. R. Talampas, Hybrid LoRaWAN Localization using Ensemble Learning, in: 2020 Global Internet of Things Summit (GIoTS), IEEE, 2020, pp. 1–6. doi:10. 1109/GIOTS49054.2020.9119520. [22] G. G. Anagnostopoulos, A. Kalousis, A Reproducible Comparison of RSSI Fingerprinting Localization Methods Using LoRaWAN, in: 16th Workshop on Positioning, Navigation and Communications, 2019. [23] I. Daramouskas, V. Kapoulas, M. Paraskevas, Using Neural Networks for RSSI Location Estimation in LoRa Networks, in: 2019 10th International Conference on Information, Intelligence, Systems and Applications (IISA), IEEE, 2019, pp. 1–7. doi:10.1109/IISA. 2019.8900742. [24] J. Purohit, X. Wang, S. Mao, X. Sun, C. Yang, Fingerprinting-based Indoor and Outdoor Localization with LoRa and Deep Learning, in: GLOBECOM 2020 - 2020 IEEE Global Communications Conference, IEEE, 2020, pp. 1–6. doi:10.1109/GLOBECOM42002.2020. 9322261. [25] T. Janssen, R. Berkvens, M. Weyn, Comparing Machine Learning Algorithms for RSS-Based Localization in LPWAN, in: Lecture Notes in Networks and Systems, volume 96, 2020, pp. 726–735. doi:10.1007/978-3-030-33509-0_68. [26] T. Janssen, R. Berkvens, M. Weyn, Benchmarking RSS-based localization algorithms with LoRaWAN, Internet of Things 11 (2020) 100235. doi:10.1016/j.iot.2020.100235. [27] D. Arthur, S. Vassilvitskii, K-means++: The advantages of careful seeding, in: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2007, pp. 1027– 1035. [28] J. Torres-Sospedra, D. Quezada-Gaibor, G. M. Mendoza-Silva, J. Nurmi, Y. Koucheryavy, J. Huerta, New cluster selection and fine-grained search for k-means clustering and wi-fi fingerprinting, in: 2020 Int. Conf. on Localization and GNSS (ICL-GNSS), 2020. [29] A. Moreira, M. J. Nicolau, F. Meneses, A. Costa, Wi-fi fingerprinting in the real world - RTLS@UM at the EvAAL competition, in: 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), IEEE, ???? [30] M. Aernouts, R. Berkvens, K. Van Vlaenderen, M. Weyn, Sigfox and LoRaWAN Datasets for Fingerprint Localization in Large Urban and Rural Areas, 2019. doi:10.5281/zenodo. 3904158, https://doi.org/10.5281/zenodo.3904158.