WiVisit: POI Visit Identification Based on Auto-Generated Wi-Fi Fingerprint

Qiang Huang1, Xiang Li1, Xin Li1, Jiazhi Ni1, Xin Zhang1, Ning Xiao1, Hongyi Liu1, Chang Liu1 and Youchen Wang1,2

1 Tencent Inc., Beijing, China
2 School of Transportation Science and Engineering, Beihang University, Beijing, China

Abstract
Point of Interest (POI) visit information is critical for many location-based services, such as POI recommendation and advertisement push. However, most POI visit information is obtained from users' check-in data on social networks, which inevitably contains false visits and misses some real visits. Some work has attempted to mine POI visits from users' GPS trajectories, but these methods cannot cover indoor POIs. In this paper, we propose a Wi-Fi fingerprint-based POI visit identification system, WiVisit, to obtain accurate POI visit information, including for indoor POIs. Unlike traditional Wi-Fi fingerprint-based localization systems, WiVisit generates Wi-Fi fingerprints for different POIs automatically from Wi-Fi and POI binding information, without any human effort. Therefore, the WiVisit system can be easily and widely deployed in the real world. Moreover, a multi-model fusion based POI visit identification method is used in WiVisit to handle multiple POI types. Finally, extensive real POI visits were collected and used to assess the performance of the WiVisit system. WiVisit achieved a 90% recall rate and 83% accuracy on the visited POIs, outperforming the state of the art.

Keywords
Wi-Fi Fingerprint, POI, Indoor Localization

1. Introduction
The term "Point of Interest" (POI) refers to a geographical location that someone may find interesting or useful, or may visit frequently. The POI visit information of a user is very important for many location-based services (LBS), such as advertisement push and next-POI recommendation [1, 2, 3].
However, most POI visit information is obtained from users' check-in data on location-based social networks, such as Yelp, Foursquare, and Facebook Places. Because users push POI visit information manually, many visits are missing and some are fake. Moreover, if a user pushes the visit information when leaving the POI, the best time to push ads has already been missed. Therefore, we wanted to build an accurate localization system that can identify real POI visits when users arrive at the POI.

Some researchers have also tried to use GPS trajectory information to identify POI visits [4, 5, 6]. However, many POIs are indoor POIs. For these POIs, the GPS signal is blocked, and the GPS trajectory information is missing as well. For that reason, utilizing GPS trajectories for indoor POI visit identification is impracticable.

IPIN 2021 WiP Proceedings, November 29 – December 2, 2021, Lloret de Mar, Spain
Emails: johnnhuang@tencent.com (Q. Huang); allenxli@tencent.com (X. Li); clarkxinli@tencent.com (X. Li); andyni@tencent.com (J. Ni); deanxzhang@tencent.com (X. Zhang); ariesxiao@tencent.com (N. Xiao); hongyiliu@tencent.com (H. Liu); levenliu@tencent.com (C. Liu); youchenwang@tencent.com (Y. Wang)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Figure 1: Framework of WiVisit. (Diagram components: Wi-Fi POI binding of Wi-Fi SSIDs to POI names; auto-generation of POI Wi-Fi fingerprints from localization query logs; feature extraction; identification models with fusion; an online query with scanned Wi-Fi RSSI values yielding POI visit info.)

In recent years, Wi-Fi devices have been broadly used, and Wi-Fi fingerprint-based localization techniques have also been widely applied in both outdoor and indoor environments.
Based on pre-built Wi-Fi fingerprints for each POI, a localization system [7, 8, 9, 10, 11] can be used to identify whether the user is at a POI and which POI they are visiting. However, achieving this goal poses several challenges. First, the number of POIs is enormous, and a POI may change frequently due to renovation or relocation. Thus, POI fingerprints are very difficult to collect and update manually. Second, different types of POI have diverse Wi-Fi environments; therefore, the features of fingerprints for distinct POIs vary considerably.

To address these challenges, we implemented an auto-generated Wi-Fi fingerprint-based POI visit identification system (WiVisit). Figure 1 shows the framework of the WiVisit system. First, we proposed an automatic POI fingerprint collection method. The basic idea is that if the scanned Wi-Fi list in a localization query contains a POI's Wi-Fi and its RSSI is strong enough, the user is most likely visiting that POI. That is, based on the path loss model [12], they are very close to the Wi-Fi device at the POI. To obtain a POI's Wi-Fi information, we designed a POI-Wi-Fi binding module based on the similarity between the names of the POI and the Wi-Fi. Using the POI-Wi-Fi binding information, many in-POI localization queries, whose scanned Wi-Fi list contains a POI's Wi-Fi with a strong enough RSSI, are collected automatically. The POI's fingerprints can then be built from these queries. Second, based on the POIs' type information, we divided all POIs into different groups and trained a separate POI visit identification model for each group. Finally, we proposed a multi-model fusion mechanism to combine these models into one.

The following are our main contributions in this paper:
• We proposed a POI-Wi-Fi binding module, which can discover Wi-Fi devices that are placed in the POIs. Based on the POI-Wi-Fi binding information, the WiVisit system can automatically collect and update POI fingerprints without any human effort.
• We designed a multi-model fusion based POI visit identification method that can be used for different types of POI.
• We also conducted extensive experiments to test the performance and effectiveness of the WiVisit system in the real world.

The remaining sections are organized as follows: Section 2 introduces the related work and Section 3 gives a formal definition of POI visit identification. Section 4 describes in detail how to generate POI fingerprints automatically. Section 5 introduces the POI visit identification model. Section 6 presents the experiments and evaluation of WiVisit, and Section 7 concludes the paper.

2. Related Work
2.1. Indoor Localization
Many POIs are located in indoor environments, such as stores, shopping malls, restaurants, and bars. Determining whether a user is visiting a POI, rather than just passing through it, can be treated as an indoor localization problem.

Over the years, many indoor localization technologies have been developed, including cameras [13, 14], sound [15, 16], radio frequency [7, 17, 18, 19], etc. Camera-based technologies [13, 14] can achieve high accuracy, but their privacy concerns are a major obstacle to real-life deployment. Sound-based solutions [15, 16] are vulnerable to environmental noise and their coverage is quite small. In addition, dedicated infrastructure is needed for UWB-based [19, 20, 21] and Bluetooth-based [17] solutions. Since Wi-Fi access points are deployed ubiquitously, we focused on Wi-Fi-based indoor localization methods in our work.

Wi-Fi-based indoor localization has drawn a lot of attention from researchers in the last decade. RADAR [7] is a pioneering system employing Wi-Fi RSSI information as the fingerprint for localization.
Since then, many RSSI-based indoor fingerprint localization methods have been developed to reduce the offline training load or to improve the accuracy [8, 9, 10, 11, 22]. However, these methods still require some human effort for fingerprint collection and updating, which is infeasible for a large number of POIs. In the past few years, some fine-grained CSI-based indoor localization methods have been proposed [23, 24, 25, 26, 27, 28]. However, only a few commercial Wi-Fi chips provide CSI information to users. Hence, CSI-based indoor localization methods cannot be adopted widely in the real world.

Compared with these methods, WiVisit can automatically generate and update POIs' Wi-Fi fingerprints from localization queries without any human effort. Meanwhile, the WiVisit system adopts RSSI information as the fingerprint, which is available on almost all existing Wi-Fi chips. Therefore, the WiVisit system can be deployed in the real world rapidly and widely.

2.2. POI Visit Mining from GPS Trajectory
GPS is widely used for outdoor localization. Consequently, in recent years, some researchers have tried to mine POI visit events from users' GPS trajectory data [4, 5, 6]. They first extracted stay-points from a GPS trajectory, then detected semantic locations from these stay-points and assigned POI visit information to each location. However, due to privacy concerns, users do not keep the GPS localization service on all the time. For this reason, getting a complete GPS trajectory of a user for POI visit mining is very difficult in the real world. Additionally, in an indoor POI, users may not be able to receive a GPS signal, so they cannot get the GPS location information required for POI visit mining.

In comparison with these methods, the WiVisit system does not require users' trajectories. Whenever the user needs position information, the WiVisit system can identify the POI visit based on the user's localization query.
Moreover, as Wi-Fi devices are ubiquitous, almost all POIs have deployed their own Wi-Fi devices, which the WiVisit system can use for POI visit identification.

3. Definition of POI Visit Identification
Localization query: A localization query is a user's scanned Wi-Fi list that contains the MAC address ($m$) and RSSI value ($r$) of each scanned Wi-Fi:

$$query = \{m_1 : r_1,\ m_2 : r_2,\ \ldots,\ m_K : r_K\} \tag{1}$$

where $K$ is the number of scanned Wi-Fis.

POI fingerprint: A POI fingerprint ($FP_{p_j}$) records the Wi-Fi information observed in the localization queries of the POI ($p_j$) over a period of time:

$$FP_{p_j} = \{m_{j1} : (r_{j1}, o_{j1}),\ m_{j2} : (r_{j2}, o_{j2}),\ \ldots,\ m_{jN} : (r_{jN}, o_{jN})\} \tag{2}$$

where $N$ is the number of Wi-Fis, $r_{jn}$ is the median RSSI of the Wi-Fi $m_{jn}$ in the localization queries of the POI ($p_j$) used to build the fingerprint, and $o_{jn}$ is its rate of occurrence:

$$o_{jn} = \frac{|\{query \mid m_{jn} \in query \wedge query \in p_j\}|}{|\{query \mid query \in p_j\}|} \tag{3}$$

POI visit identification: The task of POI visit identification is to determine whether a localization query is from a POI and, if so, which POI it is from, based on the scanned Wi-Fi list in the query and the POIs' Wi-Fi fingerprints:

$$p = \begin{cases} \arg\max_{i} P(p_i \mid query), & P_{max} \ge \theta \\ None, & P_{max} < \theta \end{cases} \tag{4}$$

where $P_{max} = \max_i P(p_i \mid query)$ and $\theta$ is the probability threshold used to determine whether the query is in a POI.

4. Auto-Generated POIs' Fingerprint
4.1. Wi-Fi POI binding

Figure 2: POI-Wi-Fi binding module. Based on the similarity among different forms of the POI name and the Wi-Fi SSID, and on their coordinate distance, a set of precise Wi-Fi POI binding pairs was created. (Diagram components: Wi-Fi SSID interpreter and POI name interpreter, each producing CN, PY, PYABB, EN and ENABB forms; Wi-Fi position and POI position; matcher; Wi-Fi POI binding output.)

Most Wi-Fi fingerprint-based indoor localization systems [7, 8, 9, 10, 11, 22] collect fingerprint information through human effort. However, it is not feasible to create a mass pool of POI fingerprints manually.
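To make the definitions of Section 3 concrete, the following minimal Python sketch computes a fingerprint (median RSSI and rate of occurrence, Equations 2 and 3) from a batch of queries attributed to one POI, and applies the thresholded decision of Equation 4. The function names, the dictionary layout, and the example value of the threshold are illustrative assumptions, not part of WiVisit.

```python
from collections import defaultdict
from statistics import median


def build_fingerprint(queries):
    """Build {mac: (median_rssi, occurrence_rate)} from the queries of one POI.

    queries: list of {mac: rssi} dicts (Eq. 1), all attributed to the same POI.
    """
    rssi_by_mac = defaultdict(list)
    for query in queries:
        for mac, rssi in query.items():
            rssi_by_mac[mac].append(rssi)
    total = len(queries)
    # Eq. 2 and Eq. 3: median RSSI and share of queries in which the MAC appears.
    return {
        mac: (median(values), len(values) / total)
        for mac, values in rssi_by_mac.items()
    }


def identify(scores, theta):
    """Eq. 4: return the POI with the highest P(p_i | query), or None below theta.

    scores: {poi_id: probability}; theta is a hypothetical threshold value.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= theta else None


if __name__ == "__main__":
    fp = build_fingerprint([{"a": -40, "b": -70}, {"a": -50}, {"a": -45, "c": -80}])
    print(fp["a"])
    print(identify({"kfc": 0.9, "mcd": 0.4}, theta=0.5))
```

A Wi-Fi seen in every query of the POI gets an occurrence rate of 1, while one seen in a single query out of three gets 1/3, which is what lets the later features weight stable access points more heavily.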
Therefore, we proposed an effective Wi-Fi binding method, which is the key to building the Wi-Fi fingerprints of POIs automatically. Wi-Fi is ubiquitous and most POIs have their own Wi-Fi devices. We noticed that most POIs have an SSID (Service Set Identifier, the Wi-Fi name that users can see) similar to their POI name. For instance, in China, McDonald's-related Wi-Fi SSIDs are mcd-chinanet or mcdonald's. Because of this, we proposed a POI-Wi-Fi binding method, shown in Figure 2. First, for each POI name, we used a natural language processing (NLP) interpreter [29, 30] to get its Chinese name (CN), Chinese pinyin (PY), Chinese pinyin abbreviation (PYABB), English name (EN) and English abbreviation (ENABB), the forms listed in Equation 5. Then, a text similarity between each pair of POI and Wi-Fi was computed as follows:

$$S(m, p) = \max_{lab} S(m_{lab}, p_{lab}), \quad lab \in \{CN, PY, PYABB, EN, ENABB\} \tag{5}$$

where $S(m_{lab}, p_{lab})$ is a text similarity defined through the text Hamming distance. The similarity score is the maximum value over the different forms of the Wi-Fi SSID and POI name. In addition to the similarity between the Wi-Fi SSID and the POI name, we also integrated the coordinate distance into the Wi-Fi POI matching score:

$$G(m, p) = \begin{cases} S(m, p) - \gamma D(m, p), & D(m, p) \le D_{th} \\ 0, & D(m, p) > D_{th} \end{cases} \tag{6}$$

where $\gamma$ is a normalization factor between similarity and distance, $D(m, p)$ is the distance between the Wi-Fi and the POI, and $D_{th}$ is the distance threshold used to select the candidate POI-Wi-Fi pairs for the binding process¹. Finally, we assigned each Wi-Fi to the POI that has the maximum score among all candidate POIs, provided that this score is at least $G_{th}$:

$$\langle p, m \rangle = \begin{cases} \arg\max_{\langle p_i, m \rangle} G(m, p_i), & G_{max} \ge G_{th} \\ None, & G_{max} < G_{th} \end{cases} \tag{7}$$

where $G_{th}$ is the score threshold for choosing the truly matched POI-Wi-Fi pair.

¹ Wi-Fi positions were calculated from historical localization logs; the procedure is omitted because it is beyond the scope of this paper. POI positions were obtained from Tencent Map [31].

4.2. Automatic POI fingerprints generation
Using the POI-Wi-Fi binding information, POI fingerprints can be collected automatically. Based on the path loss model [12], the farther the user is from the Wi-Fi device, the lower the received Wi-Fi signal power. Thus, when a user visits a POI, the scanned RSSI of the POI's Wi-Fi is higher than when the user is outside the POI. For a localization query, if a scanned Wi-Fi in the query is a POI-bound Wi-Fi and its RSSI value is high enough², the query is likely to have taken place in the POI, and it is used to build the POI fingerprint. In this way, the fingerprints of POIs can be generated from the localization queries of a given period. In the WiVisit system, the POI fingerprint records all Wi-Fis that were scanned in the historical localization queries at that POI. For each Wi-Fi, it contains two statistics, the median RSSI value and the rate of occurrence, which are defined in Section 3.

5. POI Visit Identification Model
5.1. Sample Extraction
To train the POI visit identification model, we first need to extract a set of in-POI and out-POI queries as training samples. Similar to fingerprint generation, the in-POI queries are those that contain a POI Wi-Fi whose RSSI value is high enough and that carry no additional GPS info. To make sure that the out-POI queries take place truly outside the POIs but not far from them, the out-POI queries are extracted based on three criteria: (1) they contain GPS info and their GPS location accuracy [33, 34] is no larger than 30 meters; (2) their distance from the POI position is less than 100 meters; (3) they contain at least one binding Wi-Fi in their scanned Wi-Fi lists. Because Wi-Fi signals can penetrate walls, a Wi-Fi can be scanned in multiple neighboring POIs, meaning that a Wi-Fi can appear in different POI fingerprints.
Consequently, each query can generate multiple feature vectors, one for each POI whose fingerprint contains at least one Wi-Fi in the scanned Wi-Fi list of the query, so a query sample potentially corresponds to multiple training samples for our identification model. For this reason, we label each sample as follows:
1. If a query recalls a POI different from its raw-extracted POI, the corresponding feature vector is treated as a negative sample for the recalled POI.
2. If a query recalls the same POI as its raw-extracted POI, the original label is used.

5.2. Feature Extraction
Nowadays, almost all POIs deploy their own Wi-Fi routers to provide Wi-Fi services for their employees and visitors. Therefore, when a user visits a POI, the POI's Wi-Fi devices appear in the user's scanned Wi-Fi list. However, the range within which a Wi-Fi can be scanned is limited. If a user does not visit a POI, the user is likely to fail to scan the POI's Wi-Fi. This strong correlation between scanned Wi-Fi and POI visit is crucial for POI visit identification. Given a Wi-Fi query and a POI $p$, $P(p \mid query)$ is the probability that the query occurred in the POI. We extracted several features from four dimensions to reflect $P(p \mid query)$.

² For the WiVisit system, we chose −50 dB as the threshold.

1. $P(m \mid p_j)$ is defined as the probability that Wi-Fi $m$ is scanned when users visit POI $p_j$. First, we assume that the Wi-Fis scanned in a query are independent of one another. Based on Bayes' formula, the posterior probability $P(p_j \mid query)$ is proportional to the likelihood function $L(query \mid p_j)$:

$$L(query \mid p_j) = \prod_{k=1}^{K} P(m_k \mid p_j) \tag{8}$$

where $K$ is the number of scanned Wi-Fis. $P(m_k \mid p_j)$ can be approximated by the rate of occurrence $o_{jk}$. Thus, the first posterior probability feature is:

$$F_1 = \sum_{k=1}^{K} \log P(m_k \mid p_j) \tag{9}$$

However, not all Wi-Fis are independent of each other.
Thus, we also chose some statistical features of $P(m_k \mid p_j)$ that do not require independence between the Wi-Fis:

$$F_2 = \max(\log P(m_k \mid p_j)) \tag{10}$$

$$F_3 = \operatorname{mean}(\log P(m_k \mid p_j)) \tag{11}$$

2. Because Wi-Fi signals can penetrate walls, a POI's Wi-Fi can sometimes still be scanned even when the user is outside. However, in those cases, the user is usually farther away from the POI's Wi-Fi device than when inside the POI, which makes the signal strength of the POI's Wi-Fi weaker. Moreover, due to the obstruction of walls, the received signal strength is also weaker when the user is outside. For these reasons, the RSSI of the POI's Wi-Fi is also very important for POI visit identification. Therefore, we calculated the RSSI weighted posterior probability based on the following likelihood function:

$$P_\alpha(m_k, r_k \mid p_j) = \beta_k P(m_k \mid p_j) \tag{12}$$

where $\beta_k = [(r_k + r_{max}) / r_{mean}]^{\alpha}$. Similar to Equations 9-11, we can get the RSSI weighted posterior probability features:

$$F_4 = \sum_{k=1}^{K} \log P_\alpha(m_k, r_k \mid p_j) \tag{13}$$

Different values of $\alpha$ give different likelihood functions, which generate different posterior probability features based on Equation 13. In this way, we get a set of RSSI weighted posterior probability features.

3. In recent years, Wi-Fi has been used not only to connect to the Internet, but also for communication between smart home devices. In an indoor environment, for example, in addition to routers, there are many smart devices with Wi-Fi chips that can be scanned, such as smart TVs, intelligent speakers, and smart air-conditioners. Therefore, when a user is in a POI, the scanned RSSI of smart devices can also be very strong, and the proportion of strong Wi-Fi in the scanned Wi-Fi list will be very high. However, if the user is outside the POI, due to the obstruction of walls and the increasing distance from the smart devices and routers, the proportion of strong Wi-Fi in the scanned Wi-Fi list will decrease.
Based on this intuition, we also extracted posterior probability features between the POI and the Wi-Fis filtered by absolute RSSI value. For these features, the likelihood function can be represented as follows:

$$P_f(m_k \mid p_j) = \begin{cases} P(m_k \mid p_j), & r_k \ge r_f \\ 0, & r_k < r_f \end{cases} \tag{14}$$

where $r_f$ is the absolute RSSI threshold. Based on Equation 14, the RSSI filtered posterior probability feature can be represented as:

$$F_5 = \sum_{k=1}^{K} \log P_f(m_k \mid p_j) \tag{15}$$

With different $r_f$, we get a set of absolute RSSI filtered features.

4. For commercial Wi-Fi devices, the RSSI measurements of a Wi-Fi signal are influenced by the antennas, the quality of the Wi-Fi chips, the way the phone is held, etc. Thus, even in the same location, different phones scanning the same Wi-Fi will report different RSSI values, and features generated from the absolute RSSI value can inherit this bias. Therefore, we extracted some posterior probability features between the POI and the Wi-Fis filtered by relative RSSI information to remove this bias. First, we extracted marginal features for the query:

$$F_6 = \sum_{k \in Q_{top(x)}} \log P(m_k \mid p_j) \tag{16}$$

where $Q_{top(x)}$ is the set of the $x$ Wi-Fis with the highest RSSI values in the scanned Wi-Fi list of the query. Meanwhile, we extracted the marginal probability features for the POI:

$$F_7 = \sum_{k \le K \,\wedge\, m_k \in L_{top(s)}} \log P(m_k \mid p_j) \tag{17}$$

where $L_{top(s)}$ is the set of the $s$ Wi-Fis with the highest median RSSI values in the fingerprint of the POI. Finally, we extracted the joint probability features:

$$F_8 = \sum_{k \in Q_{top(x)} \,\wedge\, m_k \in L_{top(s)}} \log P(m_k \mid p_j) \tag{18}$$

Based on Equations 16-18, with different $x$ and $s$, we also get a set of relative RSSI filtered features.
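The feature families of Equations 9 to 18 can be illustrated with a short Python sketch for a single (query, POI) pair. This is a minimal sketch under stated assumptions: the paper does not say how zero probabilities are handled inside the logarithm, so the code floors them at a small constant `EPS`; the function names, the dictionary layout, and the default values of `r_f`, `x`, and `s` are hypothetical. The RSSI weighted feature $F_4$ is omitted because $r_{max}$ and $r_{mean}$ are not defined further in the text.

```python
import math

EPS = 1e-6  # assumed floor for zero probabilities; not specified in the paper


def log_p(mac, fingerprint):
    """log P(m | p_j), approximated by the occurrence rate stored in the fingerprint."""
    _median_rssi, occurrence = fingerprint.get(mac, (None, 0.0))
    return math.log(max(occurrence, EPS))


def top_macs(rssi_by_mac, n):
    """MACs of the n strongest entries of a {mac: rssi} mapping."""
    return set(sorted(rssi_by_mac, key=rssi_by_mac.get, reverse=True)[:n])


def features(query, fingerprint, r_f=-60, x=2, s=2):
    """query: non-empty {mac: rssi}; fingerprint: {mac: (median_rssi, occurrence_rate)}."""
    logs = {m: log_p(m, fingerprint) for m in query}
    q_top = top_macs(query, x)  # Q_top(x): strongest Wi-Fis in the query
    l_top = top_macs({m: v[0] for m, v in fingerprint.items()}, s)  # L_top(s)
    return {
        "F1": sum(logs.values()),                       # Eq. 9
        "F2": max(logs.values()),                       # Eq. 10
        "F3": sum(logs.values()) / len(logs),           # Eq. 11
        "F5": sum(logs[m] if query[m] >= r_f else math.log(EPS) for m in query),  # Eq. 15
        "F6": sum(logs[m] for m in q_top),              # Eq. 16
        "F7": sum(logs[m] for m in query if m in l_top),  # Eq. 17
        "F8": sum(logs[m] for m in q_top & l_top),      # Eq. 18
    }


if __name__ == "__main__":
    query = {"a": -40, "b": -70, "c": -80}
    fingerprint = {"a": (-45, 1.0), "b": (-65, 0.5)}
    for name, value in features(query, fingerprint).items():
        print(name, round(value, 3))
```

Sweeping `r_f`, `x`, and `s` (and the exponent of the weighted variant) over a grid is one way such a scheme would yield a long feature vector from a handful of formulas.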
Figure 3: POI visit identification model training workflow. (Diagram components: POIs' localization query samples split into POI Set 1 to POI Set 61; POI queries balance (B1); POI Wi-Fi fingerprints; train samples (feature vectors); train samples balance (B2); LightGBM models M1 to M61; voting fusion; prediction.)

5.3. Training Model
As described in Section 5.2, with different values of the hyper-parameters, we obtained a 784-dimensional feature vector to represent each training sample. The LightGBM [32] classification model was used to build the POI visit identification model. For the LightGBM model, we chose the gradient boosting decision tree (GBDT) as the boosting type and binary log-loss as the loss function. The number of leaves for each tree was set to 127. To avoid over-fitting, the feature fraction was set to 0.6 and the bagging fraction to 0.7. All other LightGBM parameters were left at their default values.

To deal with hybrid POI types, we trained 61 sub-models by dividing the query samples into 61 parts based on their POI info, and then combined their predictions by voting to get the final result. If the number of sub-models that give a visit prediction is larger than 30 (i.e., a majority of the 61), the final prediction is a visit and the POI with the highest score is the visited POI. Figure 3 shows the full POI visit identification model training workflow.

Moreover, to improve the robustness and generalization of WiVisit, we introduced several optimization methods:
1. POI queries balance: Different POIs have different popularity. A popular POI has more localization queries than an unpopular POI. The unbalanced query numbers of different POIs affect the generalization of the model. Therefore, during the sample extraction phase, the maximum number of query samples for each POI is limited.
2. Training samples balance: For most POIs, the number of negative samples is much larger than the number of positive samples. To get more effective parameter estimation, we adopted random under-sampling of negative samples for each POI. As a result, for each POI, the number of negative samples is no more than 1.2 times the number of positive samples.

5.4. Online Prediction
Finally, we can use the trained model to predict POI visit information. In detail, for a user localization query with a scanned Wi-Fi list, the related POIs, i.e., those whose fingerprint contains at least one Wi-Fi in the scanned Wi-Fi list of the query, are retrieved from the fingerprint database. For each such POI, the feature vector is computed, and the visit predictions of the sub-models are added up to obtain the final visiting score. If at least one POI is finally predicted as visited, we select the POI with the highest visiting score as the visiting result.

6. Evaluation
6.1. Experiment Settings
To evaluate the WiVisit system, we extracted the top 6 hot types ('Entertainment', 'Life services', 'Restaurant', 'Shopping', 'Fitness', 'Hotel') of POIs in Beijing from Tencent Map [31], which contain 210 sub-types and more than 90,000 POIs in total. We then applied our POI Wi-Fi binding module, introduced in Section 4, to these POIs to get the POI Wi-Fi information. We took the Wi-Fis that occurred at least once in the last two weeks of all Tencent localization queries as the candidate Wi-Fi set for POI binding. The size of the candidate Wi-Fi set is on the order of several billion. Because of this large number of Wi-Fis, the similarity between a given POI and every Wi-Fi cannot be computed within a reasonable time. Therefore, given a POI, we only extracted the set of Wi-Fis whose distance from the POI is less than 200 meters ($D_{th}$), which is large enough compared with the common indoor Wi-Fi coverage area. The threshold $D_{th}$ is used to filter a small candidate binding Wi-Fi set for each POI.
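The candidate filtering and scoring of Equations 5 to 7 can be sketched in Python with the parameter values reported in this section ($D_{th}$ = 200 m, $\gamma$ = 0.005, binding score threshold 0.8). This is an illustrative sketch only: `difflib.SequenceMatcher.ratio()` is used as a stand-in similarity in [0, 1] and is not the text similarity used by WiVisit, distances are assumed to be precomputed in meters, and all function names and the data layout are hypothetical.

```python
from difflib import SequenceMatcher

D_TH = 200.0   # candidate distance threshold in meters
GAMMA = 0.005  # normalization factor between similarity and distance
G_TH = 0.8     # binding score threshold


def similarity(ssid_forms, name_forms):
    """Eq. 5: maximum similarity over the name forms available on both sides.

    ssid_forms / name_forms: dicts keyed by form label, e.g. {"EN": "kfc"}.
    SequenceMatcher is a stand-in for the paper's text similarity.
    """
    shared = ssid_forms.keys() & name_forms.keys()
    return max(
        (
            SequenceMatcher(None, ssid_forms[k].lower(), name_forms[k].lower()).ratio()
            for k in shared
        ),
        default=0.0,
    )


def binding_score(sim, dist_m):
    """Eq. 6: similarity penalized by distance, zero beyond D_TH."""
    return sim - GAMMA * dist_m if dist_m <= D_TH else 0.0


def bind(ssid_forms, candidates):
    """Eq. 7: candidates is a list of (poi_id, name_forms, dist_m) tuples.

    Returns the best-scoring poi_id, or None if no score reaches G_TH.
    """
    best_poi, best_score = None, float("-inf")
    for poi_id, name_forms, dist_m in candidates:
        score = binding_score(similarity(ssid_forms, name_forms), dist_m)
        if score > best_score:
            best_poi, best_score = poi_id, score
    return best_poi if best_score >= G_TH else None


if __name__ == "__main__":
    ssid = {"EN": "mcdonald's"}
    candidates = [
        ("poi_mcd", {"EN": "McDonald's"}, 10.0),
        ("poi_kfc", {"EN": "KFC"}, 5.0),
    ]
    print(bind(ssid, candidates))
```

With these values, a perfectly matching name at 10 m scores 0.95 and is bound, while the same name at 150 m scores 0.25 and is rejected, so the distance term acts as a strong tie-breaker between same-brand POIs.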
The normalization factor $\gamma$ was set to 0.005, which means that even a pair with similarity 1 has a score of 0 once its distance reaches $D_{th} = 200$ meters ($1 - 0.005 \times 200 = 0$). To assess the accuracy of the POI-Wi-Fi binding module, we manually labeled 20,000 POI-Wi-Fi pairs as the benchmark data set. The accuracy at different binding score thresholds was calculated and is shown in Figure 4. Finally, POI-Wi-Fi pairs with binding scores larger than 0.8, for which the accuracy is larger than 98%, were used to generate POI fingerprints. After POI-Wi-Fi binding, we obtained 42278 POIs that had binding Wi-Fi. For these POIs, we extracted raw user queries from 20210301 to 20210304 to build the fingerprint database. To train the POI identification model, we collected localization queries from 20210305 to 20210306 as the training set of WiVisit. To evaluate the identification performance, we collected localization queries from another two days (20210307, 20210308) as the test set to compare different methods. Moreover, we also manually collected real POI visit queries to evaluate the performance of WiVisit in the real world. For each POI, the visit queries were collected by two collectors with two different types of mobile phones, each staying at least five minutes. The POI identification model consists of 61 sub-models. These sub-models were trained in parallel in cluster mode, with a training time of no more than half an hour. Online prediction is triggered on a server when a query is received. Based on our test, the average prediction time is less than 30 ms per query. The detailed information of these sample sets is shown in Table 1.

Figure 4: Wi-Fi POI binding score vs. accuracy. (Plot of binding accuracy, y-axis from 0.75 to 1.00, against binding score, x-axis from 0.5 to 1.0.)

Table 1: Sample Sets
Data Set    Time                 POI Set    In-POI     Out-POI
Train       20210305-20210306    42278      1178542    1994754
Test        20210307-20210308    19506      61349      61012
Manual      20210307-20210320    592        6237       0

6.2. Compared Methods
1. BASE0: The rule used for labeling visit samples: if at least one scanned Wi-Fi in the query is a binding Wi-Fi with an RSSI value higher than −50 dB and the query carries no GPS info, then the query is a visit sample and the corresponding POI is the visited POI. We only compare BASE0 with the other methods on the Manual Set.
2. BASE1: One intuitive assumption is that a localization query that has more Wi-Fis in common with a POI fingerprint is more likely to occur in this POI. Based on this assumption, we computed the ratio of Wi-Fi overlap between a localization query and the POIs' fingerprints:

$$R_j = \frac{|\{m \mid m \in query \wedge m \in FP_j\}|}{|\{m \mid m \in query\}|} \tag{19}$$

where $R_j$ is the ratio of Wi-Fi overlap between the query and POI $j$. If the maximum overlapping ratio is larger than the given threshold, the query is labeled as a visit query and the POI corresponding to the maximum ratio is the visited POI. To choose the best threshold, we plotted the overlapping ratio between the query and its associated POI for each training sample, as shown in Figure 5³. As shown in the figure, the best threshold is 0.84.

Figure 5: Ratio of overlapping Wi-Fi. (Histogram of frequency, y-axis from 0 to 3000, against overlapping ratio, x-axis from 0.0 to 1.0, for Visit and Un-Visit samples.)

3. BASE2: In addition to the ratio of Wi-Fi overlap, we included the RSSI strength and the rate of occurrence of each Wi-Fi to define a visiting score as follows:

$$P_v(p_j \mid query) = \sum_{m_i} w_i \, o_{ji} \tag{20}$$

where $P_v(p_j \mid query)$ is the score of a user being in POI $p_j$ given the scanned query, and $w_i$ is the weight of Wi-Fi $m_i$ with RSSI $r_i$ in the query. Based on the path loss model [12], the weight function is $w_i = 10^{(r_i - r_0)/r_{th}}$, where $r_i$ is the signal strength of Wi-Fi $m_i$, $r_0$ is the maximum RSSI in the given query, and $r_{th}$ is a measure of the divergence of all RSSIs. In our evaluation, $r_{th}$ was fixed at 50. $o_{ji}$ is the co-occurrence probability of $p_j$ and $m_i$, which was defined in Section 3.
For this method, the best threshold is 0.12, which gives the highest F score on the Train Set.
4. WiVisit-1: The same classification model as WiVisit, but with only one LightGBM model.
5. WiVisit: Our fusion POI visit model for different POI types.

³ We removed samples whose overlapping ratio is 1 to make the figure easy to read. Among negative samples, no more than 0.5% have an overlapping ratio of 1; among positive samples, nearly 73.2% do.

6.3. Metrics
For POI visit identification, the objective is not only to determine whether a localization query occurs in a POI, but also to identify which POI it is from. Thus, we used four metrics (precision, recall, F score, accuracy) to evaluate the performance. They are defined as follows:
• TP: $\{query \mid (predicted\ visit) \wedge (real\ visit)\}$
• FP: $\{query \mid (predicted\ visit) \wedge (real\ un\text{-}visit)\}$
• FN: $\{query \mid (predicted\ un\text{-}visit) \wedge (real\ visit)\}$
• HP: $\{query \mid (predicted\ POI) \equiv (real\ POI)\}$
• Precision: $PPV = \frac{|TP|}{|TP| + |FP|}$
• Recall: $TPR = \frac{|TP|}{|TP| + |FN|}$
• F score: $\frac{2 \times PPV \times TPR}{PPV + TPR}$
• Accuracy: $\frac{|HP|}{|TP| + |FP|}$

Table 2: Test Set Performance
Method       Precision    Recall    F Score    Accuracy
BASE1        92.3%        80.8%     0.86       39.3%
BASE2        51.6%        83.6%     0.64       66.3%
WiVisit-1    81.1%        96.2%     0.88       73.4%
WiVisit      91.5%        96.4%     0.94       83.4%

6.4. Results
6.4.1. Test Set
First, we compared these methods on the Test Set, which was collected by the same rules as the Training Set but in a different time period. Table 2 shows the evaluation results. BASE1 can accurately determine whether the user is visiting (92.3% precision), but its predicted POI is often not the one actually visited (39.3% accuracy). That is, the POI with the most overlapping Wi-Fis is not always the visited POI. BASE2 had a higher recall rate and accuracy, which means that the RSSI and the rate of co-occurrence are more useful for finding the right visited POI. However, its lower precision means that they are not enough to decide whether a localization query occurs in a POI.
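As a quick arithmetic check, the F scores in Table 2 follow from the reported precision and recall values through the F score definition in Section 6.3. The short Python sketch below recomputes them; the dictionary layout and function name are illustrative.

```python
def f_score(precision, recall):
    """F score as defined in Section 6.3: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F score) taken from Table 2.
TABLE_2 = {
    "BASE1": (0.923, 0.808, 0.86),
    "BASE2": (0.516, 0.836, 0.64),
    "WiVisit-1": (0.811, 0.962, 0.88),
    "WiVisit": (0.915, 0.964, 0.94),
}

if __name__ == "__main__":
    for method, (precision, recall, reported) in TABLE_2.items():
        print(method, round(f_score(precision, recall), 2), reported)
```

For example, for WiVisit, $2 \times 0.915 \times 0.964 / (0.915 + 0.964) \approx 0.939$, which rounds to the reported 0.94.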
Compared with these two baseline methods, WiVisit-1 achieved much better performance in both POI visit judgment and final visited POI identification. Moreover, compared with WiVisit-1, since the fusion model is more robust across various POIs, WiVisit obtained an improvement of about 10 percentage points in both precision and accuracy.

6.4.2. POI Types
We further compared these methods on different POI types, since different types of POI have different Wi-Fi environments. As shown in Figure 6, for the 6 types of POIs, we used F scores to show the performance of POI visit judgment and accuracies to show the performance of final visited POI identification. For POI visit judgment, BASE1 and WiVisit-1 yielded similar performance, while WiVisit-1 was more robust across different types of POIs. BASE2 gave the worst F score, although with a much better accuracy for visited POI identification. Both BASE1 and BASE2 were clearly affected by POI type, especially for "Hotel". WiVisit-1 and WiVisit achieved much more robust performance in terms of F score and accuracy. Moreover, WiVisit obtained much better performance in all types of POIs compared with the other methods.

Figure 6: Comparison of different methods over different POI types: (a) F score, (b) Accuracy. Methods shown: BASE1, BASE2, WiVisit-1, WiVisit. POI types are abbreviated as follows. E: Entertainment, L: Life services, R: Restaurant, S: Shopping, F: Fitness, H: Hotel.

Table 3: Manual Set Comparison
Methods      Recall    Accuracy
BASE0        32.4%     100%
BASE1        65.4%     63.7%
BASE2        88.4%     73.3%
WiVisit-1    84.4%     80.7%
WiVisit      87.2%     83.6%

6.4.3. Manual Set Comparison
We further compared the above methods on the Manual Set, as shown in Table 3. Since the Manual Set only contains visit samples, we only used recall and accuracy to assess the performance.
BASE0 achieved 100% accuracy, meaning the rule for visit query extraction was very effective. However, its strict rules led to a low recall of 32.4%. BASE1 improved recall markedly (65.4%), but its accuracy was only 63.7%. BASE2 was better balanced, with 88.4% recall and 73.3% accuracy. For both baselines the accuracy was lower than the recall, markedly so for BASE2, which means that BASE1 and BASE2 still cannot reliably distinguish the truly visited POI in POI-dense areas. WiVisit-1 and WiVisit again performed best overall. WiVisit had a recall similar to that of BASE2 (87.2% versus 88.4%), but it raised the accuracy from 73.3% to 83.6%.

7. Conclusion

In this paper, we proposed an auto-generated Wi-Fi fingerprint-based POI visit identification system, WiVisit, which collects accurate POI visit information for users, including visits to indoor POIs. Such information is crucial for various LBS. Compared with traditional Wi-Fi fingerprint-based localization methods, WiVisit does not require any human effort for fingerprint collection and updating. Therefore, WiVisit can be deployed in the real world widely and easily. Moreover, WiVisit adopts a multi-model fusion based method for POI visit identification, which can deal with the different types of POI found in the real world. In our extensive experiments, the recall rate of WiVisit is around 90% and the accuracy is 83%, which already outperforms the state-of-the-art.

However, many POIs still cannot be bound to Wi-Fis, either because their Wi-Fi SSIDs are irregular or because no Wi-Fi device is deployed in them. In the future, crowd-sourcing methods can be used to collect Wi-Fi information for more POIs, which will make WiVisit applicable to more POIs in the real world.

References

[1] M. Ye, P. Yin, W.-C. Lee, D.-L. Lee, Exploiting geographical influence for collaborative point-of-interest recommendation, in: Proceedings of SIGIR’11, 2011, pp. 325–334.
[2] E. Cho, S. A. Myers, J.
Leskovec, Friendship and mobility: User movement in location-based social networks, in: Proceedings of SIGKDD’11, 2011, pp. 1082–1090.
[3] H. Yin, W. Wang, H. Wang, L. Chen, X. Zhou, Spatial-aware hierarchical collaborative deep learning for POI recommendation, IEEE Transactions on Knowledge and Data Engineering 29 (2017) 2537–2551.
[4] J. Suzuki, Y. Suhara, H. Toda, K. Nishida, Personalized visited-POI assignment to individual raw GPS trajectories, ACM Trans. Spatial Algorithms Syst. 5 (2019).
[5] D. Ashbrook, T. Starner, Learning significant locations and predicting user movement with GPS, in: Proceedings of the Sixth International Symposium on Wearable Computers, 2002, pp. 101–108.
[6] X. Cao, G. Cong, C. S. Jensen, Mining significant semantic locations from GPS data, Proc. VLDB Endow. 3 (2010) 1009–1020.
[7] P. Bahl, V. Padmanabhan, RADAR: an in-building RF-based user location and tracking system, in: Proceedings IEEE INFOCOM, volume 2, 2000, pp. 775–784.
[8] Y. Jiang, X. Pan, K. Li, Q. Lv, R. P. Dick, M. Hannigan, L. Shang, Ariel: Automatic Wi-Fi based room fingerprinting for indoor localization, in: Proceedings of UbiComp’12, 2012, pp. 441–450.
[9] H. Xu, Z. Yang, Z. Zhou, L. Shangguan, K. Yi, Y. Liu, Enhancing WiFi-based localization with visual clues, in: Proceedings of UbiComp’15, 2015, pp. 963–974.
[10] M. Youssef, A. Agrawala, The Horus WLAN location determination system, in: Proceedings of MobiSys’05, 2005, pp. 205–218.
[11] H.-H. Liu, Y.-N. Yang, WiFi-based indoor positioning for multi-floor environment, in: TENCON 2011 - 2011 IEEE Region 10 Conference, 2011, pp. 597–601.
[12] T. Rappaport, Wireless Communications: Principles and Practice, 2nd ed., Prentice Hall PTR, 2001.
[13] Q. Cai, J. Aggarwal, Automatic tracking of human motion in indoor scenes across multiple synchronized video streams, in: Sixth International Conference on Computer Vision, 1998, pp. 356–362.
[14] J. M. Chaquet, E. J. Carmona, A. Fernández-Caballero, A survey of video datasets for human action and activity recognition, Comput. Vis. Image Underst. 117 (2013) 633–659.
[15] W. Mao, J. He, L. Qiu, CAT: High-precision acoustic motion tracking, in: Proceedings of MobiCom’16, 2016, pp. 69–81.
[16] S. Yun, Y.-C. Chen, L. Qiu, Turning a mobile device into a mouse in the air, in: Proceedings of MobiSys’15, 2015, pp. 15–29.
[17] M. Altini, D. Brunelli, E. Farella, L. Benini, Bluetooth indoor localization with multiple neural networks, in: IEEE 5th International Symposium on Wireless Pervasive Computing, 2010, pp. 295–300.
[18] L. Yang, Y. Chen, X.-Y. Li, C. Xiao, M. Li, Y. Liu, Tagoram: Real-time tracking of mobile RFID tags to high precision using COTS devices, in: Proceedings of MobiCom’14, 2014, pp. 237–248.
[19] S. Gezici, Z. Tian, G. Giannakis, H. Kobayashi, A. Molisch, H. Poor, Z. Sahinoglu, Localization via ultra-wideband radios: a look at positioning aspects for future sensor networks, IEEE Signal Processing Magazine 22 (2005) 70–84.
[20] A. Martinelli, S. Jayousi, S. Caputo, L. Mucchi, UWB positioning for industrial applications: the galvanic plating case study, in: IPIN’19, 2019, pp. 1–7.
[21] H. Perakis, V. Gikas, Evaluation of range error calibration models for indoor UWB positioning applications, in: IPIN’18, 2018, pp. 206–212.
[22] S. Lembo, S. Horsmanheimo, M. Somersalo, M. Laukkanen, L. Tuomimäki, S. Huilla, Enhancing WiFi RSS fingerprint positioning accuracy: lobe-forming in radiation pattern enabled by an air-gap, in: IPIN’19, 2019, pp. 1–8.
[23] H. Abdel-Nasser, R. Samir, I. Sabek, M. Youssef, MonoPHY: Mono-stream-based device-free WLAN localization via physical layer information, in: IEEE Wireless Communications and Networking Conference, 2013, pp. 4546–4551.
[24] K. Chintalapudi, A. Padmanabha Iyer, V. N. Padmanabhan, Indoor localization without the pain, in: Proceedings of MobiCom’10, 2010, pp. 173–184.
[25] X. Li, S. Li, D. Zhang, J. Xiong, Y. Wang, H. Mei, Dynamic-MUSIC: Accurate device-free indoor localization, in: Proceedings of UbiComp’16, 2016, pp. 196–207.
[26] K. Wu, J. Xiao, Y. Yi, M. Gao, L. M. Ni, FILA: Fine-grained indoor localization, in: Proceedings IEEE INFOCOM, 2012, pp. 2210–2218.
[27] D. Vasisht, S. Kumar, D. Katabi, Decimeter-level localization with a single WiFi access point, in: NSDI’16, USENIX Association, 2016, pp. 165–178.
[28] B. Berruet, O. Baala, A. Caminada, V. Guillet, E-loc: Enhanced CSI fingerprinting localization for massive machine-type communications in Wi-Fi ambient connectivity, in: IPIN’19, 2019, pp. 1–8.
[29] pypinyin 0.42.0, Accessed July 2, 2021. URL: https://pypi.org/project/pypinyin.
[30] Tencent fanyijun, Accessed July 2, 2021. URL: https://fanyi.qq.com.
[31] Tencent map, Accessed July 2, 2021. URL: http://map.qq.com.
[32] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T.-Y. Liu, LightGBM: A highly efficient gradient boosting decision tree, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.