50


         SOME ASPECTS OF MACHINE LEARNING IN
                   LOCATION TASKS

                                                            1
                                     Kateryna Kononova
     1
    Department of Economic Cybernetics and Applied Economics, V. N. Karazin Kharkiv
National University, e-mail: kateryna.kononova@karazin.ua


         Abstract. As a result of the study, locations for Pan-Asian food delivery service
         in Kharkiv have been found so that their network evenly covered the entire city;
         and different units were at an acceptable distance from each other. The
         company’s order database allowed us to apply ML algorithms, in particular,
         clustering methods to find optimal locations. Three clustering models were
         developed and a series of experiments were conducted with each of them. The
         analysis of the model results allowed us to confirm both hypotheses put forward
         in the paper, namely: 1) reducing dimension does not skew clustering results
         obtained on the full database; 2) urban traffic has a significant impact on
         clustering results. This made us recommend pre-group the data and consider
         urban traffic in location tasks for the referred company.

         Keywords: Location task, Machine Learning, Clustering, Shift Means, K-
         means, API, Google maps.


1. Introduction
The company’s success significantly depends on the location. It affects not only the
cost of rent, access to materials, workers, transportation, but also the perception of the
brand and expansion of the customer number.
   Location databases have enabled companies to do initial screening themselves,
hence reducing their need to rely on external experts to providing only very specific
information on locations [7]. Machine Learning (ML) algorithms are effectively used
for finding the right locations using accumulated companies’ data; especially,
clustering methods, which within the geomarketing approach, use spatial data
(coordinates, address, registry or other bindings) along with general information.
   Various theoretical aspects of ML application in the location tasks are explored in
the scientific literature. Montejano et al. overviewed different location models used
within the geomarketing field, exemplifying it through the use of Geographic
Information Systems (GIS) [8]. Serajnik et al. performed the statistical analysis,
evaluated geodata and carried out spatial analysis with a subsequent cartographic
visualization to define mall location strategy [11]. Rosu et al. used quantitative
models for measuring accessibility to the existing shopping centers in the city,
calculating thus their catchment area, for identifying a suitable location for a new
shopping center [9].


Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
                                                                                    51

   A lot of papers are devoted to the location task analysis in food retail. To propose
new locations for the supermarkets, Grassi [6] analyzed four conditions that are
prioritizing for new locations: the supermarket area of influence, road access,
competition and income of the target audience. Based on the analytic hierarchy
process method and spatial analysis of GIS, Xiao [14] built a methodology for the
process of selecting a supermarket site location. Bekti [3] and Baviera-Puig provided
an application of the spatial methods for retail marketing strategy development for the
supermarkets [2].
   Geomarketing algorithms are also widely used in international and domestic
business practices. For example, WIGeoGIS [13] helps to choose the best GIS system
and the relevant market data, as well as implements of mapping solution for location
analysis. Ukrainian company GeoDesign [4] offers a business strategy development
using spatial data.


2. Hypothesis
The object of the study is the Pan-Asian food delivery service, which is tasked with
finding optimal locations for a network of its units. The company is represented in
several cities of Ukraine, including Kharkiv where it already has three divisions.
   The company is rising rapidly, the number of its customers is growing, so the
current production capacity is no longer sufficient to meet the orders flow in strict
delivery time limits. Therefore, the company has a need to open two new divisions so
that the load at all units was uniform, and the delivery time took no more than 15
minutes. It was decided to renovate the company structure completely closing the old
units and opening new ones in the optimal locations.
   Delivery time is a key location factor for this company. Thus, it was important to
test the impact of urban traffic on model results.
   To find the optimal locations, data on the orders made in Kharkiv last year at peak
hours (from 15-00 to 21-00) were collected. During this period 36095 orders were
received from 9002 customers (table 1).

                           Table 1. A fragment of the dataset
        Order Time                     Latitude                   Longitude
          16:12                       50.027284                   36.225768
          16:27                       50.053163                   36.197766
          15:12                       50.019159                   36.222417
          15:39                       50.013091                   36.278568
          17:53                       49.953651                   36.214925
          15:13                       50.013091                   36.278568
          15:38                       49.958041                   36.329566
          17:21                       49.981018                   36.147102
          17:08                       49.950979                   36.163038
          15:36                       49.963422                   36.287482

  To find dense areas of the customers, we decided to design clustering models.
52

   Since it was decided to use Google Maps API [5] to consider urban traffic in the
model, it was crucial to optimize the number of requests sent online at each stage of
the clustering algorithm. Hence, it was necessary to choose the method of combining
points into a cluster, and decide if it should be based on pairwise distances or centroid
method. The advantage of the first-type methods is that they do not need recalculating
distances every time after combining, which significantly reduces the computational
complexity of the algorithm.
   However, according to preliminary estimates, the use of pairwise distances
methods requires more than 81 million requests for a base of 9,000 clients; while for
the five clusters detecting, centroid methods require about 45,000 requests for each
iteration (and given that 30-40 iterations are needed for the algorithm convergence,
we get about 1.5 million requests only). Thus, the centroid method has been chosen.
   Nevertheless, the question of query optimization remained open. To solve this
problem, it was decided to test the hypothesis that dimension reduction does not skew
the results of clustering.
   Thus, the following two hypotheses were put forward for consideration in the
paper:
    1) reducing dimension does not skew clustering results obtained on the full
          database;
    2) urban traffic has a significant impact on clustering results.
   Three clustering experiments were performed to test these hypotheses:
         full sample clustering,
         pre-grouped sample clustering,
         clustering based on the urban traffic data.


3. Full sample clustering
Mean Shift algorithm based on the centroid method was selected for clustering. Mean
shift clustering is a sliding-window-based algorithm that attempts to find dense areas
of data points. It is a centroid-based algorithm meaning that the goal is to locate the
center points of each class, which works by updating candidates for center points to
be the mean of the points within the sliding-window. These candidate windows are
then filtered in a post-processing stage to eliminate near-duplicates, forming the final
set of center points and their corresponding groups [10].
   To use this algorithm with geographical coordinates, one has to select a distance
metric. Since the size of the sliding window was given in kilometers, it was decided to
use the Haversine metric [12]:


                        (√      (        )        ( )     ( )       (        ))


where d is the distance between points (in km);
r is the globe radius (6371 km);
x1, x2 is the longitude of two points;
y1, y2 is the latitude of two points.
                                                                                         53

   The results of the baseline clustering model obtained on the full dataset are
presented in Table 2 and Figure 1.

                     Table 2. Results of the baseline clustering model
 Cluster          Center          Number of        Customer       Number of   Share of
               Coordinates        customers          share         orders      orders
1 – blue     49.9531, 36.2977       1589            17.65%           5426     15.03%
2 – red      49.9879, 36.2218       2766            30.72%          10148     28.11%
3 – green    50.0430, 36.2229       1633            18.14%           7917     21.93%
4 – pink     49.9377, 36.3872        765            8.49%            2326      6.44%
5 – orange   50.0117, 36.3407       2249            24.98%          10278     28.47%


                   Fig. 1. Visualization of the baseline clustering model


4. Pre-grouped sample clustering
To check the first hypothesis, a weighted clustering method was used to reduce the
dimension. A centroid calculated by a weighted value considers each customer to
have individual value. The centroid is not created in the center of all customers but in
the center of the customers who most satisfy the value, one has weighted [1].
   Using a k-means algorithm with the Haversine metric to detect 500 clusters, the
coordinates of the weighted centers were obtained.
   Figure 2 presents the initial points (marked red), and the weighted centers (marked
purple).
54


             Fig. 2. Visualization of clients’ dataset after dimension reduction

   Next, the set of 500 points was clustered using the Mean Shift algorithm and the
following estimates were obtained (Figure 3, Table 3).


                Fig. 3. Visualization of the pre-grouped sample clustering
                                                                                             55

               Table 3. The results of the pre-grouped sample clustering
 Cluster          Center          Number of       Customer        Number of       Share of
               Coordinates        customers         share          orders          orders
1 – blue     49.9554, 36.2939       1633           18.14%            5802         16.07%
2 – red      49.9887, 36.2196       2692           29.90%            9744         26.99%
3 – green     50.0434, 36.223       1612           17.90%            7818         21.65%
4 – pink     49.9386, 36.3851        817           9.07%             2444          6.77%
5 – orange   50.0115, 36.3408       2248           24.97%           10287         28.49%

   Comparative analysis of the first and second clustering results showed that the
reduction of dimension does not lead to its significant skew, the obtained clusters
coincide by 97%. This allowed us to use the pre-grouped dataset for the calculations,
which require urban traffic data.


5. Clustering based on the urban traffic data
To consider urban traffic data, the clustering algorithm has been modified – to
measure the distance between the two points instead of the Haversine metric we used
data provided by Google Maps [5].
   Working with the Google Maps API, the following settings were specified: type of
transport was ‘car’; forecast time was ‘02.12.2019 18:00’; forecast type was ‘most
likely’ (‘pessimistic’ and ‘optimistic’ estimates were also tested).
   As a result of the clustering model with regard to urban traffic, the following
estimates were obtained (Figure 4, Table 4).


               Fig. 4. Visualization of clustering with regard to urban traffic
56

                  Table 4. Clustering results with regard to urban traffic
 Cluster           Center           Number of       Customer       Number of       Share of
                Coordinates         customers         share         orders          orders
1 – blue      49.9844, 36.2261        3109           34.53%          11468         31.77%
2 – red       50.0411, 36.2229        1608           17.86%           7763         21.50%
3 – green     49.9452, 36.3366        1943           21.58%           6328         17.53%
4 – pink      50.0009, 36.3401        1465           16.27%           6578         18.22%
5 – orange    50.0307, 36.3369         877            9.74%           3958         10.96%

   Comparative analysis of three clustering experiments showed that traffic data
significantly affects the clustering results. This leads to the conclusion that it is
necessary to consider this factor in location tasks for the referred company.


6. Conclusions
As a result of the study, locations for Pan-Asian food delivery service in Kharkiv have
been found so that their network evenly covered the entire city; and different units
were at the acceptable distance from each other (less than 15 minutes by car).
   To find the locations, data about 36095 orders from 9002 customers were
collected. The location database has enabled us to do screening using ML algorithms,
in particular, clustering methods, which within a geomarketing approach, use spatial
data along with general information.
   Two hypotheses have been put forward for consideration, namely:
   1) reducing dimension does not skew clustering results obtained on the full
database;
   2) urban traffic has a significant impact on clustering results.
   Three clustering models were developed and a series of experiments were
conducted with each of them.
   Comparative analysis of the first and second clustering results showed that the
reduction of dimension does not lead to its significant skew. This allows us to use the
pre-grouped dataset for the calculations based on urban traffic data.
   To calculate the distance between two points with regard to urban traffic, the
Haversine metric has been replaced with Google Maps data. The analysis of the
experiments showed that traffic data significantly affects the clustering results.
   Thus, as a result of the study, both hypotheses were confirmed. This made us
recommend pre-group the data and consider urban traffic in location tasks for the
referred company.


References
1. ArcGIS.       (2019).     Find    optimal       store    locations.      Retrieved     from
   http://desktop.arcgis.com/ru/arcmap/latest/extensions/business-analyst/find-optimal-store-
   locations-mean-store.htm.
                                                                                           57

2. Baviera-Puig, A., Buitrago-Vera, J., & Escriba-Pere, C. (2016). Geomarketing Models in
    Supermarket Location Strategie. Journal of Business Economics and Management, 17(6),
    1205–1221. DOI: 10.3846/16111699.2015.1113198.
3. Bekti, R., Pratiwi, N., & Jatipaningrum, M. (2018). Multiplicative Competition Interaction
    Model to obtained Retail Consumer Choice based on Spatial Analysis. IOP Conference
    Series: Earth and Environmental Science, 187(1), 1-9. DOI: 10.1088/1755-
    1315/187/1/012041.
4. Geodesign.info. (2019). Retrieved from https://geodesign.info/.
5. Google Maps. (2019). Retrieved from https://www.google.com/maps/.
6. Grassi, V. (2010). Estratégias de localização de uma rede de supermercados: o
    geomarketing aplicado à companhia zaffari em Porto Alegre. Porto Alegre, Brasilia:
    Universidade Federal do Rio Grande do Sul. DOI: 10.13140/RG.2.2.23534.31041.
7. Heil,         K.         (2012).      Location        strategy.        Retrieved      from
    https://www.referenceforbusiness.com/management/Int-Loc/Location-Strategy.html.
8. Montejano, J.A., & Cruz Bello, G.M. (2018, February 5). Geomarketing Localization
    Models. Espacialidades, Revista de temas Contemporáneos sobre lugares, política y
    cultura,               8(1),             95-120.               Retrieved             from
    http://espacialidades.cua.uam.mx/vol/08/2018/01/06_Montejano_y_Cruz.pdf.
9. Rosu, L., Blăgeanu, A., & Ionuţ-Ciprian, I. (2013). Geomarketing. A New Approach in
    Decision Marketing: Case Study – Shopping Centres in IASI. Lucrările seminarului
    geografic        ‘Dimitrie      Cantemir’,     36,      123-133.       Retrieved     from
    https://www.researchgate.net/publication/291956981_                  GEOMARKETING_-
    A_NEW_APPROACH_IN_DECISION_                               MARKETING_CASE_STUDY_-
    _SHOPPING_CENTRES_IN_IASI.
10. Seif, G. (2018). The 5 Clustering Algorithms Data Scientists Need to Know. Retrieved
    from https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-
    know-a36d136ef68.
11. Serajnik, T., Amaduzzi, S., & Paulus, G. (2014). Geomarketing. Analyses of the Città Fiera
    Ma. GI_Forum, 1, 105-114. DOI:10.1553/giscience2014s105.
12. Sinnott R.W. (1984). Virtues of the Haversine. Sky and Telescope 68 (2), 159.
13. Wigeogis. (2019). Transparency and success. Geomarketing supports retailers. Retrieved
    from https://www.wigeogis.com/en/retail_geomarketing/.
14. Xiao, D., & Ye, W. (2019). Combining GIS and the Analytic Hierarchy Process to Analyze
    Location of hypermarke. IOP Conference Series: Earth and Environmental Science, 237, 1-
    5. DOI: 10.1088/1755-1315/237/3/032012.