<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Framework for Microbusiness Density Forecasting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eimantas Zaranka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmytro Klepachevskyi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bohdan Krushelnytskyi</string-name>
          <email>bohdan.krushelnytskyi@stud.vdu.lt</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomas Krilavičius</string-name>
          <email>tomas.krilavicius@vdu.lt</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre of Applied Research and Development</institution>
          ,
          <country country="LT">Lithuania</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vytautas Magnus University</institution>
          ,
          <addr-line>Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Microbusinesses are a vital part of a country's economy, and forecasting their density is crucial for both governments and hosting providers. Accurate predictions enable the government to plan future benefits for business owners, and hosting providers can efficiently allocate resources. In this study, we use data provided by Venture Forward by GoDaddy, along with data collected from the U.S. Census Bureau, to develop a model for microbusiness density forecasting. During the study we performed experiments using various machine learning techniques, including linear regression (LR), Ridge, Lasso and ElasticNet regression, decision tree (DT), random forest (RF), multilayer perceptron (MLP), gradient boosting, Ada boosting, support vector machine (SVM), XGBoost, LGBM, and TensorFlow decision forest (TFDF) regressors, as well as several neural network architectures such as the MLP, recurrent neural network (RNN), long short-term memory (LSTM), N-BEATS, and autoencoder. The performance of each model was evaluated using the MAE and SMAPE metrics. This study highlights the potential of various machine learning and neural network algorithms for forecasting microbusiness density, which can aid in better resource planning for hosting providers and the government.</p>
      </abstract>
      <kwd-group>
        <kwd>Microbusiness</kwd>
        <kwd>density forecasting</kwd>
        <kwd>feature selection</kwd>
        <kwd>machine learning</kwd>
        <kwd>regression</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Copyright 2023 for this paper by its authors. CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>In addition to the initial data, we were encouraged to use any other useful data. We found many
useful variables that could make the prediction more accurate and help realize more advanced
approaches to improve predictions; they are described in Section 4.</p>
      <p>
        There is a lack of comprehensive studies analysing microbusiness density in the United States. While
there has been some research on small businesses [
        <xref ref-type="bibr" rid="ref2">1</xref>
        ], which generally include those with up to 500
employees, not as much attention has been given to microbusinesses, which typically have fewer
than 10 employees, even though microbusinesses make up a significant portion of the overall
business landscape in the US. As such, there is a need for more research on this important sector of the
economy.
      </p>
      <p>The rest of the paper is organized as follows. A literature review on microbusiness density is
presented in Section 2. Section 3 describes the selected techniques. Section 4 provides a description of
the data used for forecasting, together with exploratory data analysis, data preparation, and feature
selection. Forecasting results are provided in Section 5. Finally, concluding remarks regarding the
forecasts are discussed in Section 6.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature review</title>
      <p>
        We believe that the analysis of microbusinesses is a relatively new topic that has not received much
attention so far. Microbusiness density has been the subject of interest only in [
        <xref ref-type="bibr" rid="ref3">2</xref>
        ]. The goal of that study
was to determine the factors that influence local microbusiness venture density and the factors that are
influenced by it. The authors employed several quantitative analysis techniques, including Ordinary
Least Squares regression, Probit with the Huber-White sandwich estimator of variance, and Ordered
Probit for the equivalent ordinal variable estimate. The results suggest that there is a significant
relationship between microbusiness density and: employment level, distribution of
self-employed/wage-employees, population density, business turnover, urban/rural area indicator,
gender distribution, education, prosperity indexes, internet usage data, and distribution of ethnic
groups.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>In this study, various models were utilized to predict the target variable. Models were selected based
on their usage in previous competitions and included popular regression models such as Linear
Regression, Decision Tree Regressor, XGB Regressor, and others. The performance of these models
was evaluated using the symmetric mean absolute percentage error (SMAPE), which was required by
the competition evaluation rules, and the mean absolute error (MAE). The chosen metrics are defined
as follows:</p>
      <p>SMAPE = (100%/n) ∑_{t=1}^{n} |A_t − F_t| / ((|A_t| + |F_t|)/2), (1)</p>
      <p>MAE = (1/n) ∑_{t=1}^{n} |A_t − F_t|, (2)</p>
      <p>where A_t is an actual value and F_t is a forecasted value.</p>
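      <p>A minimal NumPy sketch of these two metrics; the function names and the zero-denominator convention are ours, not the competition's reference implementation:</p>

```python
import numpy as np

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent (Eq. 1)."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    denom = (np.abs(a) + np.abs(f)) / 2.0
    # Convention: a term contributes 0 when both the actual and forecast values are 0.
    terms = np.divide(np.abs(a - f), denom, out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * terms.mean()

def mae(actual, forecast):
    """Mean absolute error (Eq. 2)."""
    a = np.asarray(actual, dtype=float)
    f = np.asarray(forecast, dtype=float)
    return np.abs(a - f).mean()

actual = [3.0, 2.0, 4.0]
forecast = [2.5, 2.0, 5.0]
print(mae(actual, forecast))              # 0.5
print(round(smape(actual, forecast), 3))  # 13.468
```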
      <p>To examine the relationships between variables, a correlation analysis was conducted using the
Pearson correlation coefficient. This statistical method was selected for its ability to quantify the
strength and direction of linear associations between variables. The Pearson correlation coefficient of
two variables X and Y is formally defined as</p>
      <p>ρ_{X,Y} = cov(X, Y) / (σ_X σ_Y), (3)</p>
      <p>where cov(X, Y) is the covariance of the two variables, σ_X is the standard deviation of X, and σ_Y is
the standard deviation of Y.</p>
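      <p>As a quick sanity check, the coefficient defined above can be computed directly from the covariance and standard deviations and compared against NumPy's built-in (a sketch; the sample values are illustrative):</p>

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: cov(X, Y) divided by the product of the standard deviations."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = ((x - x.mean()) * (y - y.mean())).mean()  # population covariance
    return cov / (x.std() * y.std())                # population standard deviations

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])
print(pearson(x, y))            # 1.0 for a perfectly linear relation
print(np.corrcoef(x, y)[0, 1])  # matches NumPy's built-in
```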
      <p>We applied data normalization to ensure that all attributes have an equal effect on the resulting
variable. Min-max normalization was used to scale the variables between zero and one:</p>
      <p>X′ = (X − min(X)) / (max(X) − min(X)), (4)</p>
      <p>where min(X) and max(X) represent the minimum and maximum values of X respectively.</p>
      <p>To better understand the differences between counties we applied the K-means clustering algorithm.
The idea behind K-means is to partition the data X = {x_1, x_2, ..., x_n} into k (k &lt; n) clusters so that
the objects within a cluster are more similar to each other than to the objects in different clusters. In this
case, object similarity was measured based on the Euclidean distance. The stepwise K-means clustering
algorithm can be defined as follows:
1. Randomly select k initial objects as centroids.
2. Calculate the distance between each object and the centroids.
3. Assign each object to the nearest cluster.
4. Calculate the mean of each cluster as the new centroid.
5. Repeat steps 2–4 until convergence.</p>
      <p>The objective of the K-means algorithm is to minimize the squared error function:</p>
      <p>J = ∑_{j=1}^{k} ∑_{x ∈ C_j} |x − c_j|², (5)</p>
      <p>where x is a point in space representing a given object and c_j is the mean value of cluster C_j.</p>
      <p>
        To define the optimal number of clusters we used the Davies-Bouldin Score and the Elbow Method.
The Davies-Bouldin Score [
        <xref ref-type="bibr" rid="ref4">3</xref>
        ] is defined as the ratio between the within-cluster scatter and the between-cluster
separation:
      </p>
      <p>DB = (1/k) ∑_{i=1}^{k} max_{j ≠ i} ((S_i + S_j) / d_{i,j}), (6)</p>
      <p>where S_i is a measure of scatter within the ith cluster, defined as</p>
      <p>S_i = (1/|C_i|) ∑_{x ∈ C_i} d(x, c_i), (7)</p>
      <p>and d_{i,j} is a measure of separation between the ith and jth clusters:</p>
      <p>d_{i,j} = d(c_i, c_j). (8)</p>
      <p>A lower value of the Davies-Bouldin Score indicates better clustering.</p>
      <p>
        The Elbow Method [
        <xref ref-type="bibr" rid="ref5">4</xref>
        ] is a partitioning method, where the goal is to define clusters such that the
total intra-cluster variation is minimized:
      </p>
      <p>min ∑_{k=1}^{K} W(C_k), (9)</p>
      <p>where C_k is the kth cluster and W(C_k) is the within-cluster variation.</p>
      <p>After the calculations, the results are plotted according to the number of clusters, and the location of
a bend in the plot is usually considered an indicator of the appropriate number of clusters.</p>
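      <p>The cluster-number search described above can be sketched with scikit-learn on synthetic data; the real county features are not reproduced here, so four well-separated groups are generated instead:</p>

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic 2-D features standing in for the county data: four well-separated groups.
data = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 2.0, 4.0, 6.0)])
data = MinMaxScaler().fit_transform(data)  # min-max scaling to [0, 1], as in the paper

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(data)
    scores[k] = davies_bouldin_score(data, labels)  # lower indicates better clustering

best_k = min(scores, key=scores.get)
print(best_k)  # the search recovers the four generated groups
```

KMeans also exposes the within-cluster sum of squares as the fitted model's inertia_ attribute, which can be plotted against k for the Elbow Method.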
    </sec>
    <sec id="sec-4">
      <title>4. Data description</title>
      <p>
        The initial dataset used in this study was provided by GoDaddy [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ]. It consisted of information on
microbusiness density for 3135 counties. The data covered the period from August 2019 to December
2022; thus, each county was described by 41 monthly observations. In total, the dataset had 128 535
observations and 7 features. The description of the dataset can be seen in Table 1. We also used
additional information obtained from the U.S. Census Bureau covering the period from 2017 to 2021.
A detailed description of this dataset is shown in Table 2.
      </p>
      <p>The target variable represented the number of microbusinesses per 100 people aged over 18 in the
given county. Due to the ACS update window, the population figures used to calculate the
microbusiness density are on a two-year lag. This means that the microbusiness density for 2022 was
calculated using population figures from 2020.</p>
      <sec id="sec-4-1">
        <title>Dataset fields</title>
        <p>Table 1 describes the fields of the initial dataset: an ID code consisting of the cfips and
first-day-of-the-month columns; a unique county identifier (cfips) using the Federal Information
Processing System, where the first two digits correspond to the state FIPS code, while the following
three digits represent the county; the name of the county; the name of the state; the date of the
observation, consisting of year, month, and day; the number of microbusinesses per 100 people aged
over 18 in the given county, which is the target variable; and the number of active microbusinesses in
the county (not provided for future forecasting).</p>
        <p>The testing dataset had the same features as Table 1, but without the target feature
microbusiness_density and the active column, which are yet unknown. The forecasts should cover the
period from January to June 2023 for all 3135 counties.</p>
        <p>
          The organizers strongly encouraged the use of external data sources that might help to improve
prediction performance. Therefore, we enriched the initial data using publicly available sources such
as the ACS website. Every externally collected dataset had the same structure as Table 2, where each
column contained information for the given year. The following information was gathered:
• Business turnover [
          <xref ref-type="bibr" rid="ref7">6</xref>
          ].
• Population estimates [
          <xref ref-type="bibr" rid="ref8 ref9">7,8</xref>
          ].
• Demographics of the population in counties [
          <xref ref-type="bibr" rid="ref8 ref9">7,8</xref>
          ].
• Internet usage [
          <xref ref-type="bibr" rid="ref10">9</xref>
          ].
• Unemployment [
          <xref ref-type="bibr" rid="ref11">10</xref>
          ].
• Education of the populace [
          <xref ref-type="bibr" rid="ref12">11</xref>
          ].
• Ethnicity of counties [
          <xref ref-type="bibr" rid="ref13">12</xref>
          ].
• Geographical location of counties [
          <xref ref-type="bibr" rid="ref14">13</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4.1. Exploratory data analysis</title>
      <p>To get a better understanding of the initial data, we performed the following steps: we checked the
distribution of target values, performed correlation analysis, and checked for possible seasonality.</p>
      <p>The microbusiness density distribution revealed that most observations are centered around zero,
see Figure 1(a). Moreover, data distribution is skewed to the right. For this reason, we applied the log
transformation, as shown in Figure 1(b). Based on these results, we can conclude that the microbusiness
density follows a close-to-log-normal distribution.</p>
      <p>Figure 1: (a) original microbusiness density distribution, (b) logarithmically transformed microbusiness
density.</p>
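      <p>The effect of the log transformation on a right-skewed sample can be illustrated with a synthetic log-normal stand-in; the real density values are not reproduced here:</p>

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Synthetic stand-in: a log-normal sample mimics the right-skewed density distribution.
density = rng.lognormal(mean=1.0, sigma=0.8, size=10_000)

print(round(skew(density), 2))          # strongly right-skewed (positive skewness)
print(round(skew(np.log(density)), 2))  # close to 0 after the log transformation
```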
      <p>Analyzing microbusiness data on a state level, we observed that several states had a higher density of
microbusinesses, i.e., California, Colorado, Delaware, Wyoming, Utah, Nevada, and Florida. The
average microbusiness density per county in these states was 8.38, while the rest of the U.S. states had
an average of 3.41, with standard deviations of 13.21 and 3.06 respectively. The two most
outstanding states were Delaware, with an average of 18.74, and Nevada, with 12.42 microbusiness
density per county, as shown in Table 3. To account for these fluctuations, we decided to perform a
clustering analysis. To identify the most appropriate number of clusters, we performed experiments
with different numbers of clusters k = 2, 3, ..., 30. The maximum number of clusters was based on the
rule of thumb</p>
      <p>k ≈ √(n/2), (10)</p>
      <p>where n is the number of observations.</p>
      <p>Table 3 presents, for each of the highest-density states, the average microbusiness density and its
standard deviation.</p>
      <sec id="sec-5-1">
        <p>To compare clustering results obtained using different k values, we used Davies-Bouldin Score (DB
Score) and the Elbow Method. The results are illustrated in Figure 2. In this case, the Elbow Method
did not show a clear indication of an elbow point (see Figure 2(a)), while the DB Score indicated that
the optimal number of clusters for this dataset is four (see Figure 2(b)). A more detailed analysis
revealed that extracted clusters quite well represent the distribution of microbusiness density among
counties. For instance, cluster number two represents counties having the largest density, i.e., the
average value of microbusiness density in this cluster is 62.13 with a standard deviation of 51.19. In
contrast, Cluster 3 has the lowest results, with an average of 2.20 and a standard deviation of 1.02. For
more details, please refer to Table 4.</p>
        <p>Based on these results, we included an additional variable representing the counties' membership
among clusters.</p>
        <p>To analyze relations between microbusiness density and other factors we performed correlation
analysis. Due to the nature of microbusiness density, we focused on two possible variations. The first
approach sought to understand how a 2-year lag impacts the correlation, and the second, how features
correlate with each other. As can be seen in Figure 3, a 2-year lag has a positive impact on the correlation
between features. A weak-to-medium linear dependency can be observed between features, with the
correlation coefficient ranging between 0.2 and 0.6. The target value correlates the most with the college
education feature (value 0.5), followed by the broadband usage and median income features (values 0.4),
as shown in Figure 3(a).</p>
        <p>Lastly, the seasonality analysis was conducted. It was noticed that the majority of counties have a
noticeable increase in microbusinesses during festive periods and a decrease during inter-holiday
periods. The rest of the counties had either no seasonality or very little seasonality.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>4.2. Data preparation</title>
      <p>
        The process of data preparation is essential for machine learning model performance. The following
preprocessing steps were performed to prepare the data for the modelling stage:
1. Combination of training and testing datasets. Because a one-year lag will be introduced into
the dataset, it is important to temporarily have a full dataset. Before the combination, two
additional features were introduced: an is_test column that marked the original sets, and dcount, which
marked the sequence of observations in each county. This step ensures that the first row of each
county will have a full set of features.
2. Introduction of lag terms. The dataset was enriched using series’ own past values, so-called
lags. In this case, we included 1-12 month lags representing the microbusiness density of the
previous year.
3. Imputation of missing values. Merging the external datasets introduced some missing values
that were imputed using the mean value of the corresponding county feature.
4. Splitting the combined dataset back into training and testing sets. To prevent data leakage
full dataset was split back into training and testing datasets using is_test feature introduced in
the first step of data preparation.
5. Removal of observations with incomplete lag values in the training dataset. The first eleven
entries of each county in the training dataset were dropped due to the missing lag values.
6. Numerical feature scaling. All numerical features, except the target and lag values, were scaled
to the range [0, 1]. This ensures that machine learning models interpret all features on
the same scale.
7. Log-transformation. Two different variations of datasets were created. One with original target
and lag values, and the second, where target and lag values were logarithmically transformed.
8. Categorical feature encoding. All categorical features, i.e., state, county, and cluster number, were
transformed using one-hot encoding.
      </p>
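      <p>The lag, imputation, row-removal, scaling, and encoding steps above can be sketched with pandas; the column names and toy values are illustrative, not the competition's exact schema:</p>

```python
import pandas as pd
import numpy as np

# Toy panel: two counties, 14 monthly observations each (illustrative column names).
df = pd.DataFrame({
    "cfips": [1001] * 14 + [1003] * 14,
    "dcount": list(range(14)) * 2,
    "microbusiness_density": np.r_[np.linspace(3, 4, 14), np.linspace(5, 6, 14)],
    "median_income": [55000.0] * 13 + [np.nan] + [60000.0] * 13 + [np.nan],
    "state": ["AL"] * 28,
})

# Step 2: lag terms, 1-12 month lags of the target, computed per county.
for lag in range(1, 13):
    df[f"density_lag_{lag}"] = df.groupby("cfips")["microbusiness_density"].shift(lag)

# Step 3: impute missing external values with the county mean.
df["median_income"] = df.groupby("cfips")["median_income"].transform(lambda s: s.fillna(s.mean()))

# Step 5: drop leading rows of each county that lack a full set of lag values.
lag_cols = [f"density_lag_{lag}" for lag in range(1, 13)]
df = df.dropna(subset=lag_cols).copy()

# Step 6: min-max scale a numerical feature (not the target or lags) to the range [0, 1].
df["median_income"] = (df["median_income"] - df["median_income"].min()) / (
    df["median_income"].max() - df["median_income"].min()
)

# Step 8: one-hot encode categorical features.
df = pd.get_dummies(df, columns=["state"])
print(df.shape)  # two usable rows per county remain, with lag and dummy columns added
```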
      <p>The final dataset had a total of 3162 features. Seeking to avoid unnecessary complexity caused by
the dataset's dimensionality, we performed feature selection.</p>
    </sec>
    <sec id="sec-7">
      <title>4.3. Feature selection</title>
      <p>
        The data preparation step introduced many new features that negatively impacted the model.
Therefore, statistical feature significance tests, specifically ordinary least squares regression tests [
        <xref ref-type="bibr" rid="ref15">14</xref>
        ],
were conducted for feature selection. A stepwise feature selection procedure [
        <xref ref-type="bibr" rid="ref16">15</xref>
        ] was performed to
ensure that all features had p-values less than 0.05. Based on this procedure, we identified 21
statistically significant features. Please refer to Table 5 for more details.
      </p>
    </sec>
    <sec id="sec-7b">
      <title>5. Results</title>
      <p>Both original and logarithmically transformed datasets were split into training and validation sets.
Training data contained observations from August 2019 to July 2022, and validation data from August
2022 to December 2022. Experiments were performed using various regression techniques, i.e., linear,
Ridge, Lasso and ElasticNet regression; decision tree, random forest, multilayer perceptron, gradient
boosting, Ada boosting, support vector machine, XGBoost, LGBM and TensorFlow decision forest
regressors. Additionally, the following neural network architectures were trained: recurrent, multilayer
perceptron, LSTM, N-BEATS and autoencoder. Finally, a validation test was used to estimate the
model's performance. The obtained results are presented in Table 6.</p>
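      <p>The date-based split described above can be sketched with pandas; the column and variable names are illustrative:</p>

```python
import pandas as pd

# Toy monthly index standing in for the combined dataset's observation dates.
dates = pd.date_range("2019-08-01", "2022-12-01", freq="MS")  # 41 months, as in the data
df = pd.DataFrame({"first_day_of_month": dates, "target": range(len(dates))})

# Training: August 2019 - July 2022; validation: August 2022 - December 2022.
cutoff = pd.Timestamp("2022-08-01")
train = df[df["first_day_of_month"].lt(cutoff)]
valid = df[df["first_day_of_month"].ge(cutoff)]
print(len(train), len(valid))  # 36 5
```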
      <p>The best performing model with original target values was a multilayer perceptron, achieving MAE
of 0.055 and SMAPE of 1.696. On the other hand, the least accurate model was the AdaBoost Regressor,
with MAE of 3.560 and SMAPE of 80.690. However, when using logarithmically transformed target
values, the best performing models were the linear regression and Ridge regression, both with an MAE
of 0.057 and SMAPE of 1.710. The worst models were ElasticNet and Lasso regressor, both models
achieved an MAE of 2.290 and SMAPE of 56.297. It is important to note that the best performing
models for the final submission may differ from the best performing models on the validation dataset.</p>
      <p>Models trained on logarithmically transformed data showed superior results compared to those
trained on original target values, as seen in Table 6. Consequently, the logarithmic transformation was
selected for the final submissions. It is worth noting, however, that the best performing models for
the Kaggle competition differed from the models in the experimentation phase. The subset of models
whose forecasts showed the best results can be seen in Table 7.</p>
      <p>The results were evaluated on data that contained only January 2023 density values. The highest
performing model is the XGBoost regressor with a SMAPE of 3.3159, followed closely by the Random
Forest regressor with an SMAPE of 3.3189. The least accurate model is the recurrent neural network
with SMAPE of 3.8251. The final and full results will be known on June 14th, 2023.</p>
    </sec>
    <sec id="sec-8">
      <title>6. Conclusions</title>
      <p>In this paper, we presented the results of our investigation on microbusiness density forecasting
using various machine learning techniques. Experiments were performed using the dataset from the
GoDaddy Kaggle competition, which was enriched using external data sources. Exploratory data
analysis revealed a significant positive relationship between microbusiness density, college education,
broadband usage, and median income. Moreover, we observed that the distribution of microbusiness
density is not uniform across counties, i.e., California, Colorado, Delaware, Wyoming, Utah, Nevada,
and Florida had a higher density of microbusinesses. To account for these fluctuations, we performed a
clustering analysis based on the K-means algorithm. As a result, four clusters were extracted and
included in the analysis as an additional feature.</p>
      <p>The final dataset consisted of 3162 features. To reduce the dimensionality of the data, we conducted
feature selection using a statistical significance test. As a result, we identified 21 statistically significant
features that were later used for modelling experiments. A detailed analysis of various regression
techniques revealed that the XGBoost method performed the best, with a SMAPE of 3.3159. We also
discovered that logarithmically transforming microbusiness density values produced better validation
results. Consequently, we chose the logarithmically transformed dataset to forecast unseen data.
However, forecasting in an open system is unpredictable due to many factors that can positively or
negatively impact the results. Therefore, such forecasting requires continuous model retraining to
achieve the best possible results for the upcoming months.
</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref2">
        <mixed-citation>
          [1]
          <collab>U.S. Small Business Administration, Office of Advocacy</collab>
          . URL: https://advocacy.sba.gov/category/research/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Saridakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Litsardopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hand</surname>
          </string-name>
          ,
          <article-title>Great Britain Microbusiness White Paper</article-title>
          ,
          <year>2022</year>
          . URL: https://www.godaddy.com/ventureforward/wp-content/uploads/2022/03/GoDaddy_Great_Britain_Microbusiness_White_Paper_2022.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [3]
          <string-name><given-names>U.</given-names> <surname>Aickelin</surname></string-name>
          ,
          <string-name><given-names>I.</given-names> <surname>Dent</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Craigy</surname></string-name>
          and
          <string-name><given-names>T.</given-names> <surname>Roddenz</surname></string-name>
          ,
          <article-title>An approach for assessing clustering of households by electricity usage</article-title>
          .
          <source>In proceedings of UKCI 2012, 12th Workshop on Computational Intelligence</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boehmke</surname>
          </string-name>
          ,
          <article-title>K-means Cluster Analysis</article-title>
          . URL: https://uc-r.github.io/kmeans_clustering#kmeans
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [5]
          <string-name><given-names>K. J.</given-names> <surname>Gracey</surname></string-name>
          , Dataset Description,
          <year>2023</year>
          . URL: https://www.kaggle.com/competitions/godaddy-microbusiness-density-forecasting/data.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [6]
          <collab>United States Census Bureau</collab>
          ,
          <source>County Business Patterns</source>
          ,
          <year>2020</year>
          . URL: https://data.census.gov/table?q=CBP2020.CB2000CBP&amp;g=010XX00US$0500000&amp;tid=CBP2020.CB2000CBP.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [7]
          <collab>United States Census Bureau</collab>
          , County Population Totals: 2010-2019. URL: https://www.census.gov/data/datasets/time-series/demo/popest/2010s-counties-total.html.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [8]
          <collab>United States Census Bureau</collab>
          ,
          <source>County Population Totals and Components of Change: 2020-2022</source>
          ,
          <year>2022</year>
          . URL: https://www.census.gov/data/datasets/time-series/demo/popest/2020s-counties-total.html.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name><given-names>A.</given-names> <surname>Thomas</surname></string-name>
          , Broadband Usage in US,
          <year>2022</year>
          . URL: https://data.world/amberthomas/broadbandusage-in-us.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [10]
          <collab>United States Census Bureau</collab>
          , Employment Status,
          <year>2021</year>
          . URL: https://data.census.gov/table?q=S2301&amp;g=010XX00US$0500000.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [11]
          <collab>United States Census Bureau</collab>
          , Educational Attainment,
          <year>2021</year>
          . URL: https://data.census.gov/table?q=S1501&amp;g=010XX00US$0500000.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [12]
          <string-name><given-names>B.</given-names> <surname>Dill</surname></string-name>
          ,
          <article-title>County level population by race ethnicity 2010-2019</article-title>
          ,
          <year>2020</year>
          . URL: https://data.world/bdill/county-level-population-by-race-ethnicity-2010-2019.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [13]
          <collab>Latitude Longitude Team</collab>
          , States in United States,
          <year>2012</year>
          . URL: https://www.latlong.net/category/states-236-14.html.
        </mixed-citation>
      </ref>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [14]
          <collab>Lumivero</collab>
          , XLSTAT:
          <article-title>Ordinary Least Squares Regression (OLS)</article-title>
          . URL: https://www.xlstat.com/en/solutions/features/ordinary-least-squares-regression-ols.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [15]
          <string-name><given-names>M.</given-names> <surname>Kuhn</surname></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Johnson</surname></string-name>
          ,
          <source>Feature Engineering and Selection: A Practical Approach for Predictive Models</source>
          ,
          <year>2019</year>
          . URL: https://bookdown.org/max/FES/greedy-stepwise-selection.html.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>