=Paper=
{{Paper
|id=Vol-2753/paper2
|storemode=property
|title=Identifying Explosive Epidemiological Cases with Unsupervised Machine Learning
|pdfUrl=https://ceur-ws.org/Vol-2753/paper2.pdf
|volume=Vol-2753
|authors=Serge Dolgikh
|dblpUrl=https://dblp.org/rec/conf/iddm/Dolgikh20
}}
==Identifying Explosive Epidemiological Cases with Unsupervised Machine Learning==
<pdf width="1500px">https://ceur-ws.org/Vol-2753/paper2.pdf</pdf>
<pre>
Identifying Explosive Epidemiological Cases with Unsupervised
Machine Learning
Serge Dolgikh (a, b)
(a) Solana Networks, 301 Moodie Dr., Ottawa, K2H9C4, Canada
(b) National Aviation University, 1 Liubomyra Huzara Ave, 1, Kyiv, 03058, Ukraine


                Abstract
                An analysis of a combined dataset of epidemiological statistics of national and subnational
                jurisdictions, aligned at approximately two months after the first local exposure to Covid-19,
                with unsupervised machine learning methods such as Principal Component Analysis and deep
                autoencoder dimensionality reduction makes it possible to clearly separate milder background
                cases from those with a more rapid and aggressive onset of the epidemic. The analysis and
                findings of this study can be used to evaluate possible epidemiological scenarios and as an
                effective modeling approach to identify possible negative scenarios and design
                corrective and preventative measures to avoid developments with potentially heavy impact.

                Keywords
                Infectious diseases, epidemiology, Covid-19, machine learning, unsupervised learning

1. Introduction
    An analysis of the factors that can influence the course of an epidemic in a given
jurisdiction is both a challenging and an interesting undertaking, given the number of potential factors and
their interactions. For example, a possible link between the effects of the Covid-19 pandemic and a number
of epidemiological factors, including universal immunization programs against tuberculosis with the BCG
vaccine, was proposed in Miller et al. [1] and further investigated in [2-4]. Other factors, such as gender
and ethnicity, age demographics, and social habits such as smoking, were investigated in a
number of studies, including [5,6].
    However, given the large number of factors that may influence the outcome of the
epidemic in each case, identifying the most influential ones can be challenging due
to the number, complexity and interaction of the contributing factors. In this work we attempt an analysis
of a combined dataset of national and subnational reporting jurisdictions, adjusted and aligned at the
same time point of approximately two months after the first local exposure to the Covid-19 epidemic,
with the methods of unsupervised machine learning.
    Unsupervised dimensionality and redundancy reduction methods such as Principal Component
Analysis (PCA) [7] and unsupervised deep artificial neural network models such as autoencoders (AE)
[8] make it possible to analyze the distribution of case data points in the informative parameter spaces
identified by these methods and, in many instances, to identify characteristic regions associated with
the variable of interest, which in this work is the severity of the epidemiological scenario in the
jurisdiction. Combinations of latent and observable parameters that identify such
regions can be used to evaluate and predict the risk of heavier epidemiological impact in a
jurisdiction proactively, with the opportunity to make the necessary corrections before an explosive onset
of the epidemic imposes heavy costs on society.


IDDM’2020: 3rd International Conference on Informatics & Data-Driven Medicine, November 19–21, 2020, Växjö, Sweden
EMAIL: sdolgikh@nau.edu.ua (S. Dolgikh)
ORCID: 0000-0001-5929-8954(S. Dolgikh)
             ©️ 2020 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)
2. Methodology
    As the experience of the pandemic to date shows, timing can be a critical factor in the
development of an epidemic and in an accurate analysis of the corresponding statistical data. To ensure
correctness of the analysis, the study used two approaches: 1) alignment of the data with respect to the
duration of exposure in the reporting jurisdiction, i.e., the dataset is composed mainly of cases that
have the same or a similar time of exposure; and 2) where this is not the case, a time-based adjustment of
the data, so that the statistical records are taken at the same or a similar time of local exposure.
To simplify the timing analysis, the global zero time of the start of the Covid-19 pandemic was defined in
[2] as TZ = 31.12.2019. Exposure times in the study are given in the format TZ + y months, relative to this
time point.
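
    The time alignment described above can be illustrated with a minimal Python sketch. The helper
name months_since, the 30.44-day average month, and the example dates are assumptions made for
illustration only; the study does not specify how fractional months were computed.

```python
from datetime import date

# Global time zero of the pandemic as defined in the text: TZ = 31.12.2019.
TZ = date(2019, 12, 31)

def months_since(start: date, when: date) -> float:
    """Elapsed time in months (assumption: one month = 30.44 days on average)."""
    return (when - start).days / 30.44

# Hypothetical Wave 1 case: first local exposure (LTZ) at the end of January 2020,
# statistical record taken at the beginning of April 2020.
local_time_zero = date(2020, 1, 31)
record_date = date(2020, 4, 1)

global_exposure = months_since(TZ, record_date)               # roughly TZ + 3 months
local_exposure = months_since(local_time_zero, record_date)   # roughly LTZ + 2 months
print(round(global_exposure, 1), round(local_exposure, 1))
```

    Records whose local exposure differs markedly from the two-month target would be the ones that
need the time-based adjustment of approach 2.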
    A number of factors expected to have a strong influence on the course of the epidemic in
each case were identified in [1-3] and other studies, including: the time of local exposure;
demographics; social traditions and lifestyle; the level of economic and social development; the quality and
efficiency of the healthcare system; and, not least, the quality of public health policy making and
execution.
    The methodology is based on processing the input data, expressed as a set of observable parameters
identified and described in the study, with unsupervised machine learning methods to identify
and extract a smaller set of informative features. In many cases, evaluating the distribution of data in the
representations of informative components, such as principal components in PCA or the reduced-dimensionality
encoding of a neural network autoencoder, makes it possible to identify and separate characteristic
classes of cases in the observable data by essential latent parameters that can be linked to the observed
outcome.

2.1.    Data
    Evidently, the time of local exposure to the epidemic is one of the critical parameters of the
impact, so the case data was adjusted and aligned at a similar phase in the development of the epidemic,
chosen based on the availability of data as approximately local Time Zero (LTZ) + two months, i.e.
approximately two months after the first local exposure to the infection. In the study this translates to
the beginning of April 2020 for Wave 1 cases (LTZ in January 2020) and the beginning of May 2020 for Wave
2 cases (LTZ from the end of February to early March 2020).
    A combined dataset of approximately forty cases was thus constructed based on the conditions
outlined in [2], essentially bringing together cases with similar social and economic parameters to
minimize the number of potentially influencing factors, along with the expectation of a certain minimal
level of exposure to the epidemic and reliability of the reported data.
    The dataset was constructed from publicly available current data on the impact of the epidemic per
case, i.e., per reporting jurisdiction. It comprises the current value of the impact recorded in the
jurisdiction (case), measured as mortality per capita m(t) (M.p.c.), per million of population, and
a number of observable parameters selected as described further in this section, under the hypothesis of
a certain level of correlation between the observable parameter set and the severity of the outcome.
    On the relative scale of impact by jurisdiction, the “explosive” cases were normally identified as
those with relative M.p.c. (i.e., relative to the maximum among all reporting jurisdictions worldwide)
of around and above 0.5. This subgroup of cases included all commonly reported cases of high
epidemics impact at the time of writing.
    In evaluating the distribution in the coordinates of the principal components, two higher-impact
clusters of cases were identified by relative impact. The group of explosive cases, with relative M.p.c.
above 0.8, included the well-known first-wave cases of Italy, Spain and New York, with the highest impact
worldwide observed to date. The second group comprised six somewhat milder-impact cases, namely
the United Kingdom, France, Belgium, the Netherlands, Ireland and Quebec (Canada), with relative M.p.c. in
the range from 0.6 to 0.8.
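
    As an illustration, the two thresholds above can be written as a small labelling function. This is
a sketch only: the function name impact_group and the label strings are hypothetical, and the handling
of values exactly at the boundaries is an assumption, since the text does not specify it.

```python
def impact_group(relative_mpc: float) -> str:
    """Label a case by its relative mortality per capita (0 .. 1)."""
    if relative_mpc > 0.8:        # "explosive" cluster: relative M.p.c. above 0.8
        return "explosive"
    if relative_mpc >= 0.6:       # medium cluster: relative M.p.c. from 0.6 to 0.8
        return "medium"
    return "background"           # all remaining, milder cases

# Relative impact values as listed in the Appendix table.
for case, value in [("Italy", 0.969), ("Netherlands", 0.774), ("Germany", 0.052)]:
    print(case, impact_group(value))
```

    The labels serve only to mark cases in the plots of the learned representations.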
    The impact parameter was not used in the training of the unsupervised learning models (i.e. excluded
from the training dataset) but only for identification of the regions of interest (i.e. higher
epidemiological impact) in the latent representations produced by the models as a result of training.
2.2.    Observable Parameters
    The examples of factors of influence can include, among others: genetic differences; population
density, social traditions and cultural practices, past widespread public policy such as immunization;
smoking habits and of course the epidemiological policy of the jurisdiction aimed at controlling the
spread of the disease.
    In addition to common measurable factors such as population density, age demographics and
smoking prevalence, a number of additional factors with potential impact on the severity of the
epidemic pattern were considered in this study, as described in this section. A common comment for
some of them is that, due to limitations of time and resources, a rating scale approach was chosen for
those factors that cannot be measured directly or would be challenging to measure. Understandably, such an
approach can be influenced by subjective perceptions; however, we believe that more robust and
objective techniques can be developed over time, improving the quality of the analysis and the resulting
conclusions.
    Connectivity: intended to measure the intensity of international and regional connections in the
jurisdiction of the case, for example, international, inter and intra-regional travel and migration,
tourism; seasonal and work-related migration and so on; more intensive connection hubs can be
expected to have higher exposure to the pandemics increasing the probability of a heavier impact.
    Social proximity: intended to reflect the closeness of inter-personal connections in the case, again
in multiple spheres and domains, for example: family connections; socializing practices and traditions;
the intensity of business connections; lifestyle practices; social events and others. As
commented previously, modeling such a complex factor as a single-value parameter may expose the
analysis to subjectivity; yet we believed that it could be important for the analysis,
and improvements that make its evaluation by case more objective and accurate are possible in future
studies.
    We also used three rating parameters intended to measure the policy of the jurisdiction as it relates
to the response to the pandemic. They are: 1) epidemiological preparedness of the public healthcare
system for an intensive and rapid development of an epidemic; 2) the effectiveness of the policy
response; and 3) the timeliness of the public health epidemiological response.
    Epidemiological preparedness: intended to measure the preparedness of the health care system to
handle a rapid onset of a large-scale epidemic. This parameter is intended to be specific to the
epidemiological situation rather than to the general state of the health care system (its technological
level, funding and so on).
    Effectiveness of policy response: intended to indicate the quality of the public health policy in
controlling the epidemic based on the scientific data available at the time, including its clarity and
accessibility to the general population, which facilitate the population's readiness to
participate. While there may be a concern that this factor can be influenced by post-impact
considerations and thus be correlated with the outcome after the fact, we believe that with an accurate
approach these risks can be minimized. For example, it is evident that an unclear or misleading policy
message could be highly detrimental to the intended effect, and one does not need the outcome to judge
such policy parameters objectively at the time the decision is made and before the outcome is recorded.
    Timeliness: measures the relative timing of introduction of the epidemiological policy to the local
exposure and development of the epidemics.
    Universal BCG immunization record: indicates the record of a current or previous immunization
program according to classification introduced in [9]. The detailed definition is provided in the
Appendix.
    Finally, epidemiological impact: was measured in Covid-19 caused mortality per 1 million capita
relative to the world’s maximum value at the time of the analysis.
    Due to a large spread within the range of the impact of the epidemics in the dataset, the logarithmic
scale was also used in evaluation of the impact of the epidemics represented by Measured Value
parameter (MV) being the logarithm of mortality per capita (in cases per 1M of population in the
jurisdiction).
    $$MV(\text{locality}, t) = \log\left(\frac{\text{Mortality, cases}}{\text{Population, million}}\right) \qquad (1)$$
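
    A minimal sketch of Eq. (1) and of the relative impact scale is given below. The base of the
logarithm is not stated in the text, so the natural logarithm is assumed here; the function names and
the numbers in the example are hypothetical.

```python
import math

def mortality_per_million(deaths: int, population: int) -> float:
    """Mortality per 1 million of population (M.p.c.)."""
    return deaths / (population / 1_000_000)

def measured_value(deaths: int, population: int) -> float:
    """MV of Eq. (1); natural logarithm assumed, undefined for zero deaths."""
    if deaths <= 0:
        raise ValueError("MV is undefined when no deaths are recorded")
    return math.log(mortality_per_million(deaths, population))

def relative_impact(mpc_values):
    """Scale M.p.c. values by the maximum value in the list."""
    top = max(mpc_values)
    return [value / top for value in mpc_values]

# Hypothetical jurisdiction: 500 deaths in a population of 10 million.
print(mortality_per_million(500, 10_000_000), measured_value(500, 10_000_000))
print(relative_impact([50, 100, 25]))
```

    The logarithmic scale compresses the spread of several orders of magnitude seen in the Impact
column of the dataset.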
    It needs to be noted that in the framework of unsupervised analysis, epidemiological impact is not
known a priori and for that reason it was not used in the evaluation of data with the selected methods.
It was used however to analyze distributions obtained with the models and identify regions of potential
interest, such as combinations of observable parameters associated with the areas of higher
epidemiological impact.
    The resulting dataset of 40 national and subnational cases with the identified observable parameters
and the recorded epidemiological impact at the time of preparation is presented in Table 1, Appendix.
    Reservations and qualifications:
    1. Consistency and reliability of data reported by the national, regional and local health
administrations.
    2. Alignment in the time of reporting may not be consistent between all jurisdictions due to possible
differences in reporting practices.
   Sources:
   World BCG atlas [10]
   Google coronavirus map [11]
   World statistical data [12,13]
   National and subnational jurisdictions Covid-19 information [14-16] and other.

2.3.    Unsupervised Machine Learning
   To evaluate the hypothesis of the correlation between the identified parameters and the
epidemiological outcome in the cases in the dataset, several common machine learning methods were
used:
   1. Linear regression.
   2. Principal Component Analysis and identification of principal informative factors.
   3. Unsupervised deep neural network-based dimensionality reduction and selection of dominant
informative factors.
   The first method produces a best fit linear approximation of the resulting effect series with a
minimum total deviation from the trend [17].
   Principal Component Analysis [7] produces a linear transformation of the data to the coordinates
with the highest variation. The method is based on the characteristics of the data and does not require
prior knowledge of the recorded outcome.
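
   The transformation can be sketched with NumPy as follows. The example is an illustration only: it
uses eight cases and four columns (p-tme, Conn, Bcg, Soc) taken from Table 1 in the Appendix rather
than the full parameter set, and it computes the components with a singular value decomposition,
which is one standard way to obtain them and not necessarily the procedure used in the study.

```python
import numpy as np

# Columns: p-tme, Conn, Bcg, Soc (values from Table 1, Appendix).
cases = ["Taiwan", "Japan", "Germany", "Sweden", "Spain", "Italy", "New York", "Poland"]
X = np.array([
    [0.0, 0.1, 0.0, 0.2],
    [0.0, 0.6, 0.0, 0.2],
    [0.2, 0.5, 0.2, 0.4],
    [0.3, 0.1, 0.6, 0.3],
    [0.8, 0.5, 0.8, 0.8],
    [0.9, 0.7, 1.0, 0.8],
    [0.9, 1.0, 1.0, 0.8],
    [0.1, 0.2, 0.0, 0.3],
])

# Center the data; the impact (outcome) column is deliberately not part of X.
mean = X.mean(axis=0)
Xc = X - mean

# Rows of Vt are the principal directions, ordered by decreasing variance.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)   # share of total variance per component

# Coordinates of every case in the principal component space.
Z = Xc @ Vt.T

print(np.round(explained, 3))
for name, coords in zip(cases, Z):
    print(name, np.round(coords[:2], 2))
```

   Plotting the first columns of Z and colouring the points by recorded impact gives the kind of view
discussed in the Results section.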
   A deep neural network autoencoder (method 3) performs a non-linear dimensionality reduction of
the observable data to the lower-dimensional representation with identified informative features. The
diagram of the architecture of the unsupervised autoencoder model is given in Figure 1.


Figure 1: Redundancy reduction with deep neural network autoencoder model
    The structure of the deep neural network model used in this work is described in detail in [18].
    In the unsupervised training phase, the model is trained to reproduce the input data with good
accuracy and thus does not require labels marked with the outcome; the same applies to PCA. An
improvement in the accuracy of reproduction of the input data, which can be measured by a number of
training metrics, indicates that the model has learned some essential characteristics of the initial
distribution. The aim of unsupervised learning is thus to minimize the deviation of the original training
sample from its regeneration created by the model.
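
    The training objective can be illustrated with a deliberately small NumPy sketch. It is not the
architecture of [18]: a single tanh encoding layer of size 3 and a linear decoder are assumed, the
data is synthetic, and the learning rate and number of steps are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 40 cases, 9 observable parameters
# generated from 3 hidden factors. This is not the dataset of the paper.
latent = rng.uniform(-1.0, 1.0, size=(40, 3))
X = np.tanh(latent @ rng.normal(size=(3, 9)))

n_features, code_size = X.shape[1], 3   # code_size: central encoding layer
W1 = rng.normal(scale=0.1, size=(n_features, code_size)); b1 = np.zeros(code_size)
W2 = rng.normal(scale=0.1, size=(code_size, n_features)); b2 = np.zeros(n_features)

def encode(x):
    return np.tanh(x @ W1 + b1)

def decode(h):
    return h @ W2 + b2

learning_rate, losses = 0.2, []
for step in range(4000):
    h = encode(X)
    err = decode(h) - X                      # deviation of regeneration from input
    losses.append(float(np.mean(err ** 2)))  # mean squared reconstruction error
    # Backpropagation of the reconstruction error; no outcome labels are used.
    d_out = 2.0 * err / err.size
    d_hid = (d_out @ W2.T) * (1.0 - h ** 2)
    W2 -= learning_rate * (h.T @ d_out); b2 -= learning_rate * d_out.sum(axis=0)
    W1 -= learning_rate * (X.T @ d_hid); b1 -= learning_rate * d_hid.sum(axis=0)

codes = encode(X)   # 3-dimensional unsupervised representation of each case
print(losses[0], losses[-1], codes.shape)
```

    A falling reconstruction error is the training metric referred to above; the rows of codes play the
role of the latent coordinates in which clusters are then examined.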
3. Results
   In this section we present the results of the analysis of the dataset with methods outlined in the
previous section with a brief discussion.

3.1.    Linear Regression
    Linear regression with 9 identified observable parameters produced a trend with a strong correlation
score to the recorded impact, with a value of 0.9 out of a maximum of 1.0. The factors with the highest
influence on the regression trend measured in logarithmic M.p.c. are shown in Table 1.
Table 1
Linear Regression analysis
                       Factor           Linear regression score         Correlation
              Policy, timing                     0.534                     0.906
              Connection                         0.196                     0.697
              Policy, effectiveness              0.094                     0.856
              BCG immunization                   0.092                     0.686
              Social proximity                   0.078                     0.794
    Policy factors were expected to have a strong influence on the outcome of the case, which is confirmed
by the results of the linear regression analysis. In addition, the importance of other factors such as
connection intensity, social proximity culture, BCG immunization and smoking was observed.
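
    A least-squares fit of this kind can be sketched as follows. The example uses twelve cases and
three predictors (p-tme, Conn, Soc) from the Appendix table with the natural logarithm of the relative
impact as the target. Both the subset and the use of the relative rather than absolute M.p.c. are
assumptions, so the numbers it produces are not expected to match the table above.

```python
import numpy as np

# Columns: p-tme, Conn, Soc, relative impact (values from the Appendix table).
data = np.array([
    [0.0, 0.1, 0.2, 0.001],   # Taiwan
    [0.0, 0.6, 0.2, 0.002],   # Japan
    [0.0, 0.4, 0.3, 0.004],   # Singapore
    [0.1, 0.1, 0.2, 0.017],   # Finland
    [0.2, 0.5, 0.4, 0.052],   # Germany
    [0.3, 0.1, 0.3, 0.148],   # Sweden
    [0.7, 0.7, 0.5, 0.248],   # UK
    [0.5, 0.7, 0.5, 0.429],   # Belgium
    [0.8, 0.5, 0.8, 0.965],   # Spain
    [0.9, 0.7, 0.8, 0.969],   # Italy
    [0.9, 1.0, 0.8, 1.000],   # New York
    [0.1, 0.2, 0.3, 0.066],   # Poland
])
predictors, impact = data[:, :3], data[:, 3]

# Design matrix with an intercept column; target on the logarithmic scale.
A = np.column_stack([np.ones(len(predictors)), predictors])
y = np.log(impact)

# Ordinary least squares: minimizes the total squared deviation from the trend.
coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef

# Correlation between the fitted trend and the recorded (log) impact.
r = np.corrcoef(fitted, y)[0, 1]
print(np.round(coef, 3), round(float(r), 3))
```

    Unlike the two unsupervised methods, this fit uses the recorded impact as the target, which is
why it serves as a check of the correlation hypothesis rather than as a representation method.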

3.2.    Principal Component Analysis
   Principal component analysis identified three principal components with overall influence above
95% as shown in Table 2. The highest influence factors in the PCA analysis were mostly aligned with
the results of the linear regression analysis: policy-timing, connection hub, social proximity, BCG and
smoking prevalence.
   PCA is an inherently unsupervised method of learning, meaning that prior known
outcome labels are not required to learn the principal components or the representation of the input
data in the coordinates of the identified principal component eigenvectors. Plotting the data in these
coordinates and marking the cases with the highest recorded impact of the epidemic shows where such
cases are located in the principal component representation.
Table 2
Covid-19 Principal component analysis
                Eigenvector              Observable parameters                   Weight
      Axis 1                                 Policy-time, BCG                     0.570
      Axis 2                                   BCG, smoking                       0.166
      Axis 3                              Connection hub, social                  0.127
                                                 proximity
   Figure 2 shows visualizations of the distribution of the dataset of epidemiological cases in the
coordinates of the three principal components with the highest variation identified by PCA.
The cases and the approximate region of the highest-impact cluster are shown in blue, defining the region
of principal coordinate values with the highest recorded impact of the epidemic; in a similar way, the
cluster with medium impact (6 cases) is shown in magenta.
   A clear separation of the higher-impact case clusters from the general background cases can be
observed in the diagrams. It makes it possible to identify the region where the cases with potentially
higher impact, including the “explosive” pattern, are distributed in the latent coordinates of the principal
component representation.
Figure 2: Higher impact cluster identification with PCA
   A straightforward linear transformation then yields the corresponding region of interest in
the initial, observable parameter space, making it possible to identify the combinations of
observable parameters that can be linked to outcomes with higher epidemiological impact.
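
   The back-transformation can be sketched in NumPy as below. The sketch assumes a small subset of
Table 1 in the Appendix (eight cases; columns p-tme, Conn, Bcg, Soc), keeps two principal components
instead of three because only four columns are used, and describes the region of interest simply by
the latent points of the three highest-impact cases. All of these are illustrative choices.

```python
import numpy as np

# Columns: p-tme, Conn, Bcg, Soc. Rows: Taiwan, Japan, Germany, Sweden,
# Spain, Italy, New York, Poland (values from Table 1, Appendix).
X = np.array([
    [0.0, 0.1, 0.0, 0.2],
    [0.0, 0.6, 0.0, 0.2],
    [0.2, 0.5, 0.2, 0.4],
    [0.3, 0.1, 0.6, 0.3],
    [0.8, 0.5, 0.8, 0.8],
    [0.9, 0.7, 1.0, 0.8],
    [0.9, 1.0, 1.0, 0.8],
    [0.1, 0.2, 0.0, 0.3],
])
high_impact = [4, 5, 6]   # Spain, Italy, New York

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 2                                  # number of retained components (assumption)
Z = (X - mean) @ Vt[:k].T              # forward transform to latent coordinates

# Region of interest in latent space: here, the points of the high-impact cases.
region_latent = Z[high_impact]

# Inverse linear transform: latent coordinates back to observable parameters.
region_observable = region_latent @ Vt[:k] + mean
lower = region_observable.min(axis=0)
upper = region_observable.max(axis=0)
print(np.round(lower, 2), np.round(upper, 2))
```

   The resulting lower and upper bounds give, per observable parameter, an approximate range
associated with the higher-impact region; with all components retained the inverse transform
reproduces the original data exactly.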

3.3.    Analysis with Unsupervised Autoencoder
    A similar approach can be demonstrated with an unsupervised neural network autoencoder model
that reduces the number of parameters by compressing the observable data space into a lower-dimensional
representation in an unsupervised training process aimed at improving the accuracy of
regeneration from the compressed representation. Models of a similar type were used to create
structured unsupervised representations of different data types via unsupervised autoencoder training
with minimization of generative error [8].
    The dimensionality of the unsupervised representation for the models in the study, which is defined
by the size of the central encoding layer, was chosen based on the results of the Principal Component
Analysis in the previous section, which indicated three most informative components.
    Presented in Figure 3 are direct visualizations of the distributions of data in the unsupervised
representation created by a trained autoencoder model.


Figure 3: Identification of higher impact cluster with deep autoencoder
    The highest impact cluster of three cases is shown in green whereas the medium one (6 cases), in
orange. Again, a similar pattern of clear separation of higher-impact cases from the general background
can be observed with these models, in full agreement with the results of PCA analysis in the previous
section.
    It is worth noting that, as with PCA, autoencoder models, though essentially non-linear, also
make it possible to identify the higher-impact regions in the coordinates of the observable parameters.
This can be achieved by forward-propagating the identified region of interest, defined by a set of
characteristic points in the latent representation, through the generative part of the model to obtain
the corresponding region in the observable parameters. The combinations of observable parameters that
produce the effect of interest can thus be identified proactively and used in the development of an
effective preventative or mitigating epidemiological policy.
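
    The forward propagation step can be sketched as follows. The decoder here is a placeholder with
random, untrained weights and an assumed layout (3 latent units, 6 hidden units, 9 outputs with a
sigmoid), and the latent box is hypothetical; in practice the weights would come from the trained
autoencoder and the region from the observed higher-impact cluster.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)

# Placeholder decoder weights (assumption): stand-ins for a trained generative part.
W_h = rng.normal(size=(3, 6)); b_h = np.zeros(6)
W_o = rng.normal(size=(6, 9)); b_o = np.zeros(9)

def decode(z):
    """Generative part: latent coordinates -> observable parameters in (0, 1)."""
    hidden = np.tanh(z @ W_h + b_h)
    return 1.0 / (1.0 + np.exp(-(hidden @ W_o + b_o)))

# Hypothetical region of interest in the 3-dimensional latent space,
# described by the eight corners of an axis-aligned box.
box_low = np.array([0.5, -0.2, 0.1])
box_high = np.array([1.0, 0.3, 0.6])
corners = np.array(list(itertools.product(*zip(box_low, box_high))))

# Forward-propagate the characteristic points through the decoder.
observable = decode(corners)
param_min = observable.min(axis=0)
param_max = observable.max(axis=0)
print(np.round(param_min, 2))
print(np.round(param_max, 2))
```

    Because the decoder is non-linear, the images of the corner points only approximate the image of
the whole region; sampling more points inside the region would give a tighter description.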

4. Conclusion
    The methods of unsupervised machine learning can be effective in identifying and separating the
informative features in complex general data. In this work, two different methods of unsupervised
learning, applied independently, consistently demonstrated good separation of cases with higher
Covid-19 epidemiological impact from the general background.
    The analysis and the findings of the study can be used to evaluate possible epidemiological
scenarios in a jurisdiction based on the factors identified and discussed in this work, as well
as those that can be added in subsequent studies. Further research and development in the identified
direction has the potential to produce effective modeling tools to identify areas of potential
epidemiological risk in public healthcare policy and to design corrective and / or preventative measures
to avoid the heavier impact scenarios.
    Further studies can be focused on improving the accuracy of measurement of the identified
observable parameters as well as introducing additional ones, leading to higher accuracy and confidence
of the evaluation.

5. References
[1] Miller A., Reandelar M-J., Fasciglione K., Roumenova V., Li Y., Otazu G.H. Correlation between
     universal BCG vaccination policy and reduced morbidity and mortality for COVID-19: an
     epidemiological study, medRxiv doi: 10.1101/2020.03.24.20042937 (2020).
[2] Dolgikh S., Further evidence of a possible correlation between the severity of Covid-19 and BCG
     immunization, preprint MedRxiv, doi: 10.1101/2020.04.07.20056994v2 (2020).
[3] Sharma A., Sharma S.K., Shi Y., et al. BCG vaccination policy and preventive chloroquine usage:
     do they have an impact on COVID-19 pandemic? Cell death & disease, 11(7), 1-10 (2020).
[4] Yitbarek K., Abraham G., Girma T. et al. The effect of Bacillus Calmette–Guérin (BCG)
     vaccination in preventing severe infectious respiratory diseases other than TB: implications for the
     COVID-19 pandemic. Vaccine 38(41), 6374–6380 (2020).
[5] Ebina-Shibuya, R., Horita, N., Namkoong, H., Kaneko, T. National policies for paediatric
     universal BCG vaccination were associated with decreased mortality due to COVID-19.
     Respirology (Carlton, Vic.), https://europepmc.org/article/pmc/pmc7323121 (2020).
[6] Dayal, D., Gupta, S. Connecting BCG vaccination and COVID-19: additional data. Medrxiv doi:
     10.1101/2020.04.07.20053272 (2020).
[7] Jolliffe I.T., Principal Component Analysis, Series: Springer Series in Statistics, 2nd edition,
     Springer, NY (2002).
[8] Bengio Y., Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2(1),
     1–127 (2009).
[9] Zwerling A., Behr M.A., Verma A., Brewer T.F., Menzies D., Pai M., The BCG World Atlas: a
     database of global BCG vaccination policies and practices. PLOS Medicine, doi:
     10.1371/journal.pmed.1001012, (2011).
[10] BCG World Atlas online, URL: http://www.bcgatlas.org/
[11] Coronavirus data and map, URL: https://www.google.com/covid19-map/ (4.04.2020).
[12] Our World in Data: World smoking prevalence, URL: https://ourworldindata.org/smoking
     (4.04.2020).
[13] Worldometers: Population data, URL: https://www.worldometers.info/world-population/
     (4.04.2020).
[14] Canada Covid-19 Situation Update, URL: https://www.canada.ca/en/public-
     health/services/diseases/2019-novel-coronavirus-infection.html?topic=tilelink (4.04.2020).
[15] CDC Covid-19 Advice, URL: https://www.cdc.gov/coronavirus/2019-ncov/index.html (2020).
[16] NHS Covid-19 Advice, URL: https://www.nhs.uk/conditions/coronavirus-covid-19/ (2020).
[17] Freedman D., Statistical Models: Theory and Practice. Cambridge University Press (2005).
[18] Prystavka P., Cholyshkina O., Dolgikh S., Karpenko D., Automated object recognition system
     based on aerial photography. In: 10th International Conference on Advanced Computer
     Information Technologies ACIT-2020 Deggendorf, Germany (2020).
Appendix. Time-adjusted Dataset of Epidemiological Cases

Table 1
Epidemiological Case Dataset adjusted at LTZ + 3 months
                       ------ Policy ------
Case                   p-prep p-qlty  p-tme   Conn    Bcg    Smo    Den    Soc    Age  Impact
Taiwan                      0      0      0    0.1      0   0.34    0.3    0.2    0.3   0.001
Japan                     0.1    0.1      0    0.6      0  0.674    0.3    0.2    0.5   0.002
Singapore                   0      0      0    0.4      0   0.33    0.5    0.3   0.25   0.004
Australia                 0.2    0.2      0    0.2    0.3  0.298   -0.5    0.3   -0.4   0.005
South Korea               0.1    0.2      0    0.2      0  0.996    0.3    0.2      0   0.013
Finland                   0.3    0.2    0.1    0.1    0.3  0.418   -0.2    0.2    0.3   0.017
Canada                    0.4    0.2    0.2    0.3    0.8  0.354   -0.5    0.4      0   0.023
Ontario (Canada)          0.4    0.2   0.25    0.3    0.8  0.258   -0.2    0.4      0   0.025
Germany                   0.3    0.2    0.2    0.5    0.2  0.608    0.2    0.4    0.5   0.052
Sweden                    0.3    0.3    0.3    0.1    0.6  0.412    0.0    0.3      0   0.148
UK                        0.5    0.7    0.7    0.7    0.8  0.398    0.2    0.5      0   0.248
France                    0.5    0.5    0.6    0.7    0.6  0.596    0.2    0.7   -0.2   0.371
Belgium                   0.5    0.4    0.5    0.7      1   0.53    0.2    0.5      0   0.429
Spain                     0.8    0.7    0.8    0.5    0.8  0.584    0.2    0.8    0.5   0.965
Italy                     0.8    0.8    0.9    0.7      1  0.566    0.2    0.8    0.5   0.969
USA                       0.5    0.5    0.5    0.3      1   0.39   -0.2    0.4   -0.4   0.095
New York (USA)            0.8    0.8    0.9      1      1   0.25    0.5    0.8   -0.5   1.000
California (USA)          0.5    0.3    0.2    0.5      1  0.226    0.1    0.4   -0.5   0.040
Slovakia                  0.2    0.2    0.2      0      0  0.794    0.2    0.2   -0.1   0.016
Argentina                 0.4    0.3    0.3      0      0  0.478   -0.2    0.3   -0.5   0.019
Chile                     0.2    0.2    0.1      0      0   0.76    0.1    0.2   -0.5   0.050
Ukraine                   0.6    0.4    0.3      0      0   0.94    0.2    0.4    0.1   0.027
Poland                    0.3    0.2    0.1    0.2      0  0.648    0.2    0.3   -0.1   0.066
Moldova                   0.6    0.4    0.3      0      0   0.56    0.2    0.4   -0.4   0.125
Czechia                   0.3    0.2    0.1    0.1      0  0.766    0.2   0.25      0   0.082
Croatia                   0.3    0.2    0.1      0      0   0.74    0.2   0.25    0.5   0.068
Albania                   0.3    0.2    0.1      0      0    0.8    0.2   0.25   -0.5   0.038
Greece                    0.2    0.1      0    0.4      0      1    0.2    0.5    0.5   0.049
Israel                    0.1    0.1    0.1    0.4    0.3  0.382    0.2    0.2   -0.5   0.094
Prairies (Canada) (1)     0.3    0.2    0.2      0    0.6  0.292   -0.3    0.2   -0.3   0.016
Quebec (Canada)           0.6    0.4    0.5    0.3    0.8  0.304   -0.2    0.5    0.3   0.912
Norway                    0.2    0.2    0.2    0.2    0.2  0.452   -0.2   0.25   -0.1   0.138
Denmark                   0.2    0.2    0.2    0.1    0.3  0.352    0.2   0.25    0.1   0.303
Switzerland               0.2    0.2    0.2    0.2    0.3   0.51    0.2   0.25   0.25   0.603
Austria                   0.2    0.2    0.2    0.2    0.2  0.704    0.2   0.25    0.4   0.238
Portugal                  0.3    0.3    0.3    0.3      0   0.63    0.2    0.5    0.5   0.355
Ireland (2)               0.4    0.3    0.5    0.4    0.2  0.444    0.2    0.6   -0.4   0.653
Netherlands               0.3    0.4    0.4    0.5      1  0.524    0.3   0.25    0.4   0.774

(1) Manitoba and Saskatchewan provinces, Canada
(2) Inconsistencies in implementation of universal BCG policy, [19]
Observable factors
Policy
          p-prep: health care preparedness, range 0 .. 1, lower to higher preparedness
          p-qlty: response measures, range 0 .. 1, lower to higher epidemiological policy quality;
          p-tme: response timing, range 0 .. 1, timely to delayed
Conn: connection intensity, range 0 .. 1, lower to higher connection intensity
Bcg: BCG immunization record, range 0 .. 1. The value of 0 indicates current or very recent universal
immunization policy; the value of 1 indicates no effective immunization policy and equivalent cases
[2]. A value between 0 and 1 indicates a previous universal immunization policy relative to the time
after cessation.
Smo: smoking prevalence in the population. In cases with a large disparity between genders or other
groups, the higher of the values was taken.
Den: population density. Due to significant variability in population density between the cases in the
dataset, a logarithmic band scale was used; additionally, in cases with very large territory, a negative
offset was added to account for non-homogeneousness of the distribution of individual cases and the
delay in propagation of the epidemics due to geographical distance. A higher granularity analysis of
national jurisdictions with very high geographical spread can be attempted in a future study.
Age: age demographics, median age, logarithmic band of the deviation from the dataset mean,
range: -0.5 .. 0.5.
Outcome parameter
Impact: the epidemiological impact in the jurisdiction at the time of analysis measured as relative
mortality per 1 Million capita (R.mpc), relative to world’s highest at the time.

</pre>