Open Public Data and Early Factor Analysis in a Developing Public Health Event

Serge Dolgikh
National Aviation University, 1 Lubomyra Huzara Ave, Kyiv, 03058, Ukraine

Abstract
The important role of timely public availability of data in the early analysis and formulation of hypotheses in developing public health events, such as infectious epidemics, has been highlighted in many studies. The recent pandemic was perhaps the first instance where serious attempts were made to present consistent and detailed information, on the international scale and in near real-time, for a developing major public health event, with numerous and important implications for the formulation of responses and policies. In this work we analyze the characteristics of publicly available data, including timeliness; granularity, i.e., level of detail; consistency between different reporting jurisdictions; as well as the issues and problems encountered in processing publicly available information for early analysis and the formulation of early hypotheses. Based on this experience and analysis, we attempt to formulate expectations and conditions for the collection and publication of publicly available data for future public health events and, more generally, events with potentially high societal impact.

Keywords
Data collection, public data, statistical analysis, factor analysis

1. Introduction

Whereas the collection and publication of statistics in many areas and aspects of society, politics, demographics and the economy have been a known and relatively common occurrence for some time, the recent onset of the Covid-19 global pandemic prompted the collection, publication and availability of data describing both the local and global spread and impacts of the developing epidemic in near real-time. The availability of these data has been invaluable in the early analysis of multiple potential factors of significance, the formulation of early hypotheses, and the analysis, evaluation of and feedback on public policy decisions.

In this work we address the benefits as well as the caveats of working with publicly available data; methods and practices of collection and preparation of the data for analysis; examples where public “ecological” data was used in the formulation of early hypotheses; and the need for a thorough and comprehensive follow-up by more detailed studies to confirm or reject them. We also discuss possible directions of evolution in the preparation and publication of epidemiological data in the public domain and ways to make it more informative and useful in analysis.

Here the term “ecological data” (not to be confused with the subject of ecology) signifies data that is available or can be obtained directly from observations of the system, entity or process being examined, in contrast, for example, to controlled studies with fixed sets of observed subjects to test formulated hypotheses.

There are numerous instances and well-established practices of data being collected and published in open, public domains and formats on a broad range of subjects, including sociology, economics, consumer behavior and others, but to the author’s best knowledge, the recent Covid-19 pandemic was one of the first examples in history where a sustained attempt was made to collect and publish consistent data directly related to a major developing epidemiological event.
The sources of this information were both diverse and numerous: from national and local health agencies and offices to research institutions, media, and information services such as Google, Facebook and others. Along with the massive volumes of information available for early analysis, almost in real time, this experience highlighted a number of issues and challenges that will be discussed further in this work.

Generally, a “good” ecological dataset, that is, one of high or at least sufficient quality that can be used in statistical analysis without significant modification and/or preparation, can be characterized by the following features:
• Usability: observable factors are clearly interpretable and straightforward to use in the analysis;
• Representativity: contains a sufficient set of samples, reasonably representative of the real distribution;
• Consistency: reporting of observable factors is consistent between different data points; alternatively, consistent data points can be obtained via a well-defined process;
• Detailization (granularity): informative observable factors describe observations of the event or phenomenon at data points in sufficient detail for different forms of analysis;
• Breadth: a sufficiently large range of data points is presented to support the expectation that the data points in the set describe the entire, or most of, the variable and value range of the unknown distribution.

Further in this work, the following terminology will be used:
- Distribution: in general, a variable that describes values of certain factors of interest in the domain of interest; for example, height, weight, immunological, epidemiological and similar factors in the general population.
- Dataset: a set of data composed of data points, each represented by a set, e.g., a numerical vector, of observable factors. A supervised dataset additionally associates data points with a (set of) factors of interest.
- Factors:
  - Observable factors describe parameters of the data in the set that can be obtained from observations, e.g., by measurement.
  - Factors of interest describe certain characteristics of the observed event or phenomenon that are of interest in the study.
  - Informative factors are associated with the observable factors and make it possible to establish a clear relationship between the observable factors and the factors of interest.

The problem of factor analysis [1] can then be defined as establishing the relationship between the factors of interest and the observable/informative factors, based on the information contained in a representative set of data points.

2. Literature Review

Publication of sociological, economic, demographic and other information in open public domains has been common and regular in recent decades, with sites and services such as Statista, Worldometer, Google and many others [2-4] offering access to collected and partially processed data on a wide range of issues and topics. As quoted:
“Worldometer is a provider of global COVID-19 statistics for many caring people around the world. Our data is also trusted and used by the UK Government, Johns Hopkins CSSE, the Government of Thailand, the Government of Pakistan, the Government of Sri Lanka, the Government of Vietnam, The Financial Times, The New York Times, Business Insider, BBC, and many others.” [2]

“Statista is a German online platform specialized in data gathering and visualization, which offers statistics and reports, market insights, consumer insights and company insights in German, English, Spanish and French.” [3]

“The Google Health COVID-19 Open Data Repository is one of the most comprehensive collections of up-to-date COVID-19-related information. Comprising data from more than 20,000 locations worldwide, it contains a rich variety of data types to help public health professionals, researchers, policymakers and others in understanding and managing the virus.” [4]

Multiple datasets have been collected over the years and made available in the public domain for independent analysis of sociological, public and economic behavior [5], resulting in active research and multiple publications.

The precipitous onset of the global Covid-19 pandemic stimulated the collection and publication of information on the development of the epidemic both globally and on the national and subnational levels, providing citizens and research professionals with a near-real-time view of a developing major public health event. The availability of timely and accurate information was crucial for the formulation of sound and effective responses, advisories and policies by supranational and international bodies such as the World Health Organization and the European Centre for Disease Prevention and Control [6,7], national health administrations [8,9], and lower-level subnational public health jurisdictions, such as regional and provincial authorities and local and municipal health offices.

The availability of such data made possible a wide range of research in different aspects and from different perspectives, including: examination and formulation of early hypotheses on potential factors influencing the development and severity of the epidemic [10,11]; investigation into the origins of the epidemic [12]; development patterns and scenarios [13]; evaluation of the effectiveness of policies and responses [14]; and many other directions. All in all, it can be concluded that the availability of such data in open access was a positive factor in facilitating research and the formulation of sound and effective policies.

At the same time, certain challenges and shortcomings in the practice of publication of direct epidemiological information can be noted. When accumulating data from many diverse sources, it was extremely challenging to maintain a set standard of accuracy; the information available from lower-level reporting jurisdictions showed significant differences and inconsistencies [15]; granularity and connectedness of the data, that is, the ability to navigate between reporting levels while preserving some level of accuracy, was generally not achievable with the public data; and the reliability and accuracy of the data varied widely between and even within reporting jurisdictions. All of these factors may have contributed to the challenges of the early analysis of data obtained from public sources, potentially reducing confidence in, and skewing the results of, the analysis.
As the initial, challenging phase of the pandemic appears to be over, the time may be right to review both the successes and the challenges of the early-stage analysis of public data, including the collection and publication of openly accessible epidemiological data. Improvements may address several characteristics of data made available in public access: the content; the accuracy, consistency and reliability of the published data; the level of detail; and usability for analysis; with the potential to facilitate early analysis of developing events, formulation of hypotheses, and correctness of and confidence in the conclusions.

3. Publicly Available Data in Early Factor Analysis and Formulation of Hypotheses

Let us consider a general case of data that can be a sampling of an unknown distribution D:

W = { P, F }

where P = { p }: points of observation, such as individual subjects in medical trials, cities or social groups in sociology, etc.; and F = { f }: the observable factors. We would like to investigate and, if possible, establish the relation R between certain factor(s) of interest K and the observable factors of the data points p:

K(p) = R(F(p))   (1)

The relationship in (1) represents the classical problem of factor analysis [1]: establishing the relationship R between the factor(s) of interest and the observable characteristics of the data points, based on the data W. In this work, we will focus on the methods and approaches by which the sampling of observable data W can be produced in minimal time, accessed and used in early factor analysis research. It is understood that the factor of time can be essential in some situations and applications, of which a developing major public health or social event can be a prime example.
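To make this setup concrete, the sketch below represents a small dataset W = { P, F } as a table, using pandas as one convenient representation; all jurisdictions and values are hypothetical placeholders introduced for illustration only, not actual statistics. The later sketches in this paper reuse this hypothetical W.

```python
# A minimal sketch of the dataset structure W = {P, F}; the jurisdictions
# and all values are hypothetical placeholders, not actual statistics.
import pandas as pd

W = pd.DataFrame(
    {
        "jurisdiction": ["A", "B", "C"],       # P: points of observation
        "population": [5.0e6, 10.0e6, 9.0e6],  # F: observable factors ...
        "pop_density": [110.0, 112.0, 107.0],
        "median_age": [41.0, 46.0, 44.0],
        "total_cases": [60000, 180000, 230000],  # raw recorded case counts
    }
).set_index("jurisdiction")
```

A factor of interest K(p), for example a per capita impact measure, can then be derived from the raw counts, as in the preprocessing sketch of Section 3.2.1.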
3.1. Sources of Public Ecological Data

The practice of collecting and publishing statistical data in an open, publicly accessible format has been in place for a considerable time. Notably, information services such as Worldometer, Statista and others have provided statistical information from multiple national and subnational jurisdictions on a wide spectrum of topics, including economic characteristics, social factors and conditions, and others.

From the onset of the pandemic in the early months of 2020, a wide range of sources emerged for publicly available data related to the progress and the impacts of the pandemic. Specifically, these sources included:
• International organizations: WHO, EU Health and Food Safety Commission and others;
• National and subnational public health offices, agencies and administrations in most national jurisdictions;
• Local and municipal health units and governments;
• Media companies and services;
• Research institutions, foundations and universities [16,17];
• Specialized statistics and information sites and services [2,3];
• Information and social networks [4].

The composition and content of the information published in open public formats are described in Table 1.

Table 1
Composition, content and other characteristics of publicly available Covid-19 data

Type | Example | Scope
International, health | WHO, EHA | Advisories, general information
National, subnational health authorities | CDC, NHS, Health Canada, etc. | National and subnational statistics, advisories, policy
Local and municipal health authorities | Local, municipal health offices | Local statistics, advisories and information, local health policy
Media | Many national, regional and local media sources | National and subnational statistics, advisory, information, stories
Information sites, social networks | Worldometer, Statista, Google | General information, national, subnational and other statistics

Statistical data was published in a variety of layouts and formats. Some of the information was available in formats friendly to statistical analysis, such as Excel, csv, xml, etc. However, in many cases issues with consistency in the collection and presentation of data were encountered that were notable complicating factors in the analysis of the data.

3.2. Early Factor Analysis with Public Ecological Data

A study that is based on the analysis of any data begins with:
• Definition of the objectives of the study;
• Choosing the method or methods to pursue them;
• Establishing the source and obtaining the data;
• Preparing the data for the analysis.

The open availability of data in the public domain can significantly simplify and speed up locating the data that will be used for the study; in comparison to traditional controlled studies, in some applications and types of analysis the data can be obtained almost instantly, saving significant lead time in accumulation and collection. However, this in no way means that working with ecological data is free of problems and challenges. Several of them are outlined below.

Research-friendly format: from the outset, many sources of open-to-the-public ecological data are oriented toward informing the public and are not necessarily research-friendly. This can mean that the data is made available in a difficult-to-process format, such as HTML, plain text, graphics, etc., and has to be extracted and compiled manually (see the sketch after this list). Some sources, including public health authorities, research institutions and others (as listed in Section 3.1), may have strived to provide data in research-friendly formats, including csv, XML and other structured data formats.

Consistency: the data is usually obtained from national and subnational public health sources, and consistency in collecting and processing the data cannot be assured; specifically, it is known that different criteria have been applied by national reporting authorities in the collection of case statistics, and possibly other data.

Accuracy: a consistent accuracy standard cannot be assured with data obtained from different jurisdictions; an examination of different sources of data may be warranted.

Breadth and depth of data: the availability of data representing all characteristic groups/regions in the distribution, and the possibility to trace data to lower-level sources, for example, from the national to the regional, local and municipal levels.
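As an illustration of the first point in the list above, a statistics page published as a plain HTML table can often be compiled into a research-friendly format with a few lines of code. The following is a sketch only: the URL is a hypothetical placeholder and the page is assumed to contain a simple table element.

```python
# A sketch of extracting a statistics table from an HTML page into a
# research-friendly csv file; the URL is a hypothetical placeholder.
import pandas as pd

URL = "https://example.org/health/weekly-report.html"  # hypothetical source

# pandas.read_html returns one DataFrame per <table> found on the page
# (requires an HTML parser such as lxml to be installed)
tables = pd.read_html(URL)
report = tables[0]

report.to_csv("weekly_report.csv", index=False)  # structured, reusable format
```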
Yet, despite these challenges, publicly available ecological data can be used in the early analysis of trends in the development of an event and in the formulation of hypotheses that can be confirmed or ruled out at a later time, when more data has been collected and a more comprehensive analysis can be performed. In this section, we provide some examples of early factor analysis based on public data collected from open sources.

3.2.1. Collecting and Processing Public Ecological Data

The collection of data begins with identifying a source that satisfies the objectives of the study. Some factors of primary importance are:
1. Representativity, or variance of the data: sufficient for the objectives, for example, to examine the association/correlation of the factor of interest with observable factors across all essential regions of the distribution (for example, population groups).
2. Expression: observable factors describe the data points to a level of detail sufficient for the objectives of the analysis.
3. Accuracy: the data is expected to be accurate within the constraints of the analysis.
4. Consistency: accuracy and other characteristics of collection are expected to be within a reasonable margin of variance, acceptable for the objectives of the analysis.

An example of the last point in the list above, the consistency of data, can be given by an ecological analysis of a factor of interest such as the infection rate among subnational regions. The data published by regional health units can depend on the methods of data collection, reporting standards and practices, and so on; if these procedures are not harmonized or standardized between the regional reporting offices, the consistency of the resulting national data cannot be assured and the accuracy of the analysis may suffer.

Once the data has been accessed and compiled into a set of data points P described by observable factors F(P), possibly with recorded values of the factor of interest, (W(P, F), K(P)), preparation of the data for the analysis can begin. Rather than entering raw recorded data directly into the selected method(s) of analysis, it can be essential for the accuracy of the analysis that the observable factors in the dataset are processed and, perhaps, transformed to provide a consistent and uniform view of the observations. The methods and objectives of preprocessing are described below, with a short illustration after the list.

Scaling (linear and non-linear): statistics of infectious epidemics are often collected and presented as incidence or case count statistics (cumulative, interval, etc.). Understandably, comparing total recorded counts in jurisdictions with one million vs. 100 million population would make little sense. A common practice in statistics is to transform the counts into per capita factors that can provide a better basis for comparison, though this practice is not without caveats, as briefly discussed in the next section. The basic law of linear scaling is Cr → Cr / Npop, where Cr is the total (raw) count and Npop is the population of the region of the recorded total count.

Temporal adjustment: in the analysis of interval values, accumulated over a certain period, it is essential that data points are compared over the same or similar intervals. For this reason, some adjustments may be needed in preparation.

Accuracy and consistency adjustment: where multiple sources of the same or similar factors exist, they can be cross-verified to improve the accuracy and consistency of the data.

Normalization: some standard methods of factor analysis require a standard transformation of the data to produce normalized data as input to the methods of analysis [18].

Feature analysis: covariance analysis of the distributions of observable factors can indicate observable factors that are dependent or correlated. Using such factors unintentionally may amplify their significance and skew the analysis. A decision has to be made whether to keep such factors in the dataset or exclude them to avoid the amplification effect.

Derived factors: in some problems and studies, invariant factors derived from the observed ones can introduce additional perspectives in the analysis.
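A minimal sketch of several of these preprocessing steps, applied to the hypothetical dataset W from the sketch in Section 3, is shown below; the scikit-learn scaler is one common choice for normalization, not the only one.

```python
# A sketch of the preprocessing steps: linear per capita scaling,
# correlation screening, and normalization; W is the hypothetical dataset
# from the sketch in Section 3.
from sklearn.preprocessing import StandardScaler

# Scaling: raw counts Cr -> Cr / Npop (per capita factor of interest)
K = W["total_cases"] / W["population"]

# Feature analysis: correlation screening of observable factors; strongly
# correlated pairs may need to be excluded to avoid amplification effects
F = W[["pop_density", "median_age"]]
print(F.corr().round(2))

# Normalization: transform observable factors to zero mean and unit
# variance, as required by some standard methods of analysis [18]
X = StandardScaler().fit_transform(F)
```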
An example of an analysis with such a derived factor is given in the next section.

3.2.2. Early Factor Analysis and Formulation of Hypotheses

In this section we consider some examples of using ecological public data in the early analysis of factors in the Covid-19 pandemic.

1. Bacille Calmette-Guérin (BCG) Immunization Correlation with Lower Covid-19 Impact Hypothesis

In the early days of the pandemic, based on early case and impact statistics, the hypothesis of a correlation between universal BCG immunization and lower observed Covid-19 impact, measured in the observable factors of incidence, morbidity and mortality, was proposed in [19]. Further studies indicated a certain level of statistical significance of the correlation hypothesis in the period up to the introduction of vaccines. Currently, as a result of several statistical and controlled studies, strong evidence in favor of the correlation hypothesis has not been established.

One possible explanation can be surmised as related to the preprocessing, namely the scaling of the data discussed in the preceding section. It is worth a brief discussion here, if only to underline possible caveats in the direct analysis of early ecological data.

In an analysis of the development of an infectious epidemiological event, it is reasonable to assume that the spread of the infectious agent would be proportional to the rate of essential contact in the population (the characteristics determining the effectiveness of the contact depend on the nature of the infectious agent). However, the factor measuring the rate or intensity of contact cannot be observed and measured directly; nor can it be straightforwardly derived from the factors that can be observed directly, such as the total population, population density and the like. Moreover, geographic patterns of population distribution can vary significantly among jurisdictions, and the average population density may not be a sufficiently informative factor to describe them.

As a factual example, let us consider three countries: Slovakia, population 5 million; Portugal, approximately 10 million; and Austria, in the same range of population as Portugal. Looking at the concentrations of the population, one can observe that the maximum urban populations are similar between Slovakia and Portugal (circa 0.5 million), whereas Austria reaches a significantly higher value: 1.9 million. Hence, while the contact factor can be expected to be similar between the first two countries, it can be significantly higher in Austria due to the higher concentration of the population. For this reason, the assumption of linearity of the infectivity of the agent relative to the population may not be justified in all cases, and the factor(s) of infectious impact per capita, derived from impact count statistics such as total cases and the population, may not lead to correct conclusions in the analysis.

This example indicates that in many cases of early factor analysis, the preprocessing procedures applied in preparation of the data should be considered as assumptions themselves, to be verified in subsequent, more detailed studies.
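A small numeric illustration of this caveat, using the population figures quoted above (Austria taken as approximately 9 million), is given below; the largest-center share is a crude stand-in for contact intensity, introduced here purely for illustration and not a validated epidemiological measure.

```python
# Illustration of the scaling caveat with the figures quoted above; the
# "contact intensity" proxy (the share of the population living in the
# largest urban center) is an assumption made purely for illustration.
populations = {"Slovakia": 5.0e6, "Portugal": 10.0e6, "Austria": 9.0e6}
largest_city = {"Slovakia": 0.5e6, "Portugal": 0.5e6, "Austria": 1.9e6}

for country, pop in populations.items():
    share = largest_city[country] / pop
    print(f"{country}: largest-center share = {share:.2f}")

# Per capita scaling treats all three jurisdictions alike, whereas the proxy
# suggests a materially higher contact intensity in Austria
# (0.21 vs. 0.10 and 0.05).
```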
This example can be summarized as follows:

Problem: ostensible covariation between early Covid-19 impact statistics and the BCG immunization record in national jurisdictions.
Early result: formulation of the hypothesis of a correlation between a lower Covid-19 impact (in cases and morbidity) and a record of universal BCG immunization.
Benefit to the public: proposing the hypothesis stimulated research into innate immunity and testing of the hypothesis with more detailed studies.
Subsequent in-depth analysis and conclusion: the hypothesis was not confirmed by in-depth data analysis and controlled studies.

2. Policy Effectiveness

An ecological dataset was compiled from publicly available data on national and subnational jurisdictions in Europe and North America to examine the effectiveness of public policies, known as “lockdowns”, aimed at controlling the spread of the epidemic. Data on policy in the jurisdictions was collected and graded by the presumed severity of the measures, where the maximum value can be associated with physical curfews and the closure of some or most services, and the lower values with additional information, advice and similar measures. The early hypothesis that was tested was a correlation between the severity of the lockdown and a reduced epidemiological impact, in incidence and morbidity. The compiled dataset had the following structure:

The range of data (P): national and subnational health jurisdictions, Europe and North America;
Impact (factor of interest, K(p)): morbidity, mortality per capita;
Observable factors F(p):
- Policy factors: communications; severity; popularity and engagement; targeted response;
- Social and cultural factors: average population density; population centers; culture of socialization; international connectivity;
- Public health system condition: general condition; resourcing and capacity; preparedness for a major public health event.

Testing the hypothesis with the publicly available data at an early stage, approximately six months after the local onset of the epidemic and before the introduction of vaccines, with the methods of statistical analysis did not demonstrate strong, statistically significant support for the hypothesis.

Summary:

Problem: testing the effectiveness of proposed policy decisions aimed at controlling the spread of the epidemic.
Early result: formulation of the hypothesis of a correlation between the severity of the lockdown policy and a lower Covid-19 impact (cases and morbidity). Early analysis did not support the hypothesis.
Benefit to the public: feedback to policy making. Formulation of a variety of methods and approaches to control of the infectious spread, including public information, advice, environmental engineering and other measures [20].
Subsequent in-depth analysis and conclusion: studies continue.

3. Vaccination Level vs. Rate of Spread of New Viral Variants

In this case, the relation between the rate of vaccination and the rate of spread of the Omicron variant of the Covid-19 virus was analyzed with an ecological dataset compiled from publicly available data for a subset of European jurisdictions. For the rate of spread, two variables were selected: rate_max, the maximum weekly count of new cases reported in the jurisdiction, per capita; and a novel invariant factor rate_inv, defined as the ratio of the maximum case count Cmax (peak) to the preceding minimum Cmin (trough), i.e., using exclusively the characteristics of the case dynamics specific to the jurisdiction (Figure 1). Both types of analysis yielded consistent results.
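A sketch of the two rate-of-spread factors as defined above is given below, assuming a pandas Series of weekly new case counts for one jurisdiction in chronological order; the trough search is a simplified reading of the definition.

```python
# A sketch of the two rate-of-spread factors: rate_max (peak weekly cases
# per capita) and the invariant factor rate_inv = Cmax / Cmin.
import pandas as pd

def spread_factors(weekly: pd.Series, population: float):
    i_peak = weekly.idxmax()
    c_max = weekly.loc[i_peak]         # Cmax: peak weekly case count
    c_min = weekly.loc[:i_peak].min()  # Cmin: preceding minimum (trough)
    rate_max = c_max / population      # peak weekly cases per capita
    rate_inv = c_max / max(c_min, 1)   # peak/trough ratio; guard against 0
    return rate_max, rate_inv

# Hypothetical example: a trough of 200 weekly cases rising to a 9,000 peak
weekly = pd.Series([500, 200, 1500, 4000, 9000, 7000])
print(spread_factors(weekly, population=5.0e6))  # -> (0.0018, 45.0)
```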
Summary:

Problem: examining a possible relation between the rate of vaccination and the rate of spread of the Omicron variant of the Covid-19 virus in European jurisdictions.
Early result: an indication of a possible minor positive correlation, not considered statistically significant; a statistically significant exclusion of a strong negative correlation.
Benefit to the public: development of novel factors describing the rate of spread and of methods to analyze the effectiveness of the vaccination policy.
Subsequent in-depth analysis: statistical studies continue.

Figure 1: Definition of an invariant rate factor rate_inv from the public incidence statistics.

The scenarios described in this section illustrate the use of public ecological data in the early formulation of hypotheses and in factor analysis, with potential benefits to the public.

4. Methods in Early Factor Analysis

Even a remotely detailed discussion of the methods of factor analysis would require a voluminous dedicated work. Here we only outline some of the most obvious choices for beginning the analysis and obtaining initial results.

4.1. Methods of Statistical Analysis

Methods of statistical analysis can be used to obtain correlation factors (such as the correlation coefficient) between an observable factor f and the factor(s) of interest K:

Cf = Corr(W(f), K)

where W(f) is a vertical slice (column) of the dataset W at factor f. Values of Cf approaching 1 in absolute value indicate a strong correlation (positive or negative), whereas those close to zero indicate an insignificant one. Additionally, these methods can produce other essential statistical characteristics of the distributions of observable factors and factors of interest, including the confidence interval of values of Cf at a given confidence level. Examples of the use of statistical methods in early factor analysis of ecological public datasets were discussed in Section 3.2.2.

4.2. Regression and Multivariate Factor Analysis

Effective methods of regression and multivariate factor analysis have been developed and are widely used in practice. These methods produce a projected dependency of the factor(s) of interest on the observable factors and can be both linear (linear regression and interpolation [21]) and non-linear (polynomial and other) in nature. Some methods, including Random Forest regression, SelectKBest and others [22], can additionally produce rankings of the observable factors with respect to their influence on the factor(s) of interest. This capability can be essential in the early analysis of the data and the formulation of early hypotheses about possible influencing factors in a developing epidemiological event.

A related aspect of regression and multivariate analysis is the analysis and verification of the produced dependency (trend analysis). It can be instrumental in the evaluation of potential scenarios in the development of the event, risks, and preventative policy options. Multivariate factor analysis can be a relatively simple and informative approach in the initial analysis of ecological data.

4.3. Supervised Machine Learning

Methods of supervised machine learning can be seen as a type of multivariate regression, not necessarily of a linear type. They can produce an expected value (prediction) for values of observable factors that were not present in the original data via the process of training:

M = T(W(f), K)

where M: a trained method (predictor); W: a training dataset that includes known values of the factor of interest (K) associated with the observations in the set; T: the training process. Once the method has been trained, it can produce predictions as:

k̃ = M(f̃)

where f̃: an arbitrary combination of values of the observable factors f; k̃: the predicted value of the factor of interest.
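A minimal sketch of this scheme is shown below, with a Random Forest regressor as one possible choice of the training process T, reusing the hypothetical F and K from the sketches in Section 3; note that it also produces the factor ranking mentioned in Section 4.2.

```python
# A minimal sketch of the scheme M = T(W(f), K); Random Forest is one
# possible choice of T, and F, K are the hypothetical observable factors
# and factor of interest from the earlier sketches.
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(F, K)                     # training: M = T(W(f), K)

k_pred = model.predict(F.iloc[:1])  # prediction k~ = M(f~) for a given f~

# Ranking of observable factors by influence on the factor of interest
print(dict(zip(F.columns, model.feature_importances_.round(3))))
```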
Many types of supervised methods are used in practice, both linear and non-linear, including ensemble methods, artificial neural networks and many others.

4.4. Unsupervised Learning

Methods of unsupervised learning can be instrumental in the analysis of the structure of the data, expressed in observable parameters, without a known association with the factors of interest (i.e., unmarked, unlabeled data). Many different types and flavors of unsupervised methods exist, linear and non-linear. Some of the simplest are linear methods such as Principal Component Analysis (PCA) and, more generally, Singular Value Decomposition (SVD). These methods produce representations of the data in modified orthonormal (linearly uncorrelated) coordinates obtained by a linear transformation of the observable ones, and they work well with complex data expressed in a large number of observable factors.

Methods of unsupervised learning were applied in the early analysis of Covid-19 epidemiological data to evaluate the distribution of the data points, identify characteristic types or clusters in the data, and identify the informative factors that control the distribution of data points across the characteristic clusters. An illustration of the distribution of an epidemiological impact dataset (Covid-19) in a low-dimensional informative embedding obtained with a generative neural network model is shown in Figure 2.

Figure 2: Distribution of a Covid-19 epidemiological dataset in an informative three-dimensional embedding (generative ANN, [23]).

Many more methods of non-linear unsupervised learning and dimensionality reduction have been developed and used in practice, including spectral, t-SNE and other linear and non-linear embeddings (manifold learning), generative ANN [24,25] and others [26]. An advantage of these methods is that they are not limited to expressing the informative factors (that is, the embedded coordinates) as linear combinations of the observable parameters of the data and can produce effective informative low-dimensional representations (embeddings) of complex real-world data.

Using methods of unsupervised learning, both linear and non-linear, with complex real-world data may produce insights into the characteristic structure of the data and be instrumental in identifying essential factors from the perspective of the general problem of factor analysis discussed in Section 3.
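As a brief illustration of the linear case, the sketch below obtains a three-dimensional PCA embedding; the data matrix here is random stand-in data, whereas a real analysis would use the normalized observable-factor matrix of Section 3.2.1.

```python
# A minimal sketch of a linear unsupervised embedding with PCA; the data
# matrix is random stand-in data, not an actual epidemiological dataset.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))   # 100 data points, 12 observable factors

pca = PCA(n_components=3)        # three-dimensional informative embedding
Z = pca.fit_transform(X)         # rows of X in orthonormal coordinates

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)
```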
5. Discussion

As demonstrated in this work, openly available public data can facilitate the rapid analysis of a developing public health event, from different perspectives and with a variety of methods, thus being instrumental in formulating hypotheses, providing advice and feedback to policy actions, and so on. Taking into account the experience, successes and shortcomings of the public data made available during the onset of the pandemic, we suggest the following measures that can improve the usability of the data, as well as the accuracy of and confidence in the findings and results:
• Ensure lateral consistency of the data by harmonizing reported factors, units, collection and reporting practices, standards and formats [15].
• Attempt to provide and ensure vertical traceability and consistency of the data, i.e., the ability to navigate from higher to lower levels of reporting without significant loss of accuracy.
• Attempt to identify and provide confidence levels in the published statistics.
• Connect and attempt to provide consistency between different observable factors, for example, cases, morbidity, severity, etc.
• Ensure compatibility and consistency between different releases of the data; at the least, it should be possible to convert different releases to the same consistent format.
• Provide a stable permanent repository for the data and easily locatable access point(s).
• Attempt to provide research-friendly formats of the data that do not require significant manual processing.
• As a further perspective, work on the harmonization of reporting practices, standards and the resulting data between national and international structures.

6. Conclusion

In this work we examined the supposition that the availability of detailed, accurate and current data on the development of an event can be essential for the effectiveness of early factor analysis. The methods described in this study, and many others, can produce essential early insights into the factors, dependencies and trends of a developing situation, and can be instrumental in the formulation of hypotheses and in the testing and evaluation of the effectiveness of implemented policies and responses. Practical examples of research cases based on data available in open public access support this view. Challenges in the use of public data were noted and discussed in detail, and recommendations on the preparation and publication of research-friendly data, and on availability and access policies, were formulated for implementation in practice.

The challenges of early factor analysis with publicly available data often relate to the trade-off between shortening the research cycle, for example in the formulation of hypotheses, and the confidence of the conclusions. The assumptions, hypotheses and trends identified in the early phase will need to be tested and verified with more data and research, including controlled studies. However, early analysis can be expected to have a largely positive effect overall, due to the exchange of ideas and the mutual stimulation of research directions in the crucial early phase of developing public events.

An associated intent of this work has been to stimulate a discussion in the research community. With improvements in the practice of collection, preparation and publication of ecological data, as suggested here and as identified in follow-up discussions, it can be fully expected that early statistical analysis of public data will become an instrumental and necessary stage in the evaluation of developing public health events.

7. References

[1] Gorsuch, R.L.: Factor Analysis, 2nd ed. Chronicle Books (1983).
[2] Worldometer: World Statistics. Online: https://www.worldometers.info/about (2023).
[3] Statista.com: Online Data Platform. Online: https://www.statista.com/aboutus/ (2023).
[4] Google Health: COVID-19 Open Data Repository. Online: https://health.google.com/covid-19/open-data/ (2023).
[5] Henry Bull Library: Advanced Sociological Research. Online: https://hbl.gcc.libguides.com/soci377/data
[6] World Health Organization: Coronavirus Disease. Online: https://www.who.int/emergencies/diseases/novel-coronavirus-2019 (2019).
[7] European Centre for Disease Prevention and Control (ECDC): Covid-19. Online: https://www.ecdc.europa.eu/en/covid-19 (2019).
[8] National Health Service, United Kingdom (NHS): Covid-19 Information and Advice. Online: https://www.nhs.uk/covid-19-advice-and-services/ (2019).
[9] Centers for Disease Control and Prevention, USA (CDC): Covid-19. Online: https://www.cdc.gov/coronavirus/2019-ncov/index.html (2019).
[10] Mukherjee, S., Pahan, K.: Is COVID-19 gender-sensitive? Journal of Neuroimmune Pharmacology 16, 38–47 (2021).
[11] Sze, S., Pan, D., Nevill, C.R. et al.: Ethnicity and clinical outcomes in COVID-19: a systematic review and meta-analysis. EClinicalMedicine, 100630 (2020).
[12] Bloom, J.D., Chan, Y.A., Baric, R.S. et al.: Investigate the origins of COVID-19. Science 372 (6543), 694 (2021).
[13] Skegg, D., Gluckman, P., Boulton, G. et al.: Future scenarios for the COVID-19 pandemic. Lancet 397 (10276), 777–778 (2021).
[14] Kim, D.D., Neumann, P.J.: Analyzing the cost-effectiveness of policy responses for COVID-19: the importance of capturing social consequences. Medical Decision Making 40 (3), 251–253 (2020).
[15] Simon, S.: Inconsistent reporting practices hampered our ability to analyze COVID-19 data. Here are three common problems we identified. The COVID Tracking Project, The Atlantic (2021).
[16] Johns Hopkins Coronavirus Resource Center. Online: https://coronavirus.jhu.edu/map.html (2020).
[17] The Centre for Evidence-Based Medicine, Oxford University. Online: https://www.cebm.net/oxford-covid-19-evidence-service/ (2020).
[18] Li, B., Wu, F., Lim, S.-N. et al.: On feature normalization and data augmentation. In: Proceedings of CVPR 2021 (2021).
[19] Escobar, L.E., Molina-Cruz, A., Barillas-Mury, C.: BCG vaccine protection from severe Coronavirus disease 2019 (COVID-19). Proceedings of the National Academy of Sciences 117 (44), 27741–27742 (2020).
[20] Dolgikh, S.: Smart-Covid: intelligent solutions for higher risk environments. HAL archives-ouvertes, hal-02915459 (2020).
[21] Wendland, H.: Scattered Data Approximation. Cambridge University Press (2005).
[22] Dolgikh, S., Mulesa, O.: Covid-19 epidemiological factor analysis: identifying principal factors with Machine Learning. In: 7th International Conference “Information Technology and Interactions” (IT&I-2020), Kyiv, Ukraine, CEUR-WS.org 2833, 114–123 (2021).
[23] Dolgikh, S.: Unsupervised clustering in epidemiological factor analysis. The Open Bioinformatics Journal 14 (1), 63–72 (2021).
[24] Coates, A., Lee, H., Ng, A.Y.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics 15, 215–223 (2011).
[25] Seddigh, N., Nandy, B., Bennett, D., Ren, Y. et al.: A framework & system for classification of encrypted network traffic using Machine Learning. In: 15th International Conference on Network and Service Management (CNSM), Halifax, Canada, 1–5 (2019).
[26] Izonin, I., Tkachenko, R., Dronyuk, I. et al.: Predictive modeling based on small data in clinical medicine: RBF-based additive input-doubling method. Math. Biosci. Eng. 18 (3), 2599–2613 (2021).