A scalable pipeline for COVID-19: the case study of Germany, Czechia and Poland

Wildan Abdussalam1,*, Adam Mertel1, Kai Fan1, Lennart Schüler1,2, Weronika Schlechte-Wełnicz1 and Justin M. Calabrese1,3,4

1 Center for Advanced Systems Understanding, Helmholtz-Zentrum Dresden-Rossendorf, Untermarkt 20, 02826 Görlitz, Germany
2 Department of Computational Hydrosystems, Helmholtz Centre for Environmental Research (UFZ), Permoserstraße 15, 04318 Leipzig, Germany
3 Department of Ecological Modelling, Helmholtz Centre for Environmental Research (UFZ), Permoserstraße 15, 04318 Leipzig, Germany
4 Department of Biology, University of Maryland, College Park, MD, USA

Abstract
Throughout the coronavirus disease 2019 (COVID-19) pandemic, decision makers have relied on forecasting models to determine and implement non-pharmaceutical interventions (NPI). Building such models requires continuously updated datasets from various stakeholders, including developers, analysts, and testers, to provide precise predictions. Here we report the design of a scalable pipeline, named where2test, which synchronizes data to support inter-country top-down spatiotemporal observations and forecasting models of COVID-19 for Germany, Czechia and Poland. We have built an operational data store (ODS) using PostgreSQL to continuously consolidate datasets from multiple data sources, perform collaborative work, facilitate high-performance data analysis, and trace changes. The ODS has been built to store COVID-19 data not only from Germany, Czechia, and Poland but also from other areas. By employing the dimensional fact model, the metadata schema is capable of synchronizing the various data structures from those regions and is scalable to the entire world. Next, the ODS is populated using batch Extract, Transform, and Load (ETL) jobs. SQL queries are subsequently created to reduce the need for users to pre-process data.
The data can then support not only forecasting using a version-controlled Arima-Holt model and other analyses to support decision making, but also risk calculator and optimisation apps [1, 2]. The data synchronization runs at a daily interval, and the results are displayed at https://www.where2test.de.

Proc. of the First International Workshop on Data Ecosystems (DEco'22), September 5, 2022, Sydney, Australia
* Corresponding author
Email: w.abdussalam@hzdr.de (W. Abdussalam); j.calabrese@hzdr.de (J. M. Calabrese)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

1. Introduction
In building forecasting models of COVID-19, many researchers employ the training datasets provided by each country's representative institutions, e.g., the Robert Koch Institute (RKI) in Germany. The publicly accessible COVID-19 data, provided in raw textual formats such as CSV, JSON, and XML, are downloaded and analysed by researchers employing either statistical or machine learning approaches. However, the data are poorly structured and require heavy pre-processing and ingestion work before further analysis. This approach is inherently inefficient because each researcher performs the same manual pre-processing of the RKI data in parallel (using, e.g., Python or R scripts). This reduces the efficiency of everyone's work, as all have to spend hours or days pre-processing data before turning to modeling and forecasting. Advanced computing infrastructures and novel software pipelines are crucial tools to synchronize data structures that originate from various sources and to greatly reduce heavy pre-processing [3]. They serve as essential prerequisites for realising the data surveillance and outbreak response management that have been implemented in fighting other endemic diseases [4, 5, 6, 7].

To date, data management has been applied in controlling the outbreak of COVID-19 [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. Most of these efforts provide maps and prevalence data at the following regional levels: (i) National level, e.g., COVID-19 data worldwide [10], for Europe [11, 12, 13], and for Latin America [14]; (ii) State and county levels, e.g., the COVID-19 data warehouse for Italy [15], the COVID-19 dashboard for the UK [17], the COVID-19 dashboard for Maryland [18], and for Germany [19]; (iii) City level, e.g., Dresden, Germany [20]. A more complete version is provided by Johns Hopkins University [21], which serves a dashboard and prevalence data for each regional level in the USA as well as for most countries around the world. Likewise, a similar method with a semi-automatic validation strategy was used to check the quality of daily updated numbers against governmental/official data sources [22]. However, most dashboards and data warehouses do not provide features that let users perform an inter-country top-down spatiotemporal observation, i.e., observing the inter-country prevalence while simultaneously being able to drill down to the microscopic level (nation → state → county → municipality). Such features could provide insights, for example, into COVID-19 border dynamics, which have so far attracted considerable attention [23, 24, 25, 26]. Moreover, they lack forecasting features, which play a key role in predicting the future prevalence as well as in determining non-pharmaceutical interventions (NPI).

A tremendous number of forecasting models, e.g., agent-based [27], machine learning [28, 29], combination [30, 31], compartment [32, 33, 34, 35, 36], and time series models [37, 38, 39, 40, 41, 42, 43], have employed government datasets to provide essential inputs for public decisions. However, most of the datasets used in those studies are limited to a specific time window, so the models are likely to produce different results when the datasets are updated. Establishing a system of forecasts assisted by daily updated datasets is therefore an alternative to improve their consistency and precision.

In this paper, we address the aforementioned issues by proposing the design of a scalable pipeline which allows us to perform top-down spatiotemporal observation among Germany, Czechia, and Poland as well as daily forecasts. The method of the pipeline, which consists of the extraction of various data sources and the ODS, is described in subsec 2.1. More specifically, we describe the dimensional fact database model and the daily migration process which underlie the data synchronization between the various data sources and our database server. We employ the dimensional fact model because it offers more flexibility and versatility in building spatiotemporal aggregation functions than the nanocubes model [44, 45]. Next, in subsec 2.2 we describe the time-series forecasting models which are supported by the ODS. The automatic system of daily forecasts enabled by the pipeline is also laid out in that subsection. In Sec. 3, we describe the facilities that have been established thanks to the ODS. To demonstrate the inter-country top-down spatiotemporal observations, the analysis begins at the macroscopic scale, with the study of the virus spread across national borders described in subsec 3.1. Herein we consider the borders among Germany, Czechia and Poland as a case study. In subsec 3.2, we move to a more microscopic level by applying a forecast assisted by daily updated datasets to the prevalence in the state of Saxony, Germany. Last but not least, in subsec 3.3, the most microscopic level that we demonstrate is a superspreading event at a slaughterhouse in Gütersloh, North Rhine-Westphalia, Germany. As the COVID-19 situation begins to enter an endemic phase, the study of superspreading events will provide essential information to trace COVID-19 transmission after a mass event.

Figure 1: (a) A workflow of the data pipeline. Hospitals, retirement houses, and schools of Germany, Czechia and Poland report the data of COVID-19 cases, vaccines and tests to the representative government institutions. A daily automatic ETL step is performed to synchronize the data sources and the central database of CASUS. A daily and weekly automatic forecast employing, e.g., the Arima-Holt model, is applied to provide rapid predictions. The predictions and the actual data are shown on the where2test website; (b) The scalable dimensional fact model. Datavalues and datavalue types represent measures, while region types and timeperiod types represent spatial and temporal dimensions, respectively.

2. Methods

2.1. Data Pipeline
Fig. 1a shows the workflow of the data pipeline. Hospitals, retirement houses and schools register the daily number of COVID-19 cases and vaccines with the representative government institutions. In order to consolidate these data, a relational database is built based on the dimensional fact model [46]. Once the relational database is established, a daily automatic extract, transform and load (ETL) step is performed to migrate and integrate the data sources into the PostgreSQL database of CASUS HZDR (see Supplementary information 7.1). Next, we create views based on SQL queries, which our researchers analyse using forecasting and machine learning methods.
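
As an illustration of what such a query-based view could look like, the following is a minimal sketch rather than the schema used in the pipeline. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, and all table columns, the view name `county_daily_infected`, the region IDs, and the case numbers are hypothetical. The view rolls daily municipality values up to the county level through a child-to-parent mapping table, so that users can query county data without pre-processing.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a tiny, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE regions (id TEXT PRIMARY KEY, name TEXT);
        -- child -> parent links: many municipalities map to one county
        CREATE TABLE mapping_regions (
            child_id  TEXT REFERENCES regions(id),
            parent_id TEXT REFERENCES regions(id)
        );
        CREATE TABLE datavalues (
            region_id TEXT REFERENCES regions(id),
            day       TEXT,     -- ISO date string
            kind      TEXT,     -- e.g. 'infected'
            value     INTEGER
        );
        -- The view hides the join and the spatial aggregation from users.
        CREATE VIEW county_daily_infected AS
        SELECT m.parent_id AS region_id, d.day AS day, SUM(d.value) AS value
        FROM datavalues AS d
        JOIN mapping_regions AS m ON m.child_id = d.region_id
        WHERE d.kind = 'infected'
        GROUP BY m.parent_id, d.day;
        """
    )
    # Hypothetical regions: two municipalities inside one county.
    conn.executemany(
        "INSERT INTO regions VALUES (?, ?)",
        [("X-M1", "Municipality 1"), ("X-M2", "Municipality 2"), ("X-K1", "County 1")],
    )
    conn.executemany(
        "INSERT INTO mapping_regions VALUES (?, ?)",
        [("X-M1", "X-K1"), ("X-M2", "X-K1")],
    )
    # Made-up daily case numbers, for illustration only.
    conn.executemany(
        "INSERT INTO datavalues VALUES (?, ?, ?, ?)",
        [
            ("X-M1", "2022-03-07", "infected", 3),
            ("X-M2", "2022-03-07", "infected", 5),
            ("X-M1", "2022-03-08", "infected", 2),
        ],
    )
    return conn


def county_rows(conn: sqlite3.Connection) -> list:
    """Read the aggregated county values from the view, ordered by day."""
    query = "SELECT region_id, day, value FROM county_daily_infected ORDER BY day"
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for row in county_rows(build_demo_db()):
        print(row)
```

A researcher would then only query the view, and the join through the mapping table stays out of the analysis scripts.
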
The tested and completed analysis methods are set in the master stage and the other tested methods are set in the develop stage. Only the forecasting method in the master stage is integrated in the automatic pipeline.

The dimensional fact model is shown in Fig. 1b. The model consists of three main concepts: (i) Facts, which refer to a subject of study (e.g., the study of infected, dead, recovered, hospitalised, tested and vaccinated cases due to COVID-19); (ii) Measures, which refer to the quantitative data of concept (i). The measured data are stored in the tables of datavalues, which contain the number of infected, dead, recovered, hospitalised, tested, and vaccinated cases due to COVID-19 at a given time and place. To date, the schema consists of three datavalues tables, i.e., datavalues of Germany, Czechia and Poland; (iii) Dimensions, which refer to temporal and spatial attributes. As the measured data are provided for a given time and place, the tables of time period types and regions are necessary. The former stores the type of time period, which is either day or week; the latter stores the necessary information on regions, which consists of the name, abbreviation, region ID, region type ID, geometry and population. The table of regions depends on the table of region types. The regions are categorised based on their sizes, in ascending order: municipality, county, state and nation. For Germany, the region types in this order are Gemeinde, Kreise and Bundesland. Similarly, Poland consists of Gmina, Powiat, and Wojewodztwo. In contrast to Germany and Poland, Czechia consists of four levels: Obec, Orp, Okres and Kraj. The spatial and temporal attributes are connected by means of hierarchies that represent a many-to-one relationship between them. The table of mapping_types contains the hierarchical types of the spatial attributes, e.g., for Germany (Gemeinde to Kreise, Kreise to Bundesland), for Czechia (Obec to Orp, Orp to Okres and Okres to Kraj), and for Poland (Gmina to Powiat and Powiat to Wojewodztwo). Next, the many-to-one relationships between those spatial hierarchies are stored in the table of mapping_regions. Moreover, the table of timeperiod_types contains the hierarchical types of the temporal attributes.

Aggregation functions are applicable to the measures along the temporal and spatial dimensions. For the temporal dimension, the weekly data are cumulative 7-day data. For example, a 7-day case count reported on 13.03.2022 is the accumulation of the daily cases for 07-13.03.2022. For the spatial dimension, county data are accumulated municipality data. Beyond accumulating the data from the municipality to the county level, the mapping_regions table makes it possible to accumulate the data from the county to the state level as well as from the state to the nation level. This allows us to scale the pipeline to other areas, provided that municipality data are available from the sources.

2.2. Forecasts
We employ autoregressive integrated moving average (ARIMA) and Holt's linear trend models to forecast the infected, test, and hospitalised data of COVID-19 for Saxony (Germany), Czechia, and Poland. The ARIMA model has been successfully employed in predicting other endemic diseases [47, 48, 49, 50]. The model provides predictions based on time series analysis and is capable of providing short-horizon forecasts for most COVID-19 cases around the world [38, 39, 40, 41, 42, 43]. To make the model consistent and avoid overfitting, the order parameter of the ARIMA model is fixed instead of using the auto ARIMA model. The ARIMA forecast is improved by employing Holt's linear trend model [51]. Holt's model uses the exponential smoothing method to compute the weighted average of past observations [52]. Forecasts from Holt's linear model extrapolate the trend, so the damped parameter is turned on to dampen this trend [53, 54, 52]. A self-defined mix function is used to compute the probability parameter m that combines the forecasts from the two models and minimizes the error. The Box-Cox transformation is used to normalize the input data [55, 52].

Our model initially provided a weekly forecast. In order to capture the daily variation and provide more real-time forecasts, we have built a daily forecast model. As the daily data have a clear weekly variation, seasonal parameters are added to the model; the seasonal ARIMA (SARIMA) and Holt-Winters' seasonal models are employed for the daily forecasts [56, 51, 57]. Similar to the ARIMA model, the seasonal ARIMA model uses fixed order and seasonal parameters. After comparing the errors from multiple methods, the additive method is selected for the Holt-Winters' seasonal model. The mix function is also used for the daily forecasts to combine the forecasts from the two models and improve the forecasting accuracy. As a case study of the (S)Arima-Holt model, we provide the number of infections for Saxony, Germany, in Sec. 3.2. In addition to the (S)ARIMA-Holt model, we employ outlier detection to identify and quantify superspreading events. As suggested in [35], we do so by using outlier detection methods based on time series analysis. The rate of new infections is described by an appropriate model, which could be anything from a simple rolling average to more elaborate ones such as SIR-based models. The residuals of the reported cases are used to identify outliers. At the same time, the residuals can be used to quantify the size of a superspreading event.

3. Results
The pipeline has allowed us to provide the following facilities: (i) The released data hub for dead and infected cases of all counties and states in Germany [58], which allows collaboration between CASUS research staff and external collaborators. The post-processed data serve as the clean data of daily infected and dead cases at county and state levels. In addition, we have also pre-processed the vaccination and hospitalization data for the county and municipal levels; (ii) The daily updated value of the background risk for the optimisation [1] and risk calculator [2] apps, which defines the chance that an average person who lives in the focal area and carries out daily activities will be infected over a one-week period; (iii) Blog posts which report on the current COVID-19 situation in Germany. An interesting example is the relation between the vaccination rate and the 7-day incidence in all states of Germany [59]; (iv) Forecast- and model-based analysis. We explore the case studies mentioned in Sec. 1, and begin by investigating the virus spread across the national borders of Germany, Czechia, and Poland.

3.1. Analysis of the virus spread across the national borders
COVID-19 spreads among people. Therefore, human mobility is one of the most important factors defining the trend of the spatiotemporal spreading of the virus. Understanding human mobility allows us to predict the spatiotemporal character of the spread, evaluate government restrictions, and provide effective non-pharmaceutical interventions. Primarily due to the heterogeneity of the sources and the scope of interest of particular research groups and communities, most COVID-19 research stays within the boundaries of one country. While most human mobility happens within one country or region, the mitigating effect of national borders is generally diminishing, notably in Europe. To study the impact of the national border, several research papers [60, 61] applied various methodologies of geostatistics and geospatial modeling. A more thorough quantification of the effect of border presence and international mobility on the epidemic requires a data storage integrating heterogeneous datasets across more countries.

The presented ODS infrastructure offers the possibility to study the spatiotemporal character of the virus spread on more levels, considering the effect of the national border. First, for our case study comprising Germany, Poland, and Czechia, we explored the correlation between new cases in a region, the distance and the border presence. We observed that neighbouring regions tend to have similar incidence values in the absence of a barrier in the form of a national border between them. This step followed the research of McMahon et al. [62], which showed a strong spatial autocorrelation of incidence values in the USA.

Further, we calculated the average time-lagged pair-wise correlations for each region, considering the regions within a radius of 100 kilometers (i) within the same country and (ii) outside this country. The difference between these values can be seen in Fig. 2. A larger difference represents regions where the incidence correlates much better than for the regions within the same country, indicating a strong national border effect on the virus spread.

Figure 2: Difference in the pair-wise correlations for regions within a 100 kilometer radius inside and outside the country. The red color represents the regions with the strongest difference, indicating the spread of the virus across the national borders.

In the next step [63], we quantified the mitigation effect of the national border in more detail. We picked the state of Saxony in Germany and the neighboring regions in Czechia. For both countries, we collected and integrated the incidence data at the level of single municipalities. For each municipality, we constructed a local regression model which estimated the effect of three parameters, (i) border presence, (ii) municipality size, and (iii) temporal distance from other municipalities, on the spread of the virus. Based on this model, we identified very small-scale areas susceptible to a more intensive international spread of COVID-19.

The top-down approach we selected for the study of the national border effect is possible thanks to the scalability of the implemented dimensional fact model. This principle allows the ODS to comprise various administrative levels and to combine various relevant topics within the perspective of spacetime.

3.2. Weekly and daily forecast of Arima-Holt and Sarima-Holt
For the case study, we provide a short-term forecast of the 7-day incidence up to four horizons, performed on 13-04-2022 using the Arima-Holt model for Saxony, Germany. We used the 13-04-2022 version of the training dataset, which consists of the historical weekly data of Saxony and its counties from 01-03-2020 to 10-04-2022. The weekly data are automatically updated every day and aggregated on Sunday (see Sec. 2.1). Although we update the data daily, for Germany the data of the current and the previous day are unavailable. In addition, the data of the third previous day are still to be updated by the source. When the forecast was performed on Sunday 10-04-2022, the number of infections for that day was lower than the number for the same day in the following day's version. As a result, this produces an inaccurate forecast (see Fig. 3). As the days elapsed, more cases were automatically added and aggregated to the last Sunday's data. Consequently, the forecast performed on 13-04-2022 provides a higher exponent than the ones based on the dataset versions of 10-04-2022 and 11-04-2022. Moreover, the Wednesday dataset is a relatively stable version. Therefore, the forecast is performed every Wednesday, when the source data for the last Sunday are consistent.

Figure 3: 7-day incidence of infected cases, Jan to 8 May 2022, for Saxony, Germany. The black dots denote the historical data, the blue line is a guide to the eye for the historical data, and the green, orange, and red lines denote the results of the forecast using the Arima-Holt model performed on 10-04-2022, 11-04-2022, and 13-04-2022, respectively. The grey area shows the lower and upper limits of the forecast for 13-04-2022.

In order to check the four-horizon forecast, we compare it to the weekly historical data updated on 11-05-2022. The latter consists of relatively stable data from 17-04-2022 to 08-05-2022. As shown in Fig. 3, the weekly historical data are surprisingly in quantitative agreement with the four-horizon forecast. However, this agreement occurs only occasionally. When the forecast is performed on a different day, a deviation from the actual data for the following four horizons is likely to occur. Additional realisations of the Arima-Holt forecast in Saxony and its counties were therefore performed to improve the statistics. The realisations were performed every Wednesday from 05-01-2022 to 18-05-2022, with the version-controlled datasets employed as training and test datasets. An example is the realisation of the forecast on 05-01-2022. We used the weekly data version of 05-01-2022 as its training dataset and the weekly data versions of the following 1st, 2nd, 3rd and 4th week as its test datasets. For each region, we then recorded the deviation of the forecast result from the historical data and quantified it as the mean absolute percentage error (MAPE). As shown in Fig. 4, the weekly Arima-Holt model provides a relatively low MAPE for the first and second horizons. For the third and fourth horizons, however, the range of the MAPE tends to be wider than for the first and second.

Figure 4: Mean absolute percentage error of Arima-Holt (weekly), Sarima-Holt in the presence of the Box-Cox transformation (daily_originT), and Sarima-Holt in the absence of the Box-Cox transformation (daily_originF) for the 1st to 4th horizon.

Therefore, we applied the Sarima-Holt model to improve the forecast performance for the third and fourth horizons. Owing to the daily updated data, the version-controlled daily data are employed for the seasonal parameters. In addition to the daily data, the Sarima-Holt forecast was performed using the same version-controlled weekly data employed for the Arima-Holt model. For the daily data, we removed the data of the current and the two previous days, due to zero values for the current day and yesterday and inconsistent data for the third previous day. We then compared its performance in the presence and the absence of the Box-Cox transformation (BCT) used to normalize the input data. As shown in Fig. 4, the Sarima-Holt model in the absence of the BCT provides a lower MAPE than either the Arima-Holt or the Sarima-Holt model in the presence of the BCT, not only for the first and second horizons but also for the third and fourth horizons.

3.3. Superspreading events
Superspreading events play an important role in the dispersion dynamics of COVID-19 [64]. However, one of the most commonly used epidemiological model types, the compartment models, is not able to accurately capture these events [35, 65]. We are currently working on a solution to this problem by using outlier detection methods at the county level. Many different methods exist, and they can produce more robust results when more than one time series is taken into account. A database as presented in this work is very advantageous, as it makes it convenient to query the reported infections from all neighboring counties and to use these additional data to identify outliers, which might be superspreading events, more robustly. The largest confirmed superspreading event to date in Germany, with 1766 infections, happened in a meat processing facility in the North Rhine-Westphalian district of Gütersloh in June 2020. The facility's environmental conditions combined with the relatively close physical distance between workers were likely the main reason for the efficient aerosol transmission [66]. We take this event as an example to show the result of a Z-score based outlier detection method (Fig. 5).

Figure 5: The officially reported COVID-19 daily incidence per 100,000 inhabitants in the district of Gütersloh. A superspreading event in a meat processing plant in June 2020 is successfully identified by an outlier detection method based on the Z-score (the black dot).

4. Discussions
Our analysis shows that implementing the pipeline with the dimensional fact model has allowed us to migrate the data efficiently on a daily basis, thanks to the spatiotemporal aggregation functions. To provide the weekly data of counties, states, and nations, we only migrate the daily data of municipalities or counties (depending on the data availability of each nation) to the database server, where they are then aggregated to the higher spatiotemporal levels. This model provides more advantages than the nanocubes model [44, 45]. For the nanocubes model, the data of each spatial level (municipality, county, state and nation) and each temporal level (daily and weekly) need to be migrated to the database server. Consequently, this leads to a longer migration process than the one performed using the dimensional fact model. Moreover, the spatiotemporal mapping of the dimensional fact model enables us to perform efficient table joins among national data, which is confirmed by the application in Subsec. 3.1.

The daily updated data provided by the pipeline have allowed us to develop the Sarima-Holt model. The model shows more robust predictions for longer horizons than the Arima-Holt one. More specifically, the Sarima-Holt model in the absence of the BCT outperforms the Arima-Holt model for the third and fourth horizons. This performance is due to the contribution of the seasonal parameters to the model. As a result, the forecast tends to predict better for the third and fourth horizons. In contrast, the Sarima-Holt model in the presence of the BCT performs worse than in its absence, due to the lower variation of the training data after the BCT (see Fig. 7). The Sarima-Holt model is trained on the daily data, and the variation of the data could make the model more sensitive to changes in infections than the Arima-Holt model trained on the weekly data. However, the BCT reduces the variation of the daily data, and consequently the daily forecasts perform worse than in the absence of the BCT.

5. Conclusion
Our work has demonstrated the utility of the data pipeline for top-down spatiotemporal analysis. We have first shown the macroscopic analysis, in which the investigation of the virus spread across the national borders is presented. At a more microscopic level, we have demonstrated a data-driven approach, enabled by the pipeline, which is applied to the prevalence at the county level. The daily updated data have improved the precision of the model for longer horizons. These data-driven epidemic models provide more realistic forecast results than either the parsimonious model [34] or the agent-based method with a larger number of parameters [27], owing to the use of daily updated data. This may contribute to public health policy making, including the work of public health forecasting teams. Last but not least, moving to a lower regional level, we have demonstrated that the outlier model is able to capture the superspreading event which occurred in 2020. These results show that our work is capable of performing top-down analysis as well as rapid and precise forecasts thanks to the pipeline.

6. Data sources
• COVID-19 data for Germany, Czechia and Poland.
  – Robert Koch Institute
  – Czech Ministry of Health
  – Polish Ministry of Health
  – Age-based hospitalisation at state level for Germany (https://github.com/KITmetricslab/hospitalization-nowcast-hub/blob/main/data-truth/COVID-19/).
  – Age-based and type-based doses of vaccine at county level (https://github.com/robert-koch-institut/COVID-19-Impfungen_in_Deutschland/blob/master/Aktuell_Deutschland_Landkreise_COVID-19-Impfungen.csv).
  – COVID-19 infected, recovered, hospitalised and dead cases of Dresden (http://daten.dresden.de/duva2ckan/files/de-sn-dresden-corona_-_covid-19_-_fallzahlen_md1_dresden_2020ff/content).
  – COVID-19 infected, dead, and test cases of Czechia at municipality level (https://onemocneni-aktualne.mzcr.cz/api/v2/covid-19/).
  – Age-based and gender-based infected and dead cases at county level of Germany (https://experience.arcgis.com/experience/478220a4c454480e823b17327b2bf1d4).
  – COVID-19 cases at municipality level of Saxony, Germany (https://www.coronavirus.sachsen.de/corona-statistics/rest/infectionOverview.jsp).
  – COVID-19 cases at county level of Saxony, Germany (https://media.githubusercontent.com/media/robert-koch-institut/SARS-CoV-2_Infektionen_in_Deutschland/master/Aktuell_Deutschland_SarsCov2_Infektionen.csv).
  – COVID-19 infected, dead, and test cases at county level of Poland (https://wojewodztwa-rcb-gis.hub.arcgis.com/pages/dane-do-pobrania).
  – COVID-19 vaccine at county level of Poland (https://www.gov.pl/web/szczepimysie/raport-szczepien-przeciwko-covid-19).
  – COVID-19 types in Sachsen (https://www.coronavirus.sachsen.de/infektionsfaelle-in-sachsen-4151.html).
• Dictionaries of regions.
  – Administrative areas in Germany (https://gdz.bkg.bund.de/index.php/default/digitale-geodaten/verwaltungsgebiete.html).
  – Administrative areas in Poland (https://gis-support.pl/baza-wiedzy-2/dane-do-pobrania/granice-administracyjne/).
  – Administrative areas in Czechia (https://geoportal.cuzk.cz/(S(1nhx02lray0vkrhce1y2d53d))/Default.aspx?mode=TextMeta&text=dSady_RUIAN&side=dSady_RUIAN).
  – Population numbers in Czech municipalities (https://www.czso.cz/csu/czso/pocet-obyvatel-v-obcich-k-112021).
  – Postal codes in Germany (https://www.geonames.org/postal-codes/postleitzahlen-deutschland.html).
  – Population numbers in Poland (https://stat.gov.pl/obszary-tematyczne/ludnosc/ludnosc/ludnosc-stan-i-struktura-ludnosci-oraz-ruch-naturalny-w-przekroju-terytorialnym-stan-w-dniu-30-06-2021,6,30.html).

7. Supplementary information

7.1. Data workflow
We use https://www.talend.com/products/talend-open-studio/ to perform data migration. The migration between the data sources and the PostgreSQL database of CASUS HZDR has been performed as follows:

Figure 6: Data workflow of the ETL process (see text for its description).

1. Data acquisition
The data are automatically downloaded from the sources listed in Sec. 6. They are subsequently stored in the repository of the where2test server. The downloaded data serve as the inputs of the migration process.

2. Dictionaries and data augmentation
To integrate and further augment data from heterogeneous sources (various forms, schemas, and temporal and spatial extents), we needed to prepare a list of dictionaries. We formed a dictionary for each spatial level in every country to cover all regions in our datasets. Here we included the unique region ID, all alternative names, full names, geometries, and population numbers. This concept can be further extended to other values, such as socioeconomic parameters and further information about the region. In this way we are able to maintain consistency across all datasets and enable their integration. The list of sources used for building the dictionaries can be found in Sec. 6 (Data sources).

3. Data cleaning
We first migrate timeperiod_types, region_types, datavalues_types, and mapping_types. While migrating the data to those tables, the primary keys are automatically set by a transformer (the script which migrates the data to the PostgreSQL database). Next, the primary keys of those tables serve as the foreign keys of other tables, following the table relations shown in Fig. 1b. An example is the table of regions, which contains the intrinsic IDs set by the representative governments. In order to differentiate the IDs among Germany, Czechia and Poland, we add 'DE', 'CZ', and 'PL', respectively, followed by the intrinsic ID. For the table of regions, the primary key of region_types serves as its foreign key. The intrinsic IDs are categorised based on the ID of the region types. A specific example is Dresden, whose intrinsic ID is 14162. After the cleaning process, the ID will be DE 14162 and categorised under the region type Kreise.
Having migrated the data to the aforementioned tables, the table of mapping_regions is filled with the spatial-relation data. It contains the foreign key of the mapping type ID. An example is the county of Dresden. Dresden is mapped onto the state of Saxony and categorised under the mapping type Kreis_To_Bundesland. Next, the tables of datavalues for the nations are filled with the input data. The datavalues table contains three foreign keys which originate from the tables of timeperiod_types, regions, and datavalues_types. With these foreign keys, a data merging process is feasible, which is described in the following item.

4. Data merging
In addition to the aforementioned three foreign keys, the date is set as the fourth attribute, which allows us to perform data merging through an inner join of tables. The inner join is employed to merge cleanly and to avoid duplicated data in the table of datavalues. For instance, daily infected data of the lowest-level region for a period of dates are migrated to the table of datavalues_germany. When the data sources are updated, they sometimes update the cases of elapsed dates. The inner join method allows us to automatically update the value of an elapsed date with the latest value. Moreover, when new data with the latest date are available from the source, it allows automatic addition of the data to the table.

5. Data aggregation
The presence of daily data for the lowest-level regions allows us to perform both temporal and spatial aggregations. Using functions, the temporal aggregation from daily to weekly periods is feasible. Moreover, as mentioned in Sec. 2.1, the spatial aggregation from the low to the high

Figure 7: Time series of daily infected cases from Aug. 5, 2020 to Apr. 30, 2022 (a) before and (b) after Box-Cox transformation, respectively.

Research (BMBF) and by the Saxon Ministry for Science, Culture and Tourism (SMWK) with tax funds on the basis of the budget approved by the Saxon State Parliament. We thank Jens Steiner for providing us with a virtual server of HZDR.

References
[1] M. Davoodi, A. Batista, A. Senapati, W. Schlechte-Welnicz, B. Wagner, J. M. Calabrese, Modeling COVID-19 optimal testing strategies in long-term care facilities: An optimization-based approach, arXiv (2022). URL: https://arxiv.org/abs/2204.02062. doi:10.48550/ARXIV.2204.02062.
[2] M. Davoodi, A. Senapati, A. Mertel, W. Schlechte-Welnicz, J. M. Calabrese, Optimal Workplace Occupancy Strategies during the COVID-19 Pandemic, arXiv (2022). URL: https://arxiv.org/abs/2204.01444. doi:10.48550/ARXIV.2204.01444.
[3] J. L. Raisaro, et al., SCOR: A secure international informatics infrastructure to investigate COVID-19, Journal of the American Medical Informatics Association 27 (2020) 1721–1726. doi:10.1093/jamia/ocaa172.
[4] A. v. Wangenheim, A. Savaris, A. F.
Borgatto, region level is allowable in the presence of the A. d. S. Inácio, Integrating Online Georefer- mapping_regions table. enced Epidemiological Analysis and Visualization into a Telemedicine Infrastructure – First Results, medRxiv (2019). URL: https://www.medrxiv.org/co 7.2. Additional forecasting results ntent/10.1101/19000554v1.full. [5] C. Fähnrich, others, Surveillance and Outbreak Acknowledgments Response Management System (SORMAS) to sup- port the control of the Ebola virus disease out- This work was partially funded by the Center of Ad- break in West Africa, Euro Surveill 20 (2015) 21071. vanced Systems Understanding (CASUS), which is fi- doi:https://doi.org/10.2807/1560-7917. nanced by Germany’s Federal Ministry of Education and es2015.20.12.21071. 71 [6] R. N. Smith, others, InterMine: a flexible data ware- /cases. house system for the integration and analysis of [18] g. maryland, Coronavirus Disease 2019 (COVID-19) heterogeneous biological data, Bioinformatics 28 Outbreak, 2022. URL: https://coronavirus.maryland (2012) 3163–3165. doi:https://doi.org/10.1 .gov. 093/bioinformatics/bts577. [19] c. d. rki, Robert Koch-Institut: COVID-19- [7] C. Pfander, B. Anar, F. Schwach, T. D. Otto, M. Bro- Dashboard, 2022. URL: https://experience.arcgi chet, K. Volkmann, M. A. Quail, A. Pain, B. Rosen, s.com/experience/478220a4c454480e823b17327b2 W. Skarnes, J. C. Rayner, O. Billker, A scalable bf1d4/page/Landkreise/. pipeline for highly effective genetic modification of [20] c. dresden, Corona-Dashboard Dresden, 2022. URL: a malaria parasite, Nature Methods 8 (2011) 1078– https://experience.arcgis.com/experience/d2386f3 1082. URL: https://doi.org/10.1038/nmeth.1742. 214c1451c81b242be69bb3d50. doi:10.1038/nmeth.1742. [21] E. Dong, H. Du, L. Gardner, An interactive web- [8] P. Kostkova, others, Data and Digital Solutions based dashboard to track COVID-19 in real time, to Support Surveillance Strategies in the Context The Lancet Infectious Diseases 20 (2020) 533–534. 
of the COVID-19 Pandemic, Frontiers in Digital URL: https://doi.org/10.1016/S1473-3099(20)301 Health 3 (2021). doi:https://doi.org/10.338 20-1. doi:10.1016/S1473-3099(20)30120-1, 9/fdgth.2021.707902. publisher: Elsevier. [9] J. Budd, others, Digital technologies in the public- [22] D. Sha, Y. Liu, Q. Liu, Y. Li, Y. Tian, F. Beaini, health response to COVID-19, Nature medicine 26 C. Zhong, T. Hu, Z. Wang, H. Lan, Y. Zhou, Z. Zhang, (2020) 1183–1192. URL: https://www.nature.com/a C. Yang, A spatiotemporal data collection of viral rticles/s41591-020-1011-4. cases for COVID-19 rapid response, Big Earth Data [10] F. A. Binti Hamzah, C. Hau, H. Nazri, D. Ligot, 5 (2021) 90–111. URL: https://doi.org/10.1080/2096 G. Lee, M. Shaib, U. Zaidon, A. Abdullah, M. H. 4471.2020.1844934. doi:10.1080/20964471.202 Chung, C. Ong, P. Chew, CoronaTracker: World- 0.1844934, publisher: Taylor & Francis. wide COVID-19 Outbreak Data Analysis and Pre- [23] Han Xiaoyi, Xu Yilan, Fan Linlin, Huang Yi, Xu diction (2020). doi:10.2471/BLT.20.255695. Minhong, Gao Song, Quantifying COVID-19 impor- [11] E. Centre, European Centre for Disease Prevention tation risk in a dynamic network of domestic cities and Control, 2022. URL: https://qap.ecdc.europa.eu and international countries, Proceedings of the Na- /public/extensions/covid-19/covid-19.html#globa tional Academy of Sciences 118 (2021) e2100201118. l-overview-tab. URL: https://doi.org/10.1073/pnas.2100201118. [12] A. Naqvi, COVID-19 European regional tracker, doi:10.1073/pnas.2100201118, publisher: Pro- Scientific Data 8 (2021) 181. URL: https://doi.org/10 ceedings of the National Academy of Sciences. .1038/s41597-021-00950-7. doi:10.1038/s41597 [24] D. Laroze, E. Neumayer, T. Plümper, COVID-19 -021-00950-7. does not stop at open borders: Spatial contagion [13] c. eudata, covid19-eu-data, 2020. URL: https://gith among local authority districts during England’s ub.com/covid19-eu-zh/covid19-eu-data. 
first wave, Social Science & Medicine 270 (2021) [14] c.-. latinoamerica, Latin America Covid-19 Data 113655. URL: https://www.sciencedirect.com/scie Repository by DSRP, 2020. URL: https://github.com nce/article/pii/S0277953620308741. doi:10.1016/ /DataScienceResearchPeru/covid-19_latinoameric j.socscimed.2020.113655. a. [25] M. Grimée, M. Bekker-Nielsen Dunbar, F. Hofmann, [15] G. Agapito, C. Zucco, M. Cannataro, COVID- L. Held, Modelling the effect of a border closure WAREHOUSE: A Data Warehouse of Italian COVID- between Switzerland and Italy on the spatiotem- 19, Pollution, and Climate Data, Environmental poral spread of COVID-19 in Switzerland, Spatial Research and Public Health 17 (2020). doi:https: Statistics (2021) 100552. URL: https://www.scienced //doi.org/10.3390/ijerph17155596. irect.com/science/article/pii/S2211675321000622. [16] R. K. Arora, A. Joseph, J. Van Wyk, S. Rocco, A. At- doi:10.1016/j.spasta.2021.100552. maja, E. May, T. Yan, N. Bobrovitz, J. Chevrier, [26] M. P. Hossain, A. Junus, X. Zhu, P. Jia, T.-H. Wen, M. P. Cheng, T. Williamson, D. L. Buckeridge, Sero- D. Pfeiffer, H.-Y. Yuan, The effects of border control Tracker: a global SARS-CoV-2 seroprevalence dash- and quarantine measures on the spread of COVID- board, The Lancet Infectious Diseases 21 (2021) 19, Epidemics 32 (2020) 100397. URL: https://ww e75–e76. URL: https://doi.org/10.1016/S1473-309 w.sciencedirect.com/science/article/pii/S1755436 9(20)30631-9. doi:10.1016/S1473-3099(20)3 520300244. doi:10.1016/j.epidem.2020.1003 0631-9, publisher: Elsevier. 97. [17] c. GovUK, Interactive map of cases, 2022. URL: https: [27] Q.-H. Liu, others, Model-based evaluation of al- //coronavirus.data.gov.uk/details/interactive-map ternative reactive class closure strategies against 72 COVID-19, Nat. Com. 13 (2022). doi:10.1038/s4 09945. 1467-021-27939-5. [38] S. Roy, G. S. Bhunia, P. K. Shit, Spatial prediction [28] H. 
Bastani, others, Efficient and targeted COVID-19 of COVID-19 epidemic using ARIMA techniques in border testing via reinforcement learning, Nature India, Modeling Earth Systems and Environment 7 559 (2021). URL: https://www.nature.com/articles/ (2021) 1385–1391. URL: https://doi.org/10.1007/s4 s41586-021-04014-z. 0808-020-00890-y. doi:10.1007/s40808-020-0 [29] S. Flaxman, others, Estimating the effects of non- 0890-y. pharmaceutical interventions on COVID-19 in Eu- [39] M.-J. Geng, H.-Y. Zhang, L.-J. Yu, C.-L. Lv, T. Wang, rope, Nature 584 (2020) 257. URL: https://www.na T.-L. Che, Q. Xu, B.-G. Jiang, J.-J. Chen, S. I. Hay, ture.com/articles/s41586-020-2405-7. Z.-J. Li, G. F. Gao, L.-P. Wang, Y. Yang, L.-Q. Fang, [30] N. Haug, others, Ranking the effectiveness of world- W. Liu, Changes in notifiable infectious disease wide COVID-19 government interventions, Nature incidence in China during the COVID-19 pandemic, Human Behaviour 4 (2020) 1303–1312. URL: https: Nature Communications 12 (2021) 6923. URL: https: //www.nature.com/articles/s41562-020-01009-0. //doi.org/10.1038/s41467-021-27292-7. doi:10.103 [31] A. Liu, L. Vici, V. Ramos, S. Giannoni, A. Blake, Vis- 8/s41467-021-27292-7. itor arrivals forecasts amid COVID-19: A perspec- [40] Y. Wang, C. Xu, S. Yao, L. Wang, Y. Zhao, J. Ren, tive from the Europe team, Annals of Tourism Re- Y. Li, Estimating the COVID-19 prevalence and search 88 (2021) 103182. URL: https://www.scienced mortality using a novel data-driven hybrid model irect.com/science/article/pii/S016073832100044X. based on ensemble empirical mode decomposition, doi:10.1016/j.annals.2021.103182. Scientific Reports 11 (2021) 21413. URL: https://doi. [32] S. Lai, others, Effect of non-pharmaceutical inter- org/10.1038/s41598-021-00948-6. doi:10.1038/s4 ventions to contain COVID-19 in China, Nature 1598-021-00948-6. 585 (2020) 410. URL: https://www.nature.com/artic [41] V. K. Sharma, U. Nigam, Modeling and Forecasting les/s41586-020-2293-x. 
of COVID-19 Growth Curve in India, Transactions [33] D. Fanelli, F. Piazza, Analysis and forecast of of the Indian National Academy of Engineering 5 COVID-19 spreading in China, Italy and France, (2020) 697–710. URL: https://doi.org/10.1007/s414 Chaos, Solitons & Fractals 134 (2020) 109761. URL: 03-020-00165-z. doi:10.1007/s41403-020-001 https://www.sciencedirect.com/science/article/pi 65-z. i/S0960077920301636. doi:10.1016/j.chaos.20 [42] A. K. Sahai, N. Rath, V. Sood, M. P. Singh, ARIMA 20.109761. modelling & forecasting of COVID-19 in top five [34] Bertozzi Andrea L., Franco Elisa, Mohler George, affected countries, Diabetes & Metabolic Syndrome: Short Martin B., Sledge Daniel, The challenges of Clinical Research & Reviews 14 (2020) 1419–1427. modeling and forecasting the spread of COVID-19, URL: https://www.sciencedirect.com/science/arti Proceedings of the National Academy of Sciences cle/pii/S1871402120302903. doi:10.1016/j.dsx. 117 (2020) 16732–16738. URL: https://doi.org/10 2020.07.042. .1073/pnas.2006520117. doi:10.1073/pnas.200 [43] D. Benvenuto, M. Giovanetti, L. Vassallo, S. An- 6520117, publisher: Proceedings of the National geletti, M. Ciccozzi, Application of the ARIMA Academy of Sciences. model on the COVID-2019 epidemic dataset, Data [35] L. Schüler, J. M. Calabrese, S. Attinger, Data driven in Brief 29 (2020) 105340. URL: https://www.scienc high resolution modeling and spatial analyses of edirect.com/science/article/pii/S235234092030234 the COVID-19 pandemic in Germany, PLOS ONE 1. doi:10.1016/j.dib.2020.105340. 16 (2021) e0254660. URL: https://doi.org/10.1371/ [44] L. Lins, J. T. Klosowski, C. Scheidegger, Nanocubes journal.pone.0254660. doi:10.1371/journal.po for Real-Time Exploration of Spatiotemporal ne.0254660, publisher: Public Library of Science. Datasets, IEEE Transactions on Visualization and [36] I. Rahimi, F. Chen, A. H. Gandomi, A review on Computer Graphics 19 (2013) 2456–2465. 
doi:10.1 COVID-19 forecasting models, Neural Computing 109/TVCG.2013.179. and Applications (2021). URL: https://doi.org/10.1 [45] A. Bosworth, J. Gray, A. Layman, H. Pirahesh, Data 007/s00521-020-05626-8. doi:10.1007/s00521-0 Cube: A Relational Aggregation Operator General- 20-05626-8. izing Group-By, Cross-Tab, and Sub-Totals, Tech- [37] R. Salgotra, M. Gandomi, A. H. Gandomi, Time nical Report MSR-TR-95-22, Institute of Electrical Series Analysis and Forecast of the COVID-19 Pan- and Electronics Engineers, Inc., 1995. URL: https: demic in India using Genetic Programming, Chaos, //www.microsoft.com/en-us/research/publication Solitons & Fractals 138 (2020) 109945. URL: https: /data-cube-a-relational-aggregation-operator-g //www.sciencedirect.com/science/article/pii/S096 eneralizing-group-by-cross-tab-and-sub-totals/. 0077920303441. doi:10.1016/j.chaos.2020.1 [46] M. Golfarelli, D. Mario, S. Rizzi, The dimensional 73 fact model: a conceptual model for data warehouses, American Statistical Association 77 (1982) 63–70. International Journal of Cooperative Information URL: https://www.tandfonline.com/doi/abs/10.108 Systems 7 (1998) 215–247. doi:https://doi.or 0/01621459.1982.10477767. doi:10.1080/016214 g/10.1142/S0218843098000118. 59.1982.10477767, publisher: Taylor & Francis. [47] E. O. Nsoesie, O. Oladeji, A. S. A. Abah, M. L. Ndeffo- [57] P. R. Winters, Forecasting Sales by Exponentially Mbah, Forecasting influenza-like illness trends in Weighted Moving Averages, Management Science 6 Cameroon using Google Search Data, Scientific (1960) 324–342. URL: https://doi.org/10.1287/mnsc Reports 11 (2021) 6713. URL: https://doi.org/10.1 .6.3.324. doi:10.1287/mnsc.6.3.324, publisher: 038/s41598-021-85987-9. doi:10.1038/s41598-0 INFORMS. 21-85987-9. [58] W. Abdussalam, Post-processing data of daily dead [48] Y. Chen, Y. Zhang, Z. Xu, X. Wang, J. Lu, W. Hu, and infected COVID-19 in Germany (2022). 
URL: Avian Influenza A (H7N9) and related Internet https://zenodo.org/badge/latestdoi/462876343. search query data in China, Scientific Reports 9 doi:DOI:10.5281/zenodo.6336637. (2019) 10434. URL: https://doi.org/10.1038/s41598 [59] A. Mertel, M. Laqua, Where2Test visualization high- -019-46898-y. doi:10.1038/s41598-019-46898 lights strong link between pace of vaccinations and -y. incidences, 2022. URL: https://www.where2test.de/ [49] Z. He, H. Tao, Epidemiology and ARIMA model blog#vaccination-maps. of positive-rate of influenza viruses among chil- [60] M. Eckardt, K. Kappner, N. Wolf, Covid-19 across eu- dren in Wuhan, China: A nine-year retrospective ropean regions: The role of border controls (2020). study, International Journal of Infectious Diseases [61] M. Grimée, M. B.-N. Dunbar, F. Hofmann, L. Held, 74 (2018) 61–70. URL: https://www.sciencedirec et al., Modelling the effect of a border closure be- t.com/science/article/pii/S1201971218344618. tween switzerland and italy on the spatiotemporal doi:10.1016/j.ijid.2018.07.003. spread of covid-19 in switzerland, Spatial statistics [50] Q. Zeng, D. Li, G. Huang, J. Xia, X. Wang, Y. Zhang, (2021) 100552. W. Tang, H. Zhou, Time series analysis of temporal [62] T. McMahon, A. Chan, S. Havlin, L. K. Gallos, Spa- trends in the pertussis incidence in Mainland China tial correlations in geographical spreading of covid- from 2005 to 2016, Scientific Reports 6 (2016) 32367. 19 in the united states, Scientific Reports 12 (2022) URL: https://doi.org/10.1038/srep32367. doi:10.1 1–10. 038/srep32367. [63] A. Mertel, J. Vyskočil, L. Schüler, W. Schlechte- [51] C. C. Holt, Forecasting seasonals and trends by Wełnicz, J. M. Calabrese, Fine-scale variation in the exponentially weighted moving averages, Interna- effect of national border on covid-19 spread: A case tional Journal of Forecasting 20 (2004) 5–10. URL: study of the saxon-czech border region, medRxiv https://www.sciencedirect.com/science/article/pi (2022). 
i/S0169207003001134. doi:10.1016/j.ijforeca [64] J. E. Lemieux, K. J. Siddle, B. M. Shaw, C. Loreth, S. F. st.2003.09.015. Schaffner, A. Gladden-Young, G. Adams, T. Fink, [52] R. J. Hyndman, G. Athanasopoulos, Forecasting: C. H. Tomkins-Tinch, L. A. Krasilnikova, K. C. Principles and Practice., OTexts, 2018. URL: https: DeRuff, M. Rudy, M. R. Bauer, K. A. Lagerborg, //otexts.com/fpp2/. E. Normandin, S. B. Chapman, S. K. Reilly, M. N. [53] E. S. Gardner, E. McKenzie, Why the damped trend Anahtar, A. E. Lin, A. Carter, C. Myhrvold, M. E. works, Journal of the Operational Research Society Kemball, S. Chaluvadi, C. Cusick, K. Flowers, 62 (2011) 1177–1180. URL: https://doi.org/10.1 A. Neumann, F. Cerrato, M. Farhat, D. Slater, J. B. 057/jors.2010.37. doi:10.1057/jors.2010.37, Harris, J. Branda, D. Hooper, J. M. Gaeta, T. P. publisher: Taylor & Francis. Baggett, J. O’Connell, A. Gnirke, T. D. Lieberman, [54] E. S. Gardner, E. Mckenzie, Forecasting Trends in A. Philippakis, M. Burns, C. M. Brown, J. Luban, E. T. Time Series, Management Science 31 (1985) 1237– Ryan, S. E. Turbett, R. C. LaRocque, W. P. Hanage, 1246. URL: https://doi.org/10.1287/mnsc.31.10.1 G. R. Gallagher, L. C. Madoff, S. Smole, V. M. Pierce, 237. doi:10.1287/mnsc.31.10.1237, publisher: E. Rosenberg, P. C. Sabeti, D. J. Park, B. L. Maclnnis, INFORMS. Phylogenetic analysis of SARS-CoV-2 in the Boston [55] V. M. Guerrero, Time-series analysis supported area highlights the role of recurrent importation by power transformations, Journal of Forecasting and superspreading events, preprint, Epidemiology, 12 (1993) 37–48. URL: https://doi.org/10.1002/fo 2020. doi:10.1101/2020.08.23.20178236. r.3980120104. doi:10.1002/for.3980120104, [65] G. B. Libotte, L. dos Anjos, R. C. Almeida, S. M. C. publisher: John Wiley & Sons, Ltd. Malta, R. S. Silva, Framework for enhancing the [56] S. C. Hillmer, G. C. 
Tiao, An ARIMA-Model-Based estimation of model parameters for data with a high Approach to Seasonal Adjustment, Journal of the level of uncertainty, preprint, Epidemiology, 2020. 74 URL: http://medrxiv.org/lookup/doi/10.1101/2020. 12.17.20248389. doi:10.1101/2020.12.17.202 48389. [66] T. Günther, M. Czech-Sioli, D. Indenbirken, A. Ro- bitaille, P. Tenhaken, M. Exner, M. Ottinger, N. Fis- cher, A. Grundhoff, M. M. Brinkmann, SARS-CoV-2 outbreak investigation in a German meat process- ing plant, EMBO Molecular Medicine 12 (2020) e13296. URL: https://doi.org/10.15252/emmm.20 2013296. doi:10.15252/emmm.202013296, pub- lisher: John Wiley & Sons, Ltd. 75
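
As an illustration of the data cleaning item of the data workflow (Sec. 7.1), the prefixing of intrinsic region IDs with a country code can be sketched in Python as follows. This is a minimal sketch, not the Talend job used in the pipeline: the names COUNTRY_PREFIXES, Region and make_region_id, the region_types rows, and the choice of a space as separator (taken from the printed example "DE 14162") are all assumptions made for illustration.

```python
from typing import NamedTuple

# Country codes as given in the text; the dictionary itself is hypothetical.
COUNTRY_PREFIXES = {"Germany": "DE", "Czechia": "CZ", "Poland": "PL"}


class Region(NamedTuple):
    region_id: str       # country prefix followed by the intrinsic ID
    region_type_id: int  # foreign key into region_types
    name: str


def make_region_id(country: str, intrinsic_id: str, sep: str = " ") -> str:
    """Prefix a government-issued ID with its country code.

    The separator is an assumption; pass sep="" for IDs without a space.
    """
    return f"{COUNTRY_PREFIXES[country]}{sep}{intrinsic_id}"


# Hypothetical region_types rows: primary key -> label.
region_types = {1: "Bundesland", 2: "Kreis"}

dresden = Region(make_region_id("Germany", "14162"), 2, "Dresden")
print(dresden.region_id, region_types[dresden.region_type_id])
```

Because the prefix makes IDs unique across countries, the same regions table can hold entries from all three countries without collisions between national numbering schemes.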
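
The behaviour described in the data merging item of Sec. 7.1 (revised values for past dates overwrite the stored ones, and rows with new dates are appended) can be sketched as below. This is a plain-Python upsert keyed on the four attributes, given only to make the logic concrete; it is not the SQL inner join executed in the pipeline. The function name merge_datavalues, the type IDs, and all numbers are made up for illustration.

```python
from datetime import date
from typing import Dict, Tuple

# Key: (timeperiod_type_id, region_id, datavalue_type_id, date)
Key = Tuple[int, str, int, date]


def merge_datavalues(table: Dict[Key, float],
                     incoming: Dict[Key, float]) -> Tuple[int, int]:
    """Merge freshly downloaded rows into the stored table in place.

    Rows whose key already exists are overwritten when the value changed;
    rows with a new key are added. Returns (updated, inserted) counts.
    """
    updated = inserted = 0
    for key, value in incoming.items():
        if key in table:
            if table[key] != value:
                table[key] = value
                updated += 1
        else:
            table[key] = value
            inserted += 1
    return updated, inserted


DAILY, INFECTED = 1, 1  # hypothetical type IDs
stored = {
    (DAILY, "DE 14162", INFECTED, date(2022, 4, 28)): 120.0,
    (DAILY, "DE 14162", INFECTED, date(2022, 4, 29)): 95.0,
}
download = {
    (DAILY, "DE 14162", INFECTED, date(2022, 4, 29)): 101.0,  # revised past value
    (DAILY, "DE 14162", INFECTED, date(2022, 4, 30)): 88.0,   # new date
}
result = merge_datavalues(stored, download)
print(result)  # (1, 1)
```

Keying on all four attributes is what prevents duplicates: a second download of the same day for the same region and value type can only overwrite, never add a second row.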
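
The two aggregations named in the data aggregation item of Sec. 7.1 can be sketched in the same style. The sketch assumes that weekly periods are ISO calendar weeks and that the content of mapping_regions can be represented as a child-to-parent dictionary; neither is confirmed by the text, and the function names, the region IDs and the values are illustrative only.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Tuple

Daily = Dict[Tuple[str, date], float]  # (region_id, day) -> value


def aggregate_weekly(daily: Daily) -> Dict[Tuple[str, int, int], float]:
    """Sum daily values per region and ISO (year, week)."""
    weekly: Dict[Tuple[str, int, int], float] = defaultdict(float)
    for (region_id, day), value in daily.items():
        iso = day.isocalendar()  # (ISO year, ISO week, ISO weekday)
        weekly[(region_id, iso[0], iso[1])] += value
    return dict(weekly)


def aggregate_spatially(daily: Daily, child_to_parent: Dict[str, str]) -> Daily:
    """Sum the daily values of child regions into their parent region."""
    totals: Daily = defaultdict(float)
    for (region_id, day), value in daily.items():
        totals[(child_to_parent[region_id], day)] += value
    return dict(totals)


# Made-up daily values for two counties.
daily = {
    ("DE 14162", date(2022, 4, 25)): 10.0,
    ("DE 14162", date(2022, 4, 26)): 5.0,
    ("DE 14162", date(2022, 5, 2)): 7.0,
    ("DE 14628", date(2022, 4, 25)): 3.0,
}
# Hypothetical Kreis_To_Bundesland mapping.
kreis_to_bundesland = {"DE 14162": "DE 14", "DE 14628": "DE 14"}

weekly = aggregate_weekly(daily)
by_state = aggregate_spatially(daily, kreis_to_bundesland)
```

Storing only the lowest-level daily values and deriving everything else keeps the coarser series consistent with the fine-grained ones after every update.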
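
For readers unfamiliar with the Box-Cox transformation (BCT) referred to in the forecasting results and in Fig. 7, the standard one-parameter form is y(λ) = (y^λ − 1)/λ for λ ≠ 0 and ln y for λ = 0. The sketch below implements this textbook definition and its inverse; it does not reproduce how λ was chosen for the models in this work, nor how days with zero cases were handled, both of which are left open here.

```python
import math
from typing import Iterable, List


def box_cox(values: Iterable[float], lam: float) -> List[float]:
    """Apply the one-parameter Box-Cox transform to strictly positive values."""
    out = []
    for y in values:
        if y <= 0:
            raise ValueError("Box-Cox requires strictly positive values")
        out.append(math.log(y) if lam == 0 else (y ** lam - 1.0) / lam)
    return out


def inverse_box_cox(values: Iterable[float], lam: float) -> List[float]:
    """Map transformed values back to the original scale."""
    return [math.exp(z) if lam == 0 else (lam * z + 1.0) ** (1.0 / lam)
            for z in values]


transformed = box_cox([1.0, 4.0, 100.0], lam=0.5)
restored = inverse_box_cox(transformed, lam=0.5)
```

With λ < 1 the transform compresses large values more strongly than small ones, which is consistent with the reduced variation of the training data after the BCT noted in the text.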