Statistical data governance based on the SDMX standard

Haïrou-Dine BIAO K.*,† and Emery ASSOGBA†
Department of Computer Engineering and Telecommunications, EPAC / University of Abomey-Calavi, Abomey-Calavi, Benin
* Corresponding author: dineb90@gmail.com (H. B. K.); emery.assogba@uac.bj (E. ASSOGBA)
† These authors contributed equally.
International Conference of Information and Communication Technologies of ANSALB (CITA): Security issues in the age of AI, June 27-28, 2024, Cotonou, BENIN

Abstract

Statistics are essential for the development of a nation. With the rise of technologies such as AI and big data, efficient data governance is becoming more and more important for meeting the challenges and seizing the opportunities they bring. Unfortunately, most databases in our public and private companies and organizations lack interoperability. This work proposes a statistical data governance mechanism based on the Statistical Data and Metadata eXchange (SDMX) standard, designed specifically for sharing and exchanging statistical data between organizations. We designed and implemented a statistical database based on SDMX. This system allows more than ten Beninese public organizations to produce, publish and share statistical data on various themes. They can express indicators and the disaggregation levels of these indicators in a flexible way, without having to create a new database.

Keywords: statistical data, database, interoperability, SDMX

1. Introduction

Today, with digitization and increasing information exchange, statistics play an essential role in the development of nations [1]. To gain insight from data, the data must be collected, validated, published and processed. This is made possible by building databases, and applications over these databases to access the data. Unfortunately, the multiplicity of these databases does not allow for efficient data governance, because it makes it more difficult to exchange and maintain data between different systems. This work consists in setting up a statistical data governance framework based on the Statistical Data and Metadata eXchange (SDMX) standard, thus enabling other platforms implementing this standard to easily consume the data produced by this database, guaranteeing a high degree of interoperability and reducing the number of databases needed to collect statistical data.

2. Background and state of the art

The rapid advent of information technology has led to a massive explosion of data, creating unprecedented opportunities but also posing complex governance challenges.

2.1. Data governance

Data governance is defined as "an overall framework within the company for assigning decision-related rights and duties in order to manage data appropriately as a corporate asset" [2]. It is therefore a set of principles designed to manage the entire data lifecycle, from acquisition to disposal, including use. Good data governance facilitates exchange and compatibility between different systems and organizations; it thus promotes greater interoperability.

2.2. Interoperability

IEEE defines interoperability as "the ability of two or more systems or components to exchange information and to use the information that has been exchanged" [3]. A specific challenge to interoperability arises from the fact that there is generally no single way of representing information. Thus, the same information content is often represented in different (usually incompatible) ways across different systems and organizations [4]. Data interoperability therefore requires not only the use of standards and metadata, but also the provision of standardized datasets in formats that can be accessed by both humans and machines.

International standards exist for this purpose:

• Open data
• Statistical Data and Metadata eXchange (SDMX)

2.2.1. Open data

Open data refers to data that is freely available for everyone to use, modify and share without restriction. For optimal interoperability, data and metadata files must be published in such a way as to be editable by humans and usable by machines, while remaining independent of language, technology and infrastructure. A first step is to make data available via bulk downloads in open data formats. There are various fully documented and widely accepted schemas for constructing digital data files, such as CSV, JSON, XML and GeoJSON, among others [4]; a short sketch after the list below illustrates two of them. In the context of open data, several catalogs list portals publishing public data [5]. Initiatives include:

• Transnational initiatives, such as:
  – the World Bank [6], one of the main promoters of open data sources;
  – the databases of the Food and Agriculture Organization (FAO) [7], which cover a wide range of topics related to food security and agriculture. These include:
    ∗ FAOSTAT, which provides free access to statistics on food and agriculture (including the crop and livestock sub-sectors, etc.);
    ∗ AQUASTAT, which gives users access to the main database of country statistics, focusing on water resources, water use and agricultural water management.
• Continental initiatives, such as:
  – openAFRICA, a volunteer-driven open data platform that aims to be the largest independent repository of open data on the African continent [8].
• National initiatives, such as:
  – the Benin open data portal [9].
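As a minimal illustration of these machine-readable formats, the following sketch (our own, built around a hypothetical observation record rather than data from any of the portals above) writes the same record as both CSV and JSON using only the Python standard library:

```python
# Publish one hypothetical observation in two open formats (CSV and JSON).
import csv
import json

observation = {
    "indicator": "unemployment_rate",  # illustrative record, not real portal data
    "area": "rural",
    "sex": "F",
    "period": "2018",
    "value": 2.6,
    "unit": "percentage",
}

# CSV: a header row plus one record, readable by humans and machines alike.
with open("observation.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(observation))
    writer.writeheader()
    writer.writerow(observation)

# JSON: the same record, independent of language, technology and infrastructure.
with open("observation.json", "w", encoding="utf-8") as f:
    json.dump(observation, f, ensure_ascii=False, indent=2)
```

Either file can be bulk-downloaded and parsed without any knowledge of the system that produced it, which is exactly the interoperability property open data formats aim for.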
2.2.2. Statistical Data and Metadata eXchange (SDMX)

SDMX is an international initiative aimed at standardizing and modernizing statistical data and metadata exchange mechanisms. The standard encompasses a data model (the multidimensional data cube), standard vocabularies (content-oriented guidelines), a formal schema definition, and various data serialization formats for building data files and electronic messages for data exchange. In the SDMX ecosystem, data providers can choose between different serialization formats for sharing datasets, including XML, CSV, JSON, or even EDIFACT [4].

These standards are implemented through new technologies such as application programming interfaces (APIs). APIs facilitate interaction between two different applications so that they can communicate with each other; they act as intermediaries. APIs use the Hypertext Transfer Protocol (HTTP) for cooperation between different programs and web services (REST or SOAP) [10]. They are reusable pieces of software that enable several applications to interact with an information system. They offer machine-to-machine access to data services and provide a standardized means of managing security and errors.
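As a hedged sketch of such a machine-to-machine call, the snippet below queries a hypothetical SDMX REST web service with the Python requests library. The host name, dataflow identifier and dimension key are placeholders of our own, but the /data/{flow}/{key} URL pattern and the period parameters follow the SDMX REST conventions:

```python
# Retrieve observations from a hypothetical SDMX REST endpoint.
import requests

BASE_URL = "https://stats.example.org/rest"  # placeholder SDMX web service
FLOW = "DF_UNEMPLOYMENT"                     # hypothetical dataflow identifier
KEY = "A.F.RURAL"                            # one code per dimension, dot-separated

response = requests.get(
    f"{BASE_URL}/data/{FLOW}/{KEY}",
    params={"startPeriod": "2018", "endPeriod": "2018"},
    # Content negotiation selects the serialization; SDMX-ML or SDMX-CSV
    # can be requested the same way with a different Accept header.
    headers={"Accept": "application/vnd.sdmx.data+json"},
    timeout=30,
)
response.raise_for_status()  # standardized HTTP error handling
dataset = response.json()    # observations keyed by their dimension coordinates
```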
APIs are therefore catalysts for interoperability. However, as their use increases, data security becomes a major concern.

2.3. Data security

Protecting sensitive data is an important part of data governance. It involves implementing measures and protocols to prevent unauthorized access to, leakage of, or manipulation of confidential information.

However, despite efforts to ensure data security, information leaks are still a reality. Owing to the rise in API-related vulnerabilities, the Open Web Application Security Project (OWASP), a foundation dedicated to improving software security, has been issuing its list of the top 10 web security vulnerabilities every two to three years since 2003. The OWASP foundation's separate rankings of the top 10 vulnerabilities for web applications and for APIs highlight the divergence between modern APIs and traditional web applications, which calls for a tailored security approach. The OWASP foundation's list of the top 10 API vulnerabilities in 2023 is as follows [11]:

• API-1: Broken Object Level Authorization (BOLA)
• API-2: Broken Authentication
• API-3: Broken Object Property Level Authorization
• API-4: Unrestricted Resource Consumption
• API-5: Broken Function Level Authorization (BFLA)
• API-6: Unrestricted Access to Sensitive Business Flows
• API-7: Server Side Request Forgery
• API-8: Security Misconfiguration
• API-9: Improper Inventory Management
• API-10: Unsafe Consumption of APIs

To prevent such situations, developers need to focus on writing secure code and ensuring that APIs are configured securely. To guarantee API security, it is essential to consider the three fundamental pillars of security: confidentiality, integrity and availability [12].

3. Material and methodology

The aim of this initiative is to ensure interoperability between databases and to facilitate the distribution, availability, use and reuse of information, while also attending to data security.

3.1. Material

A set of technological tools was used, including a development environment, programming languages and software tools.

3.2. Methodology

The process involves several stages.

3.2.1. Description of the design of conventional statistical database systems

Setting up a statistical database system involves a series of steps, summarized in Figure 1.

Figure 1: Illustration of the process of setting up a statistical database system.

These steps are mainly:

• Identifying and validating indicators: indicators are quantitative or qualitative measures used to assess the performance or state of a specific domain. They include statistics such as the unemployment rate, the economic growth rate, the number of new businesses created [13], and so on. This stage involves determining the statistical indicators to be monitored; indicators are chosen to measure phenomena or to evaluate the performance of an action.
• Identifying and validating the producers of indicator values: this involves identifying the data sources and the producers responsible for collecting the indicator values. Data producers include the United Nations (UN), the World Bank, the World Health Organization (WHO), etc.
• Identifying and validating the disaggregation levels of each indicator: this involves determining at what level of detail data will be collected and reported (e.g., by region, by gender, by age group, etc.).
• Database design: this involves creating a structure for storing indicators and their values in an organized and efficient way.
• Implementation of the web application for data collection: this involves developing a web application enabling data producers to submit their data systematically and securely. Tools can be developed to facilitate this stage. The World Bank provides Survey Solutions, a tool for producing data collection forms; however, Survey Solutions requires installing a server or using the World Bank's demo server, and it is limited to mobile terminals [14].
• Data production and validation: this involves validating data and integrating it into the database. To ensure data reliability, several levels of validation are often implemented; in our case, three levels of data validation were necessary before publication.

For example, an observation of the unemployment rate for women in rural areas for the year 2018 can be modeled in a database through the following parameters:

• Indicator: unemployment rate for women in rural areas
• Disaggregation levels:
  – Commune: BAN (Banikoara)
  – Department: ALI (Alibori)
• Observed value: 2.6% (percentage)
• Period: 2018
• Producer: the organization or entity responsible for collecting and publishing these statistical data

However, when it comes to integrating data with any type of indicator and several variable disaggregation levels, the task quickly becomes arduous, because it sometimes requires rebuilding the database. Hence the need for a standard that takes these complexities into account.
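To make this rigidity concrete, here is a minimal sketch of the conventional fixed-column design for the example above (the table and column names are hypothetical choices of ours): every disaggregation level is hard-wired as a column, so introducing a new level means altering the schema.

```python
# Conventional design: one fixed column per disaggregation level.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE observations (
        indicator  TEXT,  -- e.g., unemployment rate for women in rural areas
        department TEXT,  -- ALI (Alibori)
        commune    TEXT,  -- BAN (Banikoara)
        period     TEXT,  -- 2018
        value      REAL,  -- 2.6 (percentage)
        producer   TEXT
    )
""")
conn.execute(
    "INSERT INTO observations VALUES (?, ?, ?, ?, ?, ?)",
    ("unemployment_rate_rural_women", "ALI", "BAN", "2018", 2.6, "producer_org"),
)

# Adding a new disaggregation level (e.g., age group) forces a schema change,
# and an indicator with different levels would need its own table or database.
conn.execute("ALTER TABLE observations ADD COLUMN age_group TEXT")
```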
3.2.2. Modeling with SDMX

SDMX is an essential standard for simplifying statistical data modeling and supporting different types of databases, whatever their level of complexity. It uses a multidimensional approach based on the data cube model. The data cube model, also known as the Online Analytical Processing (OLAP) model, is a data modeling method designed to make data analysis and visualization more accessible by presenting data in multidimensional form. Data is organized in cubes, with each dimension representing a distinct aspect of the data.

A multidimensional data cube can be thought of as a model focused on the simultaneous measurement, identification and description of multiple instances of an entity type. A multidimensional dataset consists of several measurement records (observed values) organized along a group of dimensions (e.g., "period," "location," "gender," and "age group") [4]. Using this type of representation, it is possible to identify each individual data point by its "position" in a coordinate system defined by a common set of dimensions. In addition to measurements and dimensions, the data cube model can also incorporate metadata at the level of the individual data point in the form of attributes. Attributes provide the information needed to correctly interpret individual observations (e.g., an attribute may specify "percentage" as the unit of measurement).

The multidimensional data cube model can support data interoperability across many different systems, independent of their technology platform and internal architecture. Moreover, the content of a multidimensional data cube model need not be limited to small datasets. In fact, "with the advent of cloud and big data technologies, data cube infrastructures have become effective instruments for managing earth observation resources and services" [15].

Returning to the example of the unemployment rate for women in rural areas in 2018, its data cube representation can be identified by the following dimensions (Figure 2):

• Period = 2018
• Sex = female
• Geographical distribution = rural area
• Observed value = 2.6

Figure 2: Example of observation representation using the data cube model.

In this way, SDMX enables data to be clearly represented by associating measures with dimensions and attributes. SDMX data structures, known as Data Structure Definitions (DSDs), describe how data is organized by identifying the key dimensions, measures and associated attributes. SDMX also provides standardized terminology for naming commonly used dimensions and attributes, as well as code lists for populating some of them. More specifically, a DSD in SDMX describes the structure of a dataset by assigning descriptor concepts to statistical data elements, which include:

• dimensions, which form the unique identifier (key) of individual observations;
• measure(s), conventionally associated with the concept of "observation value" (OBS_VALUE); and
• attributes, which provide more information about a part of the dataset.

In addition, SDMX offers a set of globally agreed DSDs for different application domains, ensuring consistency and interoperability between statistical organizations [4]. This eliminates the need for a multitude of statistical databases: a single database is enough to federate all of an organization's data. The various players can then define the indicators and disaggregation levels to be encoded in the database, ready to receive any type of statistical data.
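The following conceptual sketch, a deliberately simplified rendering of our own (real DSDs are exchanged as SDMX artefacts, not Python objects), shows what a DSD declares for the unemployment example: the dimensions that form the observation key, the OBS_VALUE measure, and an interpretive attribute, each backed by an illustrative code list:

```python
# Simplified, illustrative view of a Data Structure Definition (DSD).
DSD_UNEMPLOYMENT = {
    "dimensions": {
        "PERIOD": None,                        # time dimension, free values
        "SEX": ["F", "M", "_T"],               # illustrative code list
        "GEO_AREA": ["RURAL", "URBAN", "_T"],  # illustrative code list
    },
    "measure": "OBS_VALUE",
    "attributes": {
        "UNIT_MEASURE": ["percentage"],  # tells readers how to read OBS_VALUE
    },
}

# One observation, identified by its position in the cube the DSD defines:
observation = {
    "key": {"PERIOD": "2018", "SEX": "F", "GEO_AREA": "RURAL"},
    "OBS_VALUE": 2.6,
    "attributes": {"UNIT_MEASURE": "percentage"},
}
```

Because every observation carries its full dimension key, new indicators or disaggregation levels only require new DSDs and code lists, not a new database schema.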
3.2.3. Modeling case with SDMX

Consider the indicator "Total number of support requests for multiple births met" [16]. The lack of an effective modeling framework forced the designer to define age ranges as variables, making it difficult to render the data. Using SDMX, the following dimension/attribute levels can be defined:

• SEX, with the code list: F, H, TOTALF, TOTALH
• TRANCHE_D_AGE, with the code list: 0-17-ANS, 18-34-ANS, 35-59-ANS, 60-ANS-PLUS
• TYPE_HANDICAP, with the code list: HMI, HMS, HA, HV, HM, AFH
• DEP, with the code list: ALI, ATA, etc.
• COM, with the code list: BAN, NATI, etc.

The data representation presented in [16] would then amount to the simplified representation of Table 1.

Table 1
Use of SDMX for improved data representation of an existing platform

Indicator | DEP | COM | SEX     | TRANCHE_D_AGE | TYPE_HANDICAP | Period | Value
1         | ALI | BAN | TOTAL_F | _T            | _T            | 2024   | 1

This has the advantage of a more compact representation and saves storage space by eliminating redundancy. The ability to create data structures that define the data representation offers great flexibility to data producers, who can define context-dependent data structures for the same indicator.
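As a hedged sketch of this remodeling (the input record is illustrative, not actual SiDoFFe-NG data, and the legacy column names are our own guesses at the old layout), the function below unpivots the per-age-range variables of the old representation into one SDMX-style row per observation, with TRANCHE_D_AGE as an ordinary coded dimension:

```python
# Unpivot a legacy "one column per age range" record into SDMX-style rows.
legacy_record = {
    "indicator": 1, "DEP": "ALI", "COM": "BAN", "SEX": "TOTAL_F",
    "period": "2024",
    # one variable per age range in the legacy model:
    "0_17_ANS": 1, "18_34_ANS": 0, "35_59_ANS": 0, "60_ANS_PLUS": 0,
}

AGE_CODES = ["0-17-ANS", "18-34-ANS", "35-59-ANS", "60-ANS-PLUS"]

def to_sdmx_rows(record):
    """Turn age-range columns into values of the TRANCHE_D_AGE dimension."""
    rows = []
    for code in AGE_CODES:
        rows.append({
            "Indicator": record["indicator"], "DEP": record["DEP"],
            "COM": record["COM"], "SEX": record["SEX"],
            "TRANCHE_D_AGE": code, "TYPE_HANDICAP": "_T",
            "Period": record["period"],
            "Value": record[code.replace("-", "_")],
        })
    return rows

for row in to_sdmx_rows(legacy_record):
    print(row)
```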
4. Results

SDMX provides the tools and standards needed to structure open data in a way that maximizes its usefulness and impact. Its use has enabled us to meet a number of challenges:

• eliminate the multiplicity of databases used to collect and process statistical indicators;
• put an end to the disparity and scattering of monitoring and evaluation data;
• provide an integrated database for storing indicators;
• effectively operationalize the statistics development strategy;
• establish a coherent and scalable governance system for statistical data;
• standardize data;
• expose interoperable SDMX APIs, which focus on retrieving metadata and data in XML, JSON and CSV formats and can be used as intermediaries between SDMX-standardized systems or platforms (such as the .Stat Suite);
• provide an environment for producing and disseminating statistical data.

This standardized data is available on a dedicated platform.

5. Conclusion

Data governance, data interoperability and data security are interdependent and essential elements in maximizing the value and minimizing the risks associated with the growing use of data in our society. Implementing the SDMX standard has enabled us to standardize data and to obtain interoperable SDMX APIs through which statistical data can be exchanged between different systems or platforms. The result is an information system based on this standard. There is no longer any need for a multitude of statistical databases; a single one is sufficient to federate all of an organization's data. The various players can then define the indicators and disaggregation levels to be encoded in the database, ready to receive any type of statistical data.

References

[1] N. Curien, P.-A. Muet, E. Cohen, M. Didier, G. Bordes, La société de l'information, La Documentation française, 2004.
[2] B. Otto, Organizing data governance: Findings from the telecommunications industry and consequences for large service providers, Communications of the Association for Information Systems 29 (2011) 3.
[3] A. Cooper, Learning analytics interoperability: the big picture in brief, Learning Analytics Community Exchange (2014).
[4] L. G. González Morales, T. Orrell, Data interoperability: A practitioner's guide to joining up data in the development sector, 2018.
[5] M. Traore, Les banques de données environnementales (n.d.).
[6] World Bank, World Bank open data, 2024. URL: https://data.worldbank.org/.
[7] Food and Agriculture Organization of the United Nations, Statistiques, 2024. URL: https://www.fao.org/statistics/fr.
[8] openAFRICA, Africa's largest volunteer driven open data platform, 2024. URL: https://open.africa/.
[9] Bénin, Un coup d'oeil sur les données du Bénin, 2024. URL: https://benin.opendataforafrica.org/.
[10] A. Soni, V. Ranga, API features individualizing of web services: REST and SOAP, International Journal of Innovative Technology and Exploring Engineering 8 (2019) 664-671.
[11] D. Timsina, L. Decker, Securing the next generation of digital infrastructure: The importance of protecting modern APIs (2023).
[12] H. Asemi, A study on API security pentesting (2023).
[13] G. Zakhidov, Economic indicators: tools for analyzing market trends and predicting future performance, International Multidisciplinary Journal of Universal Scientific Prospectives 2 (2024) 23-29.
[14] L. J. Young, G. Carletto, G. Márquez, D. A. Rozkrut, S. Stefanou, The production of official agricultural statistics in 2040: What does the future hold?, Statistical Journal of the IAOS 40 (2024) 203-210.
[15] S. Nativi, P. Mazzetti, M. Craglia, A view-based model of data-cube to support big earth data systems interoperability, Big Earth Data 1 (2017) 75-99.
[16] SiDoFFe-NG, Statistiques détaillées du domaine protection sociale et solidarité nationale, 2024. URL: https://2019a2024.sidoffe-ng.social.gouv.bj/sidoffepublic/stats/details/pssn.