=Paper=
{{Paper
|id=Vol-3135/dataplat_paper1
|storemode=property
|title=Unidata - A Modern Master Data Management Platform
|pdfUrl=https://ceur-ws.org/Vol-3135/dataplat_paper1.pdf
|volume=Vol-3135
|authors=Sergey Kuznetsov,Alexey Tsyryulnikov,Vlad Kamensky,Ruslan Trachuk,Mikhail Mikhailov,Sergey Murskiy,Dmitrij Koznov,George Chernishev
|dblpUrl=https://dblp.org/rec/conf/edbt/KuznetsovTKTMMK22
}}
==Unidata - A Modern Master Data Management Platform==
Sergey Kuznetsov (Unidata), Alexey Tsyryulnikov (Unidata), Vlad Kamensky (Unidata), Ruslan Trachuk (Unidata), Mikhail Mikhailov (Unidata), Sergey Murskiy (Unidata), Dmitrij Koznov (Saint-Petersburg State University), George Chernishev (Unidata, Saint-Petersburg State University)

Abstract

Organizations need to ensure the quality of data that is used for analytics and to maintain its consistency across multiple analytical and operational systems. Master data is a term that refers to domain-specific data concerning business objects crucial for organization operation, e.g., contracts, suppliers, employees, and so on. Usually, such source data is scattered across different applications throughout the organization and is of varying quality. Master Data Management (MDM) is a set of practices, information management methods, and data management tools intended for producing accurate, consistent, and complete master data. At the same time, data management tools play a vital role in setting up and supporting MDM-related processes. In this paper, we describe the Unidata platform: a toolkit for constructing MDM solutions. Its modular architecture allows constructing solutions tailored to a specific domain and case requirements. We start with a short introduction to MDM, discussing its aims, user-facing benefits, and approaches to working with data. Then, we describe the architecture of the Unidata platform and present its data storage approach and query processing algorithms. Finally, we discuss use cases and put forward our thoughts regarding the future of MDM systems.

Keywords: Master Data Management, Data Platform, Data Cleaning, Information Integration, Golden Record, Registry

1. Introduction

Contemporary companies and public institutions manage huge volumes of data, which is now a strategic asset [1]. Data became an enabler of organization business models and value propositions [2]. At the same time, each organization usually possesses a complex data ecosystem, which makes it hard to make use of the information contained in it. For example, a recent study [3] concerning local governmental organizations in Denmark showed that one of the significant obstacles to using this data is the lack of its overview. Thus, if providing even an overview is hard, then efficient use will require much more effort, which will include ensuring data quality (duplicate detection and data conflict resolution), setting up ETL pipelines, providing proper metadata management, tracking data lineage, and so on.

Master Data Management (MDM) solutions aim to address these technical issues by consolidating available information while interacting with the existing systems of the organization in a minimally intrusive way. As a research discipline, MDM is an actively developing area that concerns all aspects of enterprise data management practices. Its practical counterpart is closely monitored by Gartner, which releases yearly reviews describing established and promising products. The recent publication of the 2nd edition of the fundamental reference [4] describing all aspects of MDM is also worth noting.

Large organizations, which are the principal customers of MDM vendors, have many individual characteristics and, consequently, individual pressing tasks. Therefore, by implementing MDM they aim to fulfill these tasks, thus adhering to the organization pull strategy [5].
The other strategy — technology push — is essentially the large-scale adoption of a new technology based on the belief in its usefulness, instead of focusing on the particular tasks at hand. The organization pull approach significantly reduces the cost of MDM adoption by narrowing its application area. Moreover, its iterative nature enables a progressive build-up of MDM functionality via the step-by-step incorporation of different business areas of the organization. However, implementing MDM in the organization pull manner requires flexibility from the MDM toolkit used to build the solution, and major MDM toolkits fail to provide it, since they were initially built as monolithic systems.

Usually, the structure of large organizations is the result of lengthy historical processes, which may include several acquisitions and mergers of other companies or organizations. Another important feature of such organizations is a unique software ecosystem comprised of core technologies from various vendors (e.g. SAP, IBM, Microsoft). Next, specific priorities and issues of the organization should be taken into account as well. Finally, there are various external constraints which an organization has to follow and which define the structure of its business processes. For example, organizations in different countries may have to follow different rules, e.g. either a declaratory or an authorization system.

Next, the need to fulfill the specific tasks of the organization leads to a specific subset of the MDM functionality of the solution being "clipped out" from an abstract general-purpose MDM toolkit. This may be due to tuning to the organization's software ecosystem or to implementing specific functionality. Such functionality usually concerns areas that border on MDM or that may be completely external to it. Obviously, organizations do not want to have functionality and code that they do not need for the task at hand. Finally, there is a trend of implementing various enterprise applications in the cloud in order to lower expenses.

Multi-domain MDM, i.e. MDM covering several key business entity areas (products, vendors, employees, and so on), is the state-of-the-art [6] approach. Aside from the aforementioned generic domains, there may be a number of business-specific ones, such as patients in the case of a hospital or an inventory of land plots managed by a forestry. Furthermore, each customer may have its own set of domains, with its own specifics. In order to support combinations of different domains, an MDM platform should be able to describe them inside itself and be sufficiently independent from any particular one.

In this paper we describe the Unidata platform — the core technology of an industrial product line [7] that was used to produce dozens of target MDM solutions for large organizations. These solutions are focused on the specific needs of organizations, such as specialized business segments, heterogeneous IT infrastructure, and particular business tasks. The platform has received an honourable mention by Gartner in the "Magic Quadrant for Master Data Management Solutions 2021" [8].

The architecture of the Unidata platform aims to address the above-mentioned concerns. Its core ideas are as follows.

1. Tune-ability. An MDM platform should be adaptable to various software ecosystems, take into account the specific priorities and issues of the concrete organization, and comply with external constraints. This will ensure that target MDM solutions meet the organization's needs.
2. Component-based architecture. In order to provide the required flexibility, the MDM platform architecture has to follow the component-based principle. The platform should consist of modules — building blocks which can be freely combined in order to create efficient and specialized MDM solutions.
3. Metamodeling. A metamodel describing the models of individual areas can be used to handle combinations of different domains. This will provide the product with the required extensibility.
Building an MDM platform upon these principles makes it possible to achieve synergy between individual MDM instruments such as data quality, data provenance, metadata management, and so on. That is, building a solution from tools that are intended to work together from the start (as opposed to integrating a set of different tools) will result in shorter development times and solutions of better quality. Furthermore, it becomes possible to deliver different versions of solutions with different functionality (e.g. standard vs enterprise edition) and prices. The overall idea of this approach is presented in Figure 2. The core set of modules has been released as an open-source version of the platform (https://gitlab.com/unidata-community-group/unidata-platform-deploy), published under the GNU GPLv3 license.

Figure 2: Modular structure of the Unidata platform

Overall, the contributions of this paper are:

1. A description of the Unidata platform architecture.
2. A discussion of data storage and processing algorithms.
3. A list of use cases and our vision of the future of MDM.

The structure of this paper is as follows. We start with the description of the background in Section 2, where we provide basic definitions and describe user roles. Next, in Section 3 we discuss several use cases in order to illustrate the benefits that can be obtained by using such systems. All presented use cases are real and describe actual MDM deployments that were driven by the organization pull strategy. Section 4 contains the description of the platform architecture and its modules. The way data is stored and how queries are processed is discussed in Section 5. Next, we present our view on the future of MDM systems and describe related work in Sections 6 and 7. Finally, we conclude the paper with Section 8.
How- ever, new data is supplied by live data source sys- tems, i.e., new data is uploaded to the hub on a regular basis. 3. Centralization. This architecture is very similar to the previous one, but here the hub takes over data upload as well: i.e., data is uploaded once, and then all changes are performed on the hub itself, thus turning all systems that initially were Figure 2: Modular structure of the Unidata platform data sources into data consumers. 4. Coexistence. This architecture implements a com- bination of the Consolidation and Centralization and how queries are processed is discussed in Section 5. for different master data of an organization. Addi- Next, present our view on the future of MDM systems tionally, if some data fragments are not “movable”, and describe related work in Sections 6 and 7. Finally, we they can be handled using the Registry. conclude this paper with Section 8. All of them are supported by the Unidata platform. 2. Background 2.2. Basic definitions 2.1. Master Data Management Golden (master) record. According to [6], one of the core goals of MDM systems is to create and maintain a According to David Loshin [9], Master Data “are those single version of the truth for an entity. The information core business objects used in the different applications which constitutes it is stored in multiple sources and across the organization, along with their associated meta- thus, it should be assembled. All information concerning data, attributes, definitions, roles, connections, and tax- a particular entity is called the golden record. onomies”. Examples are product data, customer data, Data models and Metamodel. In order to support supplier data, location data, party data, reference data, a particular domain, the objects which the platform will asset data, employee data, ledger data, and vendor data. work with must be defined. A Metamodel consists of In turn, Master Data Management is “a set of practices, the description of the data itself (data schema) and re- information management methods, and data manage- lated procedures. For example, it is possible to add a ment tools to implement the policies, procedures, ser- supplementary metamodel of data quality for a particu- vices, and infrastructure to support the capture, inte- lar domain, e.g. specific duplicate detection procedures. gration, and subsequent shared use of accurate, timely, In other words, a metamodel specifies how individual consistent, and complete master data”. data models will be created and processed. The purpose of an MDM platform is to obtain data A registry is a collection of records that are related to from source information system(s), then process it to en- some entity, such as a person, an organization, etc. This sure data quality by, for example, performing data dedu- information comes from many different source informa- plication, filling in missing values, removing outdated tion systems. A registry has a schema which consists of information, etc. Eventually, it must obtain a golden a list of its attributes and nested entities. Similarly to ta- record for each item — an error-free version conforming bles, registries can have references to other registries, e.g. to the defined quality criteria. suppliers and items: each supplier can have references How exactly data is processed is defined by the MDM to items which it sells. implementation architecture/style [6, 10] (sometimes A lookup table is a referential table that contains data called the data hub architecture). 
System of a Record (SOR). During the golden record assembly process, the same attribute may turn out to be stored in several sources with different values. In this case, either a system of a record or a conflict resolution process should be set up. A SOR is a primary data source which contains the "true" value.

Validity Period is an interval in which the data describing an entity is valid. Each golden record may have several validity periods, which should be taken into account while querying the data. Moreover, there are two temporal dimensions: the time of an event and the time when this new version of the information was added into the system. This leads to the need for a special scheme for managing this information.

Figure 3: Representation of the Saint Petersburg name validity periods

Consider the example presented in Figure 3, which describes the history of Saint Petersburg's name changes. The Y-axis denotes update times, while the X-axis shows the validity periods. In this example, we assume that before 2019 the system contained only the basic version of the name information: from 1703 until the present time (2019), the city was called Saint Petersburg. Then, somewhere between 2019 and 2020, the knowledge of the Leningrad name was added into the system, and later, somewhere before 2021, a similar update was done for Petrograd.

A user may pose queries like "what was the name of the city in 1921?". There are two possible answers: up until 2020 (point 1) the system would have returned "Saint Petersburg", which was the correct answer at the time. Currently (point 2), it should output "Petrograd". Thus, the golden record should be constructed on the fly, taking into account updates in both temporal dimensions, which is done by looking into the origin history (discussed later).
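The bi-temporal lookup from this example can be sketched as follows: among all versions known to the system at the query time, pick the one whose validity period covers the event date, preferring the most recently added knowledge. This is only an illustration of the two temporal dimensions, assuming a simple in-memory list of versions; it is not the platform's actual query path.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Optional;

// A single stored version: what was true (validFrom..validTo)
// and when the system learned it (createdOn).
record NameVersion(String name, LocalDate validFrom, LocalDate validTo, LocalDate createdOn) { }

public class BiTemporalLookup {

    // "What was the name at eventDate, as known to the system at queryDate?"
    static Optional<String> nameAt(List<NameVersion> versions, LocalDate eventDate, LocalDate queryDate) {
        return versions.stream()
                .filter(v -> !v.createdOn().isAfter(queryDate))          // known at query time
                .filter(v -> !v.validFrom().isAfter(eventDate)
                          && !v.validTo().isBefore(eventDate))           // valid at event time
                .max((a, b) -> a.createdOn().compareTo(b.createdOn()))   // latest knowledge wins
                .map(NameVersion::name);
    }

    public static void main(String[] args) {
        List<NameVersion> history = List.of(
            new NameVersion("Saint Petersburg", LocalDate.of(1703, 5, 27), LocalDate.of(9999, 12, 31), LocalDate.of(2018, 1, 1)),
            new NameVersion("Leningrad",        LocalDate.of(1924, 1, 26), LocalDate.of(1991, 9, 6),   LocalDate.of(2019, 6, 1)),
            new NameVersion("Petrograd",        LocalDate.of(1914, 9, 1),  LocalDate.of(1924, 1, 26),  LocalDate.of(2020, 6, 1)));

        LocalDate eventDate = LocalDate.of(1921, 1, 1);
        // Point 1: before the Petrograd update became known to the system.
        System.out.println(nameAt(history, eventDate, LocalDate.of(2019, 12, 1))); // Saint Petersburg
        // Point 2: with the full history loaded.
        System.out.println(nameAt(history, eventDate, LocalDate.of(2021, 12, 1))); // Petrograd
    }
}
```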
2.3. Humans in the MDM system

From a high-level perspective, an MDM system introduces a new role for a human user — a Data Steward. According to [9], a data steward is a person responsible for collecting, collating, and evaluating issues and problems with the data and the data life cycle. Their duties include managing standard business definitions and metadata. Finally, a data steward is the person who is accountable for the quality of the data. However, in practice, a data steward is not necessarily a single person but rather a group of several people, each of whom may be responsible for a different key business area or even a part of it. The Unidata platform supports two types of users that implement data stewardship:

• Administrator manages the platform in general. Their duties include data model administration, classifier administration (in this paper, "classifier" refers mainly to a product class hierarchy, such as the Global Product Classification [11]), managing rules of duplicate detection, and so on.
• Data operator manages records (e.g. database population), registries, and lookup tables, and participates in related business processes.

Consequently, there are two types of interfaces, one for each type of user. Finally, the Unidata platform supports a role-based access model, where each user may be assigned rights to perform operations (e.g. CRUD) on various objects.

3. Use Cases

All of the presented use cases are real projects that were completed using the Unidata platform and driven by the organization pull strategy. They clearly illustrate the concrete goals of organizations, as well as the benefits the platform has provided to the final users. They also demonstrate that each particular deployment needs a different set of data governance tools, i.e. they highlight the importance of the component-based architecture, which we have discussed in the Introduction.

3.1. Case 1: managing logistical resources

The first case is a system for managing the logistical resources of a large company in the energy sector. The covered domains included raw materials, equipment, and replacement parts. The purpose of this system was to:

• provide high-quality data for business processes that cover equipment maintenance, repairs, and inventory management;
• consolidate available information using data standardization and unification.

Additionally, this project succeeded in automating complex company regulations that involved more than ten different divisions. Furthermore, a classifier of logistical resources was deployed.

3.2. Case 2: product catalogue for a telecommunication company

The next case is the development of a product catalogue for a large telecommunication company. The goals of the project were to consolidate:

• information regarding product offers for different customer segments,
• information related to service availability,
• financial information from the billing system and accounting records.

As a result, a product hierarchy (a product tree) that contains various product details (including financial ones) was constructed. It is now utilized by the sales department and financial officers.

3.3. Case 3: data consolidation for a transport company

This project was dedicated to the consolidation of items and services purchased by a large transport company. The goal was to unify information contained in several different product classifiers and to produce a list of services offered by contractors.

The stakeholder of this MDM solution was the procurement department of the company. After the deployment, items and services that had different prices were identified and analyzed. This resulted in the creation of monetary metrics, i.e., calculated total savings on purchases. The system made it possible to perform automatic purchases at the minimal available price. This data consolidation project relied on the publish/subscribe model.
3.4. Case 4: data enrichment for a fashion vendor

This project concerned an MDM system for a company that sells fashion products. The system needed to perform client base segmentation and sales support in the premium segment. The aim of the system was to find those clients in the client database that have popular social media accounts and many followers. The company wanted to improve their loyalty by offering additional discounts and performing various other actions in order to obtain more customers from their follower bases. The core data governance tools used were data enrichment and consolidation.

3.5. Case 5: smart personal account

The goal of this project was to create a smart personal account of a city resident. It was necessary to integrate it with various federal and regional information systems in order to enable exporting relevant information concerning vehicles, real estate, bank accounts, and so on. The primary focus of this project was access control, security, and ensuring real-time as well as publish/subscribe master data acquisition.

3.6. Case 6: energy and heavy industry company

This project was developed for a multi-sector international company focused on energy and heavy industry. Since this company has hundreds of thousands of clients from all over the world, the procedure of adding a new client was very complicated: before applying our MDM solution, it took 21 days on average, and afterwards it shortened to only eight days. The solution automated various checks, found final beneficiaries in corporate hierarchies, and centralized information input. This project mainly concerned data inventory, data quality (duplicate search), and implementing real-time access to master data in order to speed up a particular business process.

4. Architecture

Now, let us turn to the architecture of the Unidata platform. It follows the component-based principle, which means its building blocks (modules) can be freely combined with each other in order to obtain a solution with the desired set of features.

4.1. Preliminaries

A module is a self-sufficient set of functionality that is intended for solving a particular problem. Each module contains a number of services that cover parts of this functionality. For example, the Meta module, which encompasses all metadata-related activities, contains services that cover managing lookup tables, registries, units of measurement, enumerations, etc.

Modules have rules of creation (a contract) and behavior, and they can interact with other modules by being a part of a pipeline. In a broad sense, a pipeline is a sequence of operations which are performed either on the data or on the model. Pipelines implement dataflows, which consist of service calls and may contain utility nodes such as branching, parallelism (applying an operation to each record inside a batch), calling another pipeline, and so on.

Each module contains services that work with the data and services that interact with and modify the model. Therefore, pipelines may modify not only the data, but the model itself too. Thus, modules and pipelines implement the tune-ability and composition aspects discussed in the Introduction.
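The pipeline notion can be illustrated with a minimal sketch: a pipeline as a list of record-level operations composed into a single function, with parallelism applied per record in a batch. The types below (RecordOp, Pipeline) are invented for illustration and do not correspond to the platform's actual pipeline API.

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical pipeline sketch: a record is just an attribute map,
// and a pipeline is a sequence of per-record operations.
public class PipelineExample {

    interface RecordOp extends UnaryOperator<Map<String, String>> { }

    record Pipeline(List<RecordOp> steps) {
        Map<String, String> run(Map<String, String> record) {
            Map<String, String> current = record;
            for (RecordOp step : steps) {
                current = step.apply(current);      // service calls executed in sequence
            }
            return current;
        }

        // Utility node: apply the whole pipeline to each record of a batch in parallel.
        List<Map<String, String>> runBatch(List<Map<String, String>> batch) {
            return batch.parallelStream().map(this::run).toList();
        }
    }

    public static void main(String[] args) {
        RecordOp normalizeName = r -> Map.of("name", r.get("name").trim().toUpperCase());
        Pipeline pipeline = new Pipeline(List.of(normalizeName));
        System.out.println(pipeline.runBatch(List.of(Map.of("name", " acme inc "))));
    }
}
```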
4.2. Architectural Overview

Figure 4: Architecture of the Unidata platform

The overall architecture of the Unidata platform is presented in Figure 4. It consists of four main components:

1. Platform Core contains basic modules that implement core services of the platform and minimally depend on other modules. These services are: system boot, job batches, and data types. Modules of this component are rarely modified and are never adapted for a particular deployment or domain.
2. Storages covers everything related to the data stores that are employed by the platform. These modules make it possible to abstract the functionality required by the platform from the specific DBMS and to restrict and simplify it (as not all DBMS functionality is required by an MDM solution).
3. MDM includes all modules that implement basic MDM functionality, such as metadata management, rules for computing master record alternatives, data quality management, duplicate detection, and business process implementation. This component is responsible for integrating and synchronizing all necessary data stores that reside on the previous level.
4. Extra MDM implements advanced MDM functionality. It contains modules that are either advanced variants of some existing module from the MDM component, or modules that provide functionality usually implemented "outside" of MDM. An example of the first case is the following: Match Extra is a sophisticated machine-learning-based inexact match module, while the MDM match module is a straightforward exact one. The second case is illustrated by the data delivery set: usually, in MDM systems the ETL is implemented separately as a standalone application, whereas in our case it is possible to have it inside the system. The same idea is employed with the Pub/Sub module, which makes it possible to implement sophisticated patterns of sending records to various consumers. This is an advanced component which is not present in the open-source version.

These components are organized in a hierarchical way, which means that:

• components residing in the bottom levels of the figure are more low-level: they contain basic features essential for implementing high-level functionality, and
• components interact with each other in a hierarchical way, i.e. their interactions rarely "jump" over the immediate neighbor.

4.3. Entities, modules, and their relationships

In this section, we overview the core entities that the system works with, as well as important modules and their functionality relevant to these entities.

System. All modules that constitute the platform share a single interface that contains methods of initialization, configuration, launch, and verification. The system module orchestrates the system boot process and ensures that all modules have everything necessary for trouble-free launch and operation.
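The shared module contract described above can be sketched as a small lifecycle interface, with the system module iterating over registered modules and driving them through these phases. The interface and method names below are assumptions made for illustration, not the platform's real contract.

```java
import java.util.List;
import java.util.Map;

// Hypothetical module lifecycle contract (illustrative only).
interface Module {
    String name();
    void configure(Map<String, String> settings);  // apply deployment-specific settings
    void start();                                   // launch the module's services
    boolean verify();                               // check that the module is operational
}

// A minimal "system" orchestrator that boots modules in registration order.
public class SystemBoot {
    public static void boot(List<Module> modules, Map<String, String> settings) {
        for (Module m : modules) {
            m.configure(settings);
            m.start();
            if (!m.verify()) {
                throw new IllegalStateException("Module failed to start: " + m.name());
            }
        }
    }

    public static void main(String[] args) {
        Module meta = new Module() {
            public String name() { return "meta"; }
            public void configure(Map<String, String> s) { }
            public void start() { System.out.println("meta started"); }
            public boolean verify() { return true; }
        };
        boot(List.of(meta), Map.of());
    }
}
```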
Data Types. An attribute is a basic entity representing some key aspect of a stored object, similar to attributes in RDBMS tables. In Unidata, registries, lookup tables, and references may have attributes. There are three attribute types in the system:

1. Simple attribute is a basic data type which describes some aspect of an entity. Its type can be: string, numeric, boolean, file, date and time, reference, or enum. An enum is a domain-specific enumeration which describes mutually exclusive states of a real-life entity, e.g. a subject which can be either a legal person, a natural person, or a self-employed person.
2. Array attribute is used to represent a series of similar entities, such as property owners.
3. Complex attribute is used to represent nested tree-like structures. It can contain simple attributes, arrays, and other complex attributes.

The Unidata platform also supports references, which can be of the following types:

1. Reference — a reference which ensures that for each individual record there can be only one referenced object per validity period.
2. Many-to-many — a classic many-to-many reference.
3. Contains — a reference which is created by the user pointing to an entity with all attributes filled in. This type is used to define entities which do not exist without their parent entity.

The platform lets the user browse both direct and backward references (those that point to a specific record). It also provides rich search capabilities that allow the user to query the attributes of references.
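The three attribute kinds form a small recursive data model: simple values, arrays of values, and complex attributes that nest other attributes. The sealed-interface sketch below is an assumed, simplified encoding of that hierarchy rather than the platform's actual type system.

```java
import java.util.List;

// Hypothetical encoding of the three attribute kinds (illustrative only).
sealed interface Attribute permits SimpleAttribute, ArrayAttribute, ComplexAttribute { }

record SimpleAttribute(String name, Object value) implements Attribute { }

record ArrayAttribute(String name, List<Object> values) implements Attribute { }

// A complex attribute nests other attributes, forming a tree.
record ComplexAttribute(String name, List<Attribute> children) implements Attribute { }

public class AttributeExample {
    public static void main(String[] args) {
        Attribute owner = new ComplexAttribute("owner", List.of(
                new SimpleAttribute("name", "ACME Inc."),
                new SimpleAttribute("subjectType", "LEGAL_PERSON"),          // enum-like value
                new ArrayAttribute("phones", List.of("+7 111", "+7 222"))));
        System.out.println(owner);
    }
}
```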
The Core module contains interfaces and abstract implementations for all data types that the platform uses. It also implements metamodel support, the need for which was discussed in the Introduction. Finally, this module provides additional services which may be used by other modules, such as roles, logging, license verification, and so on.

The next important entity is a draft. Drafts are an essential part of MDM, since master data needs to be synchronized over data producers (sources) and data consumers. Such synchronization frequently results in the creation of temporary intermediates. These need to pass various reconciliation procedures and be agreed upon in order to become fair copies. Drafts may also emerge as a result of conflicts that arise during the record consolidation process, after various data quality procedures, etc. Drafts may concern individual items or the model. Drafts can have revisions; all of them are stored in the database in a serialized form. Therefore, drafts need general support: operations such as creation, publication (transforming into a fair copy), and merge must be implemented. Draft is the module that enables all kinds of drafts in the platform.

Logging capabilities. The Unidata platform supports extensive and configurable logging. It is possible to record user actions such as record upsert, user login or logout, etc. It is also possible to select the logging level: coarse-grained logging records errors only, while fine-grained logging also records search and browsing.

Storages and Adapters. For its operation, Unidata needs a number of storages, e.g. data storage, graph storage, match storage, and so on. In order to achieve flexibility and independence from a particular database vendor, a collection of adapters was implemented. For example, the platform can use either Neo4J or OrientDB for graph storage.

Match storage is a module that concerns data representation for performing data deduplication. There is a separate representation for both records and clusters, which differs from the original data by, for example, omitting fields that do not participate in the matching. For the deduplication itself, matching engines such as Senzing, Elasticsearch, or an RDBMS can be used.

Match and Match Extra. While match storage implements basic matching functionality, these modules implement deduplication over MDM entities: the data itself, drafts, business processes, etc. The Match Storage module does not possess any specifics regarding these entities, and therefore separate modules are needed. These modules also let the user define deduplication procedures and manage them. The Match module operates with rules, while Match Extra employs various machine learning approaches.

Workflow and Workflow Extra. In MDM, it is frequently necessary to implement various business processes that may involve various officials and span multiple departments (e.g. resolving a surname conflict in various personal documents). To run such workflows, a subset of the BPMN 2.0 standard that includes events, activities, and gateways is supported. For this, integration with several third-party engines such as Activiti BPMN and Camunda BPMN is implemented. Workflow Extra extends the basic capability by enabling the implementation of various scenarios involving machine learning approaches. It also provides the ability to perform SLA enforcement for data operations.

Meta is a module that is responsible for metamodel (schema) management, such as creating and editing entities (attributes, references, etc). It provides a GUI that allows users of the MDM solution (data stewards) to work with the metamodel while automatically generating the corresponding API for data access.

Data Quality and Complex Data Quality. These modules are responsible for ensuring data quality — checking data for errors and performing data enrichment. While the Data Quality component works with a single record, Complex Data Quality may involve several records (e.g. checking aggregate values).

The core instruments are enrichment and validation rules. An enrichment rule enables using the attributes of other registries in order to generate new attributes of the target registry. A validation rule does not generate any data; instead, it generates an error if it is violated by the input. Each rule can have a number of ports, which may be either incoming or outgoing. Rules take input from the incoming ports and produce results into the outgoing ones.

In the UI, a user can construct rules using a special UPath language (an XPath derivative) and a number of pre-defined functions such as string manipulation functions (case conversion, concatenation), boolean functions, and so on. It is possible to upload custom rules implemented in Java or Python. Finally, rules can be grouped into sets and applied on a per-set basis.
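Since custom rules can be written in Java, the following sketch shows what a rule with incoming and outgoing ports might look like: a validation rule that reports an error without producing data, and an enrichment rule that derives a new attribute from an existing one. The Rule abstraction and port encoding are assumptions for illustration, not the actual extension interface of the platform.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical rule contract: incoming ports map to attribute values,
// outgoing ports carry either derived values or an error (illustrative only).
interface Rule {
    Map<String, Object> apply(Map<String, Object> incomingPorts);
}

public class RuleExample {

    // Validation rule: produces no data, only an error on violation.
    static final Rule nonEmptyName = in -> {
        String name = (String) in.get("name");
        return (name == null || name.isBlank())
                ? Map.of("error", "attribute 'name' must not be empty")
                : Map.of();
    };

    // Enrichment rule: derives a display name from a country code,
    // as if looked up in a referenced lookup table.
    static final Rule countryName = in -> {
        Map<String, String> countries = Map.of("RU", "Russia", "GB", "United Kingdom");
        return Optional.ofNullable(countries.get((String) in.get("countryCode")))
                .<Map<String, Object>>map(v -> Map.of("countryName", v))
                .orElse(Map.of("error", "unknown country code"));
    };

    public static void main(String[] args) {
        System.out.println(nonEmptyName.apply(Map.of("name", "")));         // error port filled
        System.out.println(countryName.apply(Map.of("countryCode", "RU"))); // {countryName=Russia}
    }
}
```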
Classifiers. This module employs classifiers for master data. Classifiers are tree-like data structures in which each node describes some entity and may contain various attributes (e.g. see [11]). They are frequently used in MDM domains, for example, to describe a hierarchy of product types. The Unidata platform enables interactions of master data with such classifiers (e.g. arranging arriving data according to a classifier) and supports various operations on the classifiers themselves, with versioning.

Data Catalog implements a comprehensive body of knowledge concerning all data of the organization. It includes provenance, current storage location, the data's domain, its use, its relation to other data, and so on. This module is of critical importance, since an organization may have hundreds of source information systems, and manual tracking of such information would be impossible.

5. Data Storage and Processing

5.1. Requirements

The specifics of the platform's application scope lead to the following requirements imposed on the storage and processing of data.

1. Tombstone deletes. Information should be deleted only when it is absolutely necessary. Real deletions should be performed only by an administrator, while regular users should not actually delete data. Instead, if a regular user attempts to perform a deletion, the data should be marked as deleted and the system should take this into account.
2. Versioning support should be pervasive. Users and administrators should be allowed to reconstruct previous versions of any object. At the same time, querying using validity periods, discussed earlier in Section 2.2, should be supported.
3. Provenance (traceability) should be provided for any operation. For example, if a bulk-loading operation inserted records into a database, it should be possible to reverse it by removing the newly inserted records while keeping the rest.

These points should be fulfilled for all data handled by the system, even for manually entered data.

5.2. Storage

In order to meet these requirements, an approach based on the following three logical tables was used. Each table represents an entity:

• Etalon is the metadata of the golden record itself.
• Origin is the metadata related to the system from which the record originates.
• Vistory (version history) holds the validity periods of an origin, which in turn may have revisions.

Tables describing these entities share the following attributes:

1. Id — a unique identifier of an object.
2. Shard — an identifier of the shard where the record resides. Each of these logical tables may be physically represented by a collection of horizontal partitions (shards).
3. Status — may be ACTIVE, INACTIVE, or MERGED. These values describe the status of the record and support tombstone deletes. INACTIVE means that the record was deleted, while ACTIVE indicates that it is valid. The MERGED status indicates records that were used in the duplicate resolution process and no longer contain valid information.
4. Create_date and Created_by — when and by whom this object was created.

Figure 5: Tables used for data storage in the Unidata platform

The relations between these tables are shown in Figure 5. The links with empty arrowheads denote "shared" attributes, and full arrowheads show PK-FK relationships.
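A compact way to picture the three logical tables is as plain record types sharing the common attributes; the sketch below also shows how a tombstone delete is just a status change. Field sets are abridged and names are adapted for illustration; they do not reproduce the exact physical schema.

```java
import java.time.Instant;
import java.time.LocalDate;

// Status values shared by all three logical tables; INACTIVE implements tombstone deletes.
enum Status { ACTIVE, INACTIVE, MERGED }

// Abridged sketches of the three logical tables (not the exact physical schema).
record Etalon(String id, int shard, Status status, Instant createDate, String createdBy,
              String registryName, Instant updateDate, String updatedBy, String operationId) { }

record Origin(String id, int shard, Status status, Instant createDate, String createdBy,
              String etalonId, String sourceSystem, String externalId, boolean enrichment) { }

record Vistory(String id, int shard, Status status, Instant createDate, String createdBy,
               String originId, int revision, LocalDate validFrom, LocalDate validTo,
               String dataXml, String operationId) { }

public class StorageExample {
    public static void main(String[] args) {
        Vistory v = new Vistory("v1", 0, Status.ACTIVE, Instant.now(), "loader",
                "o1", 1, LocalDate.of(1703, 5, 27), LocalDate.of(9999, 12, 31),
                "<city>Saint Petersburg</city>", "op-42");

        // A regular user "deletes" the validity period: only the status flag changes.
        Vistory tombstoned = new Vistory(v.id(), v.shard(), Status.INACTIVE, v.createDate(),
                v.createdBy(), v.originId(), v.revision(), v.validFrom(), v.validTo(),
                v.dataXml(), v.operationId());
        System.out.println(tombstoned.status());
    }
}
```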
The etalon table additionally stores:

1. The name of the registry or lookup table it refers to.
2. The update_date and update_by attributes, which contain the date of the last update and the last user who updated this data, respectively.
3. The operation_id attribute, which contains the identifier of the operation during which this record was created. It is needed to support the provenance requirement.

Apart from the basic set of attributes, the origin table has the following additional ones:

1. Etalon_id — an identifier of the etalon to which this origin belongs.
2. Source_system — an identifier of the system where this record came from.
3. External_id — an identifier of the record in the system where this record came from.
4. Enrichment — a boolean flag which shows whether this origin is a result of an enrichment rule.
5. Similarly to the etalon, the update_date and update_by attributes, which bear the same semantics but pertain to this particular origin.

Finally, the vistory table contains the data itself. Its important attributes (apart from the basic ones) are the following:

1. Origin_id — a reference to the origin whose part of the version history is contained in this record.
2. Revision — a version number, which is needed to ensure versioning, as described in Section 5.1.
3. ValidFrom and validTo — the validity period of the vistory entry.
4. Data_b — serialized data in the XML format.
5. Operation_id, similar to the one in the etalon table. It can be used, for example, to find and cancel a modification action. However, in this case, cancelling will affect only this particular update, while in the etalon case it will cancel the creation of the whole record.

Note that the vistory table contains no attributes concerning updates, since a new record is formed for each update. Next, setting the status attribute to INACTIVE in the vistory table makes it possible to mark a validity period as "deleted".

To illustrate the idea, Table 1 shows a vistory table fragment for the city name example described in Figure 3. There are three records and all of them belong to a single origin. Therefore, no data conflicts are possible and the process of calculating the etalon data is rather straightforward. First, it is necessary to perform a cross-product of all time periods, and then to select the row that fits the necessary combination of creation and queried dates. The result of the period cross-product for the considered example is shown in Table 2. Thus, it is possible to find an answer for points 1–4.

Table 1: Vistory representation of the city name example from Figure 3

| city_name        | rev | validFrom  | validTo    | createDate |
|------------------|-----|------------|------------|------------|
| Saint-Petersburg | 1   | 27.05.1703 | 31.12.9999 | 01.01.2018 |
| Leningrad        | 2   | 26.01.1924 | 6.09.1991  | 01.06.2019 |
| Petrograd        | 3   | 1.09.1914  | 26.01.1924 | 01.06.2020 |

Table 2: Validity periods calculation

| city_name        | rev | validFrom  | validTo    | createDate |
|------------------|-----|------------|------------|------------|
| Saint-Petersburg | 1   | 27.05.1703 | 1.09.1914  | 01.01.2018 |
| Petrograd        | 2   | 1.09.1914  | 26.01.1924 | 01.06.2020 |
| Leningrad        | 3   | 26.01.1924 | 6.09.1991  | 01.06.2019 |
| Saint-Petersburg | 1   | 6.09.1991  | 31.12.9999 | 01.01.2018 |

Finally, there are complex rules of attribute interaction inside the etalon → origin → vistory hierarchy. For example, if adding a new revision of a particular origin leads to a recalculation of the etalon, then its update time is recalculated too. Next, status updates are propagated in a bottom-up fashion: e.g., if INACTIVE is set for a vistory entry, then its etalon will have to be recalculated and may have to be set INACTIVE too. However, if an etalon is set INACTIVE, then there is no need to set all of its vistory entries INACTIVE, since they will never be reachable for queries.
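The period cross-product from Table 1 to Table 2 can be sketched as follows: collect all period boundaries, split the timeline at them, and for each resulting sub-period keep the value from the most recently created vistory entry that covers it. This is an illustrative reimplementation of the idea, not the platform's actual code.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

record Version(String value, LocalDate validFrom, LocalDate validTo, LocalDate createDate) { }

public class PeriodCrossProduct {

    // Split the timeline at every boundary and pick the newest covering version per sub-period.
    static List<Version> crossProduct(List<Version> versions) {
        TreeSet<LocalDate> bounds = new TreeSet<>();
        for (Version v : versions) {
            bounds.add(v.validFrom());
            bounds.add(v.validTo());
        }
        List<LocalDate> b = new ArrayList<>(bounds);
        List<Version> result = new ArrayList<>();
        for (int i = 0; i + 1 < b.size(); i++) {
            LocalDate from = b.get(i), to = b.get(i + 1);
            versions.stream()
                    .filter(v -> !v.validFrom().isAfter(from) && !v.validTo().isBefore(to))
                    .max(Comparator.comparing(Version::createDate))
                    .ifPresent(v -> result.add(new Version(v.value(), from, to, v.createDate())));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Version> vistory = List.of(
            new Version("Saint-Petersburg", LocalDate.of(1703, 5, 27), LocalDate.of(9999, 12, 31), LocalDate.of(2018, 1, 1)),
            new Version("Leningrad",        LocalDate.of(1924, 1, 26), LocalDate.of(1991, 9, 6),   LocalDate.of(2019, 6, 1)),
            new Version("Petrograd",        LocalDate.of(1914, 9, 1),  LocalDate.of(1924, 1, 26),  LocalDate.of(2020, 6, 1)));
        crossProduct(vistory).forEach(System.out::println);   // reproduces the rows of Table 2
    }
}
```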
The same approach is used to represent not only records, but other entities as well, e.g. references and classification results. It is possible to create a golden record for a reference. Consider a case where a reference has two origin systems, the first containing one set of values (set1) and the second another (set2). Using the BVR or BVT algorithm, it is possible to create a golden record by, for example, intersecting them.

5.3. Query Processing: the BVR and BVT algorithms

Each vistory entry has a date of creation and a validity period, which are used to construct golden records, as was demonstrated in the previous section. However, what if there are two origin systems which have the same attribute and there is a data conflict, i.e. each system reports a different value? In other words, how is a SOR selected for this attribute?

For this, two special algorithms — BVR (Best Value Record) and BVT (Best Value of the Truth) — were devised. The BVR algorithm is used to construct a golden record by resolving data conflicts for all attributes of an etalon using a set of weights, one for each origin system. The general idea of this algorithm is to pick values from the source systems that have higher weights.

More formally, the BVR algorithm is as follows:

1. For each origin, obtain its latest version, except the ones that have the MERGED status.
2. For each source_system, select the latest version according to its creation_date.
3. The golden record is created out of the record that pertains to the source_system with the maximum weight.
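A minimal sketch of the BVR idea, under the simplifying assumption that each origin's latest non-MERGED version has already been selected: the whole golden record is taken from the version whose source system has the highest weight. The names and structures are illustrative, not the production implementation.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// One candidate version per origin (latest, non-MERGED), with its attribute values.
record Candidate(String sourceSystem, Map<String, String> attributes) { }

public class BvrExample {

    // BVR: take the entire record from the source system with the maximum weight.
    static Optional<Map<String, String>> bvr(List<Candidate> candidates, Map<String, Integer> sourceWeights) {
        return candidates.stream()
                .max(Comparator.comparing(c -> sourceWeights.getOrDefault(c.sourceSystem(), 0)))
                .map(Candidate::attributes);
    }

    public static void main(String[] args) {
        List<Candidate> candidates = List.of(
            new Candidate("source1", Map.of("name", "John", "city", "Saint-Petersburg", "year_of_birth", "1991")),
            new Candidate("source2", Map.of("name", "John", "year_of_birth", "1993")));

        // Weights from the example in Table 3: source1 = 50, source2 = 100.
        System.out.println(bvr(candidates, Map.of("source1", 50, "source2", 100)));
        // => the whole record of source2 is selected, city remaining empty
    }
}
```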
The BVT algorithm is used to construct the golden record when the system administrator wishes to form it on a per-attribute basis. The algorithm itself is as follows:

1. Similarly to BVR, obtain the latest version for each origin, except origins that have the MERGED status.
2. For each attribute, sort the obtained versions according to the weights of the sources and update_date.
3. Compute the value of each attribute by iterating over the versions obtained in the previous step:
   a) if the value of the attribute is not null, then use it for etalon construction;
   b) otherwise, proceed to the next version.

Note that the BVT algorithm is meant to be more robust to null values and therefore handles them differently.

To illustrate both algorithms, let us consider the example presented in Table 3, where the rows selected at each step are shown separately. The first rows contain the initial data; note that for presentation purposes we have joined all three tables that contain it.

Our first step (regardless of the selected algorithm) is to select the most recent vistories for each origin, which is done inside the DBMS. In the second step it is necessary to compute the validity periods, which is done in Java code (all subsequent computations are performed in Java code, too). Applying the cross-product to the validFrom and validTo attributes, we obtain three periods: (1989–2000), (2000–2005), and (2005–9999).

One can note that there is a data conflict: two of the resulting rows concern the same period (2000–2005) and contain different values. The periods of the other rows have no alternatives and therefore may be used as is.

In order to resolve the conflict, the BVR algorithm requires the weights of the source systems. Suppose that they are as follows: source1 = 50, source2 = 100. In this case, the record coming from source2 will be selected with all of its attributes.

The BVT algorithm additionally requires a set of attribute weights. In this example, we have weights for a single attribute — year_of_birth — which are as follows: source1 = 100, source2 = 50. These attribute weights override the source weights that act over the whole record. Therefore, the year_of_birth attribute will be set to 1991, while the name attribute will be set to John.

The BVT algorithm also follows the "null is not a value" rule: it never picks a null value, even if an attribute rule tells it to. This is why Saint-Petersburg is selected for the city attribute. This rule is absent in the BVR case, which is why the BVR result leaves the city attribute empty for the conflicting period.

Table 3: BVT and BVR illustration

Initial data:

| etalon_id | origin_id | source_system | name | city             | year_of_birth | validFrom | validTo | rev | create_date |
|-----------|-----------|---------------|------|------------------|---------------|-----------|---------|-----|-------------|
| etalon1   | origin1   | source1       | John | Moscow           | 1991          | 2000      | 9999    | 1   | 9.11.2021   |
| etalon1   | origin1   | source1       | John | Saint-Petersburg | 1991          | 2000      | 9999    | 2   | 10.11.2021  |
| etalon1   | origin2   | source2       | John |                  | 1992          | 1989      | 2005    | 1   | 8.11.2021   |
| etalon1   | origin2   | source2       | John |                  | 1993          | 1989      | 2005    | 2   | 11.11.2021  |

Step 1. Selecting the actual vistories (result):

| etalon_id | origin_id | source_system | name | city             | year_of_birth | validFrom | validTo | rev | create_date |
|-----------|-----------|---------------|------|------------------|---------------|-----------|---------|-----|-------------|
| etalon1   | origin1   | source1       | John | Saint-Petersburg | 1991          | 2000      | 9999    | 2   | 10.11.2021  |
| etalon1   | origin2   | source2       | John |                  | 1993          | 1989      | 2005    | 2   | 11.11.2021  |

Step 2. Calculation of the validity periods via the cross-product, giving (1989–2000), (2000–2005), (2005–9999); the result in the vistory form:

| etalon_id | origin_id | source_system | name | city             | year_of_birth | validFrom | validTo | rev | create_date |
|-----------|-----------|---------------|------|------------------|---------------|-----------|---------|-----|-------------|
| etalon1   | origin1   | source1       | John | Saint-Petersburg | 1991          | 2000      | 2005    | 2   | 10.11.2021  |
| etalon1   | origin1   | source1       | John | Saint-Petersburg | 1991          | 2005      | 9999    | 2   | 10.11.2021  |
| etalon1   | origin2   | source2       | John |                  | 1993          | 1989      | 2000    | 2   | 11.11.2021  |
| etalon1   | origin2   | source2       | John |                  | 1993          | 2000      | 2005    | 2   | 11.11.2021  |

Step 3. Calculation of the golden record. Records participating in consolidation (the conflicting period 2000–2005):

| etalon_id | origin_id | source_system | name | city             | year_of_birth | validFrom | validTo | rev | create_date |
|-----------|-----------|---------------|------|------------------|---------------|-----------|---------|-----|-------------|
| etalon1   | origin1   | source1       | John | Saint-Petersburg | 1991          | 2000      | 2005    | 2   | 10.11.2021  |
| etalon1   | origin2   | source2       | John |                  | 1993          | 2000      | 2005    | 2   | 11.11.2021  |

Example 1: BVR, with settings source1 = 50, source2 = 100 — the source2 record wins as a whole.
Example 2: BVT, with settings year_of_birth: source1 = 100, source2 = 50 — year_of_birth is taken from source1, name from source2, and city falls back to the non-null Saint-Petersburg value.

Note that we have computed data only for the period that had a data conflict, while the considered golden record spans multiple periods. The overall result for both algorithms, including data for all periods, is shown in Table 4.

Table 4: BVT and BVR result comparison

BVR:

| etalon_id | name | city             | year_of_birth | validFrom | validTo |
|-----------|------|------------------|---------------|-----------|---------|
| etalon1   | John |                  | 1993          | 1989      | 2000    |
| etalon1   | John |                  | 1993          | 2000      | 2005    |
| etalon1   | John | Saint-Petersburg | 1991          | 2005      | 9999    |

BVT:

| etalon_id | name | city             | year_of_birth | validFrom | validTo |
|-----------|------|------------------|---------------|-----------|---------|
| etalon1   | John |                  | 1993          | 1989      | 2000    |
| etalon1   | John | Saint-Petersburg | 1991          | 2000      | 2005    |
| etalon1   | John | Saint-Petersburg | 1991          | 2005      | 9999    |
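The per-attribute BVT selection, including the "null is not a value" rule, can be sketched as follows: for each attribute, candidate values are ordered by the attribute-specific weight (falling back to the source weight), and the first non-null value wins. This is an illustrative reconstruction of the described behavior, not the platform's code.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A candidate version: its source system and (possibly null) attribute values.
record Version(String sourceSystem, Map<String, String> attributes) { }

public class BvtExample {

    // BVT: per-attribute selection; attribute weights override source weights,
    // and null values are never picked ("null is not a value").
    static Map<String, String> bvt(List<Version> versions, List<String> attributeNames,
                                   Map<String, Integer> sourceWeights,
                                   Map<String, Map<String, Integer>> attributeWeights) {
        Map<String, String> golden = new LinkedHashMap<>();
        for (String attr : attributeNames) {
            Map<String, Integer> weights = attributeWeights.getOrDefault(attr, sourceWeights);
            versions.stream()
                    .sorted(Comparator.comparing(
                            (Version v) -> weights.getOrDefault(v.sourceSystem(), 0)).reversed())
                    .map(v -> v.attributes().get(attr))
                    .filter(value -> value != null)          // skip nulls, try the next version
                    .findFirst()
                    .ifPresent(value -> golden.put(attr, value));
        }
        return golden;
    }

    public static void main(String[] args) {
        // The conflicting period (2000-2005) from Table 3; source2 has no city value.
        Map<String, String> fromSource1 = new LinkedHashMap<>();
        fromSource1.put("name", "John");
        fromSource1.put("city", "Saint-Petersburg");
        fromSource1.put("year_of_birth", "1991");
        Map<String, String> fromSource2 = new LinkedHashMap<>();
        fromSource2.put("name", "John");
        fromSource2.put("city", null);
        fromSource2.put("year_of_birth", "1993");

        Map<String, String> golden = bvt(
                List.of(new Version("source1", fromSource1), new Version("source2", fromSource2)),
                List.of("name", "city", "year_of_birth"),
                Map.of("source1", 50, "source2", 100),
                Map.of("year_of_birth", Map.of("source1", 100, "source2", 50)));

        System.out.println(golden);   // {name=John, city=Saint-Petersburg, year_of_birth=1991}
    }
}
```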
6. Beyond MDM

Despite recent significant technological advances, human approaches to handling information have not really changed. This is also true in the case of MDM systems, in which interaction scenarios have stayed the same. Users still have to think about data layouts and procedures, e.g., define registries and referential tables, set up data pipelines, and so on. In other words, people solve problems in an imperative way, specifying everything that needs to be done in order to obtain answers. This approach requires a lot of effort, which often comes in the form of duplicated work, since processes share a large degree of similarity in many organizations.

At the same time, humans think in terms of problems such as: "Why did sales drop in the last quarter?" or "How did the introduction of the new discount system impact profits?". A declarative approach would suit these problems better, and thus there is a need for it.

There are two possible ways to achieve it — exhaustive standardization and machine learning. The former is very challenging, since there are too many details that should be reflected inside the standards. Companies and public institutions exist all over the world, and each has to conform to various local and international regulations. Standardizing them all would either require a coordinated effort of multiple governmental entities or lead to immense labor costs required to pay the workers who would perform this standardization.

Machine learning, on the other hand, will incur smaller costs and will not depend on any external collaboration. While a full-fledged AI with a natural language interface is a very distant vision, machine learning has already been successfully adopted in individual components of MDM systems: for example, semantic column type detection [12, 13, 14], database schema matching [15], duplicate detection [16, 17, 18], and various types of table autocompletion [19, 20].

Another promising direction is digital storytelling [21, 22, 23] — automatically extracting and presenting facts contained in the data in a human-friendly way. Employing such techniques will lower the qualification requirements and open analytics to a broader public.

There is also a novel class of so-called visual analytics [24, 25] systems that contain collaborative tools allowing users to employ various visualization primitives, machine learning models, and other objects. These are dragged onto a dashboard and connected to each other in order to construct pipelines. Thus, machine learning will be present not only inside such systems, but outside as well, i.e. users will be able to build and use custom machine learning models inside their decision-making pipelines.
7. Related Work

The established MDM market vendors [8] such as IBM, SAP, Informatica, and others offer a wide range of products for the creation of all types of MDM systems. However, their toolkits 1) were largely started as monolithic products, 2) are heavily oriented towards the vendors' own infrastructure, and 3) are frequently proprietary software which is not open-sourced. While the monolithic approach greatly simplifies the architecture, it has a number of drawbacks, such as hindering extensibility and thus making open-sourcing largely useless. Vendor orientation is not necessarily a bad thing, but the need to cope with a zoo of systems requires modern MDM products to be flexible in terms of the DBMS, search engine, BPMN implementation, and so on that are used. Not every customer is willing to add more dependencies, which may also entail additional expenses.

However, the next generation of MDM toolkits, such as Egeria (https://github.com/odpi/egeria), Fuyuko (https://github.com/tmjeee/fuyuko), AtroCore MDM (https://github.com/atrocore/atrocore), and many others, offer open-sourced versions and actively attempt to implement a modular architecture. The reason for this is the changes in the MDM landscape and emerging requirements. The contemporary environment favors, if not requires, modular and open products. Cloud-ready systems have become mainstream, and these properties are a must for ensuring extensibility.

Finally, when aiming for the organization pull strategy, one must prefer the latter approach, since such applications require increased flexibility. The Unidata platform aims for this niche and is therefore modular, open-source, and extensible.

8. Conclusion

In this paper we have presented the Unidata platform — a software product line intended for the creation of various MDM solutions. We have described its architecture, use cases, data storage, and query processing algorithms. Finally, we have shared our vision regarding the future of MDM systems.

Acknowledgments

We would like to thank Alexander Konstantinov and Roman Strekalovsky for their comments. We would also like to thank Anna Smirnova for her help with the preparation of the paper.
References

[1] V. Khatri, C. V. Brown, Designing data governance, Commun. ACM 53 (2010) 148–152. doi:10.1145/1629175.1629210.
[2] M. Jagals, E. Karger, F. Ahlemann, Already grown-up or still in puberty? A bibliometric review of 16 years of data governance research, Corporate Ownership & Control 19 (2021) 105–120.
[3] O. B. Nielsen, et al., Why governing data is difficult: Findings from Danish local government, in: Smart Working, Living and Organising, Springer International Publishing, Cham, 2019, pp. 15–29.
[4] DAMA International, DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition), Technics Publications, LLC, Denville, NJ, USA, 2017.
[5] R. W. Zmud, An examination of 'push-pull' theory applied to process innovation in knowledge work, Management Science 30 (1984) 727–738. URL: https://www.jstor.org/stable/2631752.
[6] M. Allen, D. Cervo, Multi-Domain Master Data Management: Advanced MDM and Data Governance in Practice, 1st ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2015.
[7] Software Product Lines. Carnegie Mellon Software Engineering Institute web site, https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=513819, 2022.
[8] S. Walker, S. Parker, M. Hawker, D. Radhakrishnan, A. Dayley, Magic Quadrant for Master Data Management. Gartner, ID G00466922, https://www.gartner.com/en/documents/3995999/magic-quadrant-for-master-data-management-solutions, 27 January 2021.
[9] D. Loshin, Master Data Management, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2009.
[10] A. White, The Five Vectors of Complexity That Define Your MDM Strategy. Gartner, ID G00276267, https://www.gartner.com/en/documents/3038017/the-five-vectors-of-complexity-that-define-your-mdm-stra, 27 April 2015.
[11] Global Product Classification (GPC). GS1 web site, https://www.gs1.org/standards/gpc, 2022.
[12] M. Hulsebos, et al., Sherlock: A deep learning approach to semantic data type detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '19, 2019, pp. 1500–1508.
[13] X. Deng, et al., TURL: Table understanding through representation learning, Proc. VLDB Endow. 14 (2020) 307–319.
[14] D. Zhang, et al., Sato: Contextual semantic type detection in tables, Proc. VLDB Endow. 13 (2020) 1835–1848.
[15] T. Sahay, A. Mehta, S. Jadon, Schema matching using machine learning, CoRR abs/1911.11543 (2019). arXiv:1911.11543.
[16] N. Barlaug, J. A. Gulla, Neural networks for entity matching: A survey, ACM Trans. Knowl. Discov. Data 15 (2021).
[17] W.-C. Tan, Deep data integration, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS '21, Association for Computing Machinery, New York, NY, USA, 2021, p. 2.
[18] Y. Li, et al., Deep entity matching: Challenges and opportunities, J. Data and Information Quality 13 (2021).
[19] S. Zhang, K. Balog, Web table extraction, retrieval and augmentation, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 1409–1410.
[20] S. Zhang, K. Balog, Web table extraction, retrieval, and augmentation: A survey, ACM Trans. Intell. Syst. Technol. 11 (2020).
[21] F. El Outa, et al., Towards a conceptual model for data narratives, in: Conceptual Modeling, Springer International Publishing, Cham, 2020, pp. 261–270.
[22] P. Vassiliadis, P. Marcel, S. Rizzi, Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP (long version), CoRR abs/1812.07854 (2018). arXiv:1812.07854.
[23] P. Vassiliadis, P. Marcel, S. Rizzi, Beyond roll-up's and drill-down's: An intentional analytics model to reinvent OLAP, Inf. Syst. 85 (2019) 68–91. doi:10.1016/j.is.2019.03.011.
[24] E. Wu, Systems for human data interaction (keynote), in: D. Mottin, et al. (Eds.), Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA-Data 2021@VLDB'21), 2021.
[25] Z. Shang, et al., Davos: A system for interactive data-driven decision making, Proc. VLDB Endow. 14 (2021) 2893–2905.