Unidata — A Modern Master Data Management Platform
Sergey Kuznetsov1 , Alexey Tsyryulnikov1 , Vlad Kamensky1 , Ruslan Trachuk1 ,
Mikhail Mikhailov1 , Sergey Murskiy1 , Dmitrij Koznov2 and George Chernishev1,2
1 Unidata
2 Saint-Petersburg State University


Abstract
Organizations need to ensure the quality of data that is used for analytics and to maintain its consistency across multiple analytical and operational systems. Master data is a term that refers to domain-specific data concerning business objects crucial for organization operation, e.g., contracts, suppliers, employees, and so on. Usually, such source data is scattered around different applications across the organization and is of varying quality. Master Data Management (MDM) is a set of practices, information management methods, and data management tools intended for producing accurate, consistent, and complete master data. At the same time, data management tools play a vital role in setting up and supporting MDM-related processes.
In this paper, we describe the Unidata platform: a toolkit for constructing MDM solutions. Its modular architecture makes it possible to construct solutions tailored to a specific domain and its requirements.
We start with a short introduction to MDM, discussing its aims, user-facing benefits, and approaches to working with data. Then, we describe the architecture of the Unidata platform and present the data storage approach and query processing algorithms. Finally, we discuss use-cases and put forward our thoughts regarding the future of MDM systems.

Keywords
Master Data Management, Data Platform, Data Cleaning, Information Integration, Golden Record, Registry



Published in the Workshop Proceedings of the EDBT/ICDT 2022 Joint Conference (March 29-April 1, 2022), Edinburgh, UK.
Contact: sergey.kuznetsov@unidata-platform.ru (S. Kuznetsov); alexey.tsyryulnikov@unidata-platform.ru (A. Tsyryulnikov); vlad.kamensky@unidata-platform.ru (V. Kamensky); ruslan.trachuk@unidata-platform.ru (R. Trachuk); mikhail.mikhailov@unidata-platform.ru (M. Mikhailov); sergey.murskiy@unidata-platform.ru (S. Murskiy); d.koznov@spbu.ru (D. Koznov); g.chernyshev@spbu.ru (G. Chernishev)
Homepage: https://www.math.spbu.ru/user/chernishev/ (G. Chernishev)
ORCID: 0000-0001-7509-7888 (S. Kuznetsov); 0000-0001-6752-6742 (A. Tsyryulnikov); 0000-0002-9907-8945 (R. Trachuk); 0000-0001-6202-9034 (S. Murskiy); 0000-0003-2632-3193 (D. Koznov); 0000-0002-4265-9642 (G. Chernishev)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).

1. Introduction

Contemporary companies and public institutions manage huge volumes of data, which is now a strategic asset [1]. Data has become an enabler of organization business models and value propositions [2]. At the same time, each organization usually possesses a complex data ecosystem, which makes it hard to make use of the information contained in it. For example, a recent study [3] concerning local governmental organizations in Denmark showed that one of the significant obstacles to using this data is the lack of its overview. Thus, if providing even an overview is hard, then efficient use will require much more effort, which will include ensuring data quality (duplicate detection and data conflict resolution), setting up ETL pipelines, providing proper metadata management, tracking data lineage, and so on.

Master Data Management (MDM) solutions aim to address these technical issues by consolidating available information while interacting with existing systems of the organization in a minimally intrusive way. As a research discipline, MDM is an actively developing area that concerns all aspects of enterprise data management practices. Its practical counterpart is closely monitored by Gartner, which releases yearly reviews describing established and promising products. The recent publication of the 2nd edition of the fundamental reference [4] describing all aspects of MDM is also worth noting.

Large organizations that are principal customers of MDM vendors have many individual characteristics, and, consequently, individual pressing tasks. Therefore, by implementing MDM they aim to fulfill these tasks, thus adhering to the organization pull strategy [5]. The other strategy — technology push — is essentially large-scale adoption of a new technology based on the belief in its usefulness instead of focusing on particular tasks at hand. The organization pull approach significantly reduces the cost of MDM adoption by narrowing its application area. Moreover, its iterative nature enables progressive build-up of MDM functionality via step-by-step incorporation of different organization business areas. However, implementing MDM in the organization pull manner requires flexibility of the MDM toolkit used to build the solution, and major MDM toolkits fail to provide it since initially they were built as monolithic systems.

Usually, the structure of large organizations is the result of lengthy historical processes, which may include
several acquisitions and mergers of other companies or organizations. Another important feature of such organizations is a unique software ecosystem comprised of core technologies from various vendors (e.g. SAP, IBM, Microsoft). Next, specific priorities and issues of the organization should be taken into account as well. Finally, there are various external constraints which an organization has to follow and which define the structure of the business processes. For example, organizations in different countries may have different rules, e.g. either a declaratory or an authorization system.

Next, the need to fulfill specific tasks of the organization leads to a specific subset of the MDM functionality of the solution being "clipped out" from an abstract general-purpose MDM toolkit. It may be due to tuning to the organization software ecosystem or implementing specific functionality. Such functionality usually concerns areas that border on MDM or that may be completely external to MDM. Obviously, organizations do not want to have functionality and code that they do not need for the task at hand. Finally, there is a trend of implementing various enterprise applications in the cloud in order to lower expenses.

Multi-domain MDM, i.e. MDM covering several key business entity areas (products, vendors, employees and so on), is the state-of-the-art [6] approach. Aside from the aforementioned generic domains there may be a number of business-specific ones, such as patients in case of a hospital or an inventory of land plots managed by forestry. Furthermore, each customer may have its own set of domains, with its specifics. In order to support combinations of different domains, an MDM platform should be able to describe them inside itself and be sufficiently independent from any particular one.

In this paper we describe the Unidata platform — a core technology of an industrial product line [7] that was used to produce dozens of target MDM solutions for large organizations. These solutions are focused on the specific needs of organizations such as specialized business segments, heterogeneous IT infrastructure, and particular business tasks. The platform has received an honourable mention by Gartner in the "Magic Quadrant for Master Data Management Solutions 2021" [8].

The architecture of the Unidata platform aims to address the above-mentioned concerns. Its core ideas are as follows.

1. Tune-ability. An MDM platform should be adaptable to various software ecosystems, take into account specific priorities and issues of the concrete organization, and comply with external constraints. This will ensure that target MDM solutions meet the organization needs.
2. Component-based architecture. In order to provide the required flexibility, the MDM platform architecture has to follow the component-based principle. The platform should consist of modules — i.e., building blocks which can be freely combined together in order to create efficient and specialized MDM solutions.
3. Metamodeling. A metamodel describing models of individual areas can be used to handle combinations of different domains. This will provide the product with the required extensibility.

Building an MDM platform upon these principles makes it possible to achieve synergy between individual MDM instruments such as data quality, data provenance, metadata management, and so on. That is, building a solution from tools that are intended to work together from the start (as opposed to integrating a set of different tools) will result in shorter development times and solutions of better quality. Furthermore, it will be possible to deliver different versions of solutions with different functionality (e.g. standard vs enterprise edition) and prices. The overall idea of this approach is presented in Figure 2. The core set of modules has been released as an open-source version¹ of the platform. It is published under the GNU GPLv3 license.

¹ https://gitlab.com/unidata-community-group/unidata-platform-deploy

Overall, the contributions of this paper are:

1. A description of the Unidata platform architecture.
2. A discussion of data storage and processing algorithms.
3. A list of use-cases and our vision of the future of MDM.

The structure of this paper is as follows. We start with the description of the background in Section 2. In it, we provide basic definitions and describe user roles. Next, in Section 3 we discuss several use-cases in order to illustrate benefits that can be obtained by using such systems. All presented use cases are real and describe actual MDM deployments that were driven by the organization pull strategy. Section 4 contains the description of the platform architecture and its modules. The way data is stored
and how queries are processed is discussed in Section 5. Next, we present our view on the future of MDM systems and describe related work in Sections 6 and 7. Finally, we conclude this paper with Section 8.

Figure 2: Modular structure of the Unidata platform

2. Background

2.1. Master Data Management

According to David Loshin [9], Master Data "are those core business objects used in the different applications across the organization, along with their associated metadata, attributes, definitions, roles, connections, and taxonomies". Examples are product data, customer data, supplier data, location data, party data, reference data, asset data, employee data, ledger data, and vendor data.

In turn, Master Data Management is "a set of practices, information management methods, and data management tools to implement the policies, procedures, services, and infrastructure to support the capture, integration, and subsequent shared use of accurate, timely, consistent, and complete master data".

The purpose of an MDM platform is to obtain data from source information system(s), then process it to ensure data quality by, for example, performing data deduplication, filling in missing values, removing outdated information, etc. Eventually, it must obtain a golden record for each item — an error-free version conforming to the defined quality criteria.

How exactly data is processed is defined by the MDM implementation architecture/style [6, 10] (sometimes called the data hub architecture). Gartner identifies four approaches, presented in Figure 1.

Figure 1: Approaches to handling data using an MDM solution

1. Registry. The hub does not contain the data itself, instead storing only the corresponding references (indexes). This approach is relevant for data that cannot be copied or "moved" for various (including legal) reasons.
2. Consolidation. Data is uploaded into the common repository on a regular basis, appropriately processed, and then the hub itself provides data-consuming systems with access to this data. However, new data is supplied by live data source systems, i.e., new data is uploaded to the hub on a regular basis.
3. Centralization. This architecture is very similar to the previous one, but here the hub takes over data upload as well: i.e., data is uploaded once, and then all changes are performed on the hub itself, thus turning all systems that initially were data sources into data consumers.
4. Coexistence. This architecture implements a combination of the Consolidation and Centralization for different master data of an organization. Additionally, if some data fragments are not "movable", they can be handled using the Registry.

All of them are supported by the Unidata platform.

2.2. Basic definitions

Golden (master) record. According to [6], one of the core goals of MDM systems is to create and maintain a single version of the truth for an entity. The information which constitutes it is stored in multiple sources and thus, it should be assembled. All information concerning a particular entity is called the golden record.

Data models and Metamodel. In order to support a particular domain, the objects which the platform will work with must be defined. A Metamodel consists of the description of the data itself (data schema) and related procedures. For example, it is possible to add a supplementary metamodel of data quality for a particular domain, e.g. specific duplicate detection procedures. In other words, a metamodel specifies how individual data models will be created and processed.

A registry is a collection of records that are related to some entity, such as a person, an organization, etc. This information comes from many different source information systems. A registry has a schema which consists of a list of its attributes and nested entities. Similarly to tables, registries can have references to other registries, e.g. suppliers and items: each supplier can have references to items which it sells.

A lookup table is a referential table that contains data which is rarely changed, but at the same time frequently used. For example, lookup tables may be created to list countries, timezones, currencies, and so on. Similarly to registries, lookup tables consist of attributes.
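To make these notions concrete, the following is a hypothetical sketch (in Java, with illustrative names only; it does not mirror the actual Unidata API) of how a registry, its attributes, a reference to another registry, and a lookup table could be described:

    import java.util.List;

    // Hypothetical metamodel sketch: a registry or lookup table is a named collection
    // of attribute definitions; registries may also declare references to other registries.
    // All names and types here are illustrative and do not mirror the Unidata API.
    record AttributeDef(String name, String type) {}
    record ReferenceDef(String name, String targetRegistry, String kind) {}
    record EntityDef(String name, boolean isLookupTable,
                     List<AttributeDef> attributes, List<ReferenceDef> references) {}

    class MetamodelSketch {
        public static void main(String[] args) {
            // A lookup table: rarely changed, frequently used reference data.
            EntityDef country = new EntityDef("Country", true,
                    List.of(new AttributeDef("isoCode", "string"),
                            new AttributeDef("name", "string")),
                    List.of());

            // A registry: supplier records coming from several source systems,
            // referencing the items it sells and the Country lookup table.
            EntityDef supplier = new EntityDef("Supplier", false,
                    List.of(new AttributeDef("name", "string"),
                            new AttributeDef("taxId", "string")),
                    List.of(new ReferenceDef("sells", "Item", "many-to-many"),
                            new ReferenceDef("country", "Country", "reference")));

            System.out.println(country);
            System.out.println(supplier);
        }
    }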
System of a Record (SOR). During the golden record assembly process, the same attribute may turn out to be stored in several sources, having different values. In this case, either a system of a record or a conflict resolution process should be set up. A SOR is a primary data source which contains the "true" value.

Validity Period is an interval during which the data describing an entity is valid. Each golden record may have several validity periods which should be taken into account while querying the data. Moreover, there are two temporal dimensions: the time of an event and the time of addition of this new version of information into the system. This leads to a need for a special scheme for managing this information.

Figure 3: Representation of the Saint-Petersburg name validity periods

Consider the example presented in Figure 3, which describes the history of Saint Petersburg's name changes. The Y-axis denotes update times, while the X-axis shows the validity periods. In this example, we assume that before 2019 the system contained only the basic version of the name information: from 1703 until the present time (2019), the city was called Saint Petersburg. Then, somewhere between 2019 and 2020, the knowledge of the Leningrad name was added into the system, and later, somewhere before 2021, a similar update was done for Petrograd.

A user may pose queries like "what was the name of the city in 1921?". There are two possible answers: up until 2020 (point 1) the system would have returned "Saint Petersburg", which was the correct answer at the time. Currently (point 2), it should output "Petrograd".

Thus, the golden record should be constructed on-the-fly, taking into account updates in two temporal dimensions, which is done by looking into the origin history (discussed later).

2.3. Humans in the MDM system

From a high-level perspective, an MDM system introduces a new role for a human user — a Data Steward. According to [9], a data steward is a person responsible for collecting, collating, and evaluating issues and problems with data and the data life cycle. Their duty includes managing standard business definitions and metadata. Finally, a data steward is the person who is accountable for the quality of the data. However, in practice, a data steward is not necessarily a single person, but rather a group of several people. Each of them may be responsible for a different key business area or even a part of it. The Unidata platform supports two types of users that implement data stewardship:

• Administrator manages the platform in general. Their duties include data model administration, classifier² administration, managing rules of duplicate detection, and so on.
• Data operator manages records (e.g. database population), registries and lookup tables, and participates in related business processes.

² In this paper, "classifier" refers mainly to a product class hierarchy, such as the Global Product Classification [11].

Consequently, there are two types of interfaces, one for each type of user. Finally, the Unidata platform supports a role-based access model, where each user may be assigned rights to perform operations (e.g. CRUD) on various objects.

3. Use Cases

All of the presented use-cases are real projects that were completed using the Unidata platform, driven by the organization pull strategy. They clearly illustrate the concrete goals of organizations, as well as the benefits the platform has provided to the final users.

They also demonstrate that each particular deployment needs a different set of data governance tools, i.e. they highlight the importance of the component-based architecture, which we have discussed in the Introduction.

3.1. Case 1: managing logistical resources

The first case is a system for managing logistical resources of a large company in the energy sector. The
covered domains included raw materials, equipment, and replacement parts. The purpose of this system was to:

• Provide high-quality data for business processes that cover equipment maintenance and repairs, and inventory management.
• Consolidate available information using data standardization and unification.

Additionally, this project succeeded at automating complex company regulations that involved more than ten different divisions. Furthermore, a classifier of logistical resources was deployed.

3.2. Case 2: product catalogue for a telecommunication company

The next case is a product catalogue development for a large telecommunication company. The goals of the project were to consolidate:

• information regarding product offers for different customer segments,
• information related to service availability,
• financial information from the billing system and accounting records.

As a result, a product hierarchy (a product tree) that contains various product details (including financial ones) was constructed. It is now utilized by the sales department and financial officers.

3.3. Case 3: data consolidation for a transport company

This project was dedicated to the consolidation of items and services purchased by a large transport company. The goal was to unify information contained in several different product classifiers and to produce a list of services offered by contractors.

The stakeholder of this MDM solution was the procurement department of the company. After the deployment, items and services that had different prices were identified and analyzed. This resulted in the creation of monetary metrics, i.e., calculated total savings on purchases. The system made it possible to perform automatic purchases at the minimal available price.

This data consolidation project relied on the publish/subscribe model.

3.4. Case 4: data enrichment for a fashion vendor

This project concerned an MDM system for a company that sells fashion products. The system needed to perform client base segmentation and sales support in the premium segment.

The aim of the system was to find those clients in the client database that have popular social media accounts and a lot of followers. The company wanted to improve their loyalty by offering additional discounts and performing various other actions in order to obtain more customers from their follower bases. The core data governance tools that were used were data enrichment and consolidation.

3.5. Case 5: smart personal account

The goal of this project was to create a smart personal account of a city resident. It was necessary to integrate it with various federal and regional information systems. The reason for this was to enable exporting relevant information concerning vehicles, real estate, bank accounts, and so on. The primary focus of this project was access control, security, and ensuring real-time as well as publish/subscribe-based master data acquisition.

3.6. Case 6: energy and heavy industry company

This project was developed for a multi-sector international company focused on energy and heavy industry areas. Since this company has hundreds of thousands of clients from all over the world, the procedure of adding a new client was very complicated. Before applying our MDM solution, it took 21 days on average; after the deployment, it shortened to only eight days. The solution automated various checks, finding final beneficiaries in corporate hierarchies, and centralized information input.

This project mainly concerned data inventory, data quality (duplicate search), and implementing real-time access to master data in order to speed up a particular business process.

4. Architecture

Now, let us turn to the architecture of the Unidata platform. It follows the component-based principle, which means its building blocks (modules) can be freely combined with each other in order to obtain a solution with the desired set of features.

4.1. Preliminaries

A module is a self-sufficient set of functionality that is intended for solving a particular problem. Each module contains a number of services that cover parts of this functionality. For example, the Meta module, which encompasses all metadata-related activities, contains services that cover managing lookup tables, registries, units of measurement, enumerations, etc.
Modules have rules of creation (a contract) and behavior, and they can interact with other modules by being part of a pipeline.

In a broad sense, a Pipeline is a sequence of operations which are performed either on data or on the model. Pipelines implement dataflows, which consist of service calls and may contain utility nodes such as branching, parallelism (applying an operation to each record inside the batch), calling another pipeline, and so on.

Each module contains services that work with the data and services that interact with and modify the model. Therefore, pipelines may modify not only the data, but the model itself too.

Thus, modules and pipelines implement the tune-ability and composition aspects discussed in the Introduction section.

4.2. Architectural Overview

Figure 4: Architecture of the Unidata Platform

The overall architecture of the Unidata platform is presented in Figure 4. It consists of four main components:

1. Platform Core contains basic modules that implement core services of the platform and minimally depend on other modules. These services are: system boot, job batches, and data types. Modules of this component are rarely modified and are never adapted for a particular deployment or domain.
2. Storages concern everything related to data stores that are employed by the platform. These modules make it possible to abstract functionality required by the platform from the specific DBMS and to restrict and simplify it (as not all DBMS functionality is required by an MDM solution).
3. MDM includes all modules that implement basic MDM functionality, such as metadata management, rules for computing master record alternatives, data quality management, duplicate detection, and business process implementation. This component is responsible for integrating and synchronizing all necessary data stores that reside on the previous level.
4. Extra MDM implements advanced MDM functionality. It contains modules that are either advanced variants of some existing module from the MDM component, or modules that provide functionality usually implemented "outside" of MDM. An example of the first case is the following: Match Extra is a sophisticated machine learning-based inexact match module, while the
match module in the MDM component is a straightforward exact one. The second case is illustrated by the data delivery set. Usually, in MDM systems the ETL is implemented separately as a standalone application. In our case, it is possible to have it inside the system. The same idea is employed with the Pub/Sub module, which makes it possible to implement sophisticated patterns of sending records to various consumers. This is an advanced component which is not present in the open-source version.

These components are organized in a hierarchical way, which means that:

• components residing in the bottom levels of the figure are more low-level; they contain basic features essential for implementing high-level functionality, and,
• components interact with each other in a hierarchical way, i.e. their interactions rarely "jump" over the immediate neighbor.

4.3. Entities, modules, and their relationships

In this section, we will overview core entities that the system works with, as well as important modules and their functionality relevant to the entities.

System. All modules that constitute the platform share a single interface that contains methods of initialization, configuration, launch, and verification. The system module orchestrates the system boot process and ensures that all modules have everything necessary for trouble-free launch and operation.

Data Types. An attribute is a basic entity representing some key aspect of a stored object, similar to attributes in RDBMS tables. In Unidata, registries, lookup tables, and references may have attributes.

There are three attribute types in the system:

1. Simple attribute is a basic data type which describes some entity. Its type can be: string, numeric, boolean, file, date and time, reference, and enum. Enum is a domain-specific enumeration which describes mutually exclusive states of a real-life entity, e.g. a subject which can be either a legal person, a natural person, or a self-employed person.
2. Array attribute is used to represent a series of similar entities, such as property owners.
3. Complex attribute is used to represent nested tree-like structures. It can contain simple attributes, arrays, and other complex attributes.

The Unidata platform also supports references, which can be of the following types:

1. Reference — a reference which ensures that for each individual record there can be only one referenced object per each validity period.
2. Many-to-many — a classic many-to-many reference.
3. Contains — a reference which is created by the user pointing to an entity with all attributes filled in. This type is used to define entities which do not exist without their parent entity.

The platform lets the user browse both direct and backward references (those that point to a specific record). It also provides rich search capabilities that allow the user to query attributes of references.

The Core module contains interfaces and abstract implementations for all data types that the platform uses. It also implements metamodel support, the need for which was discussed in the Introduction. Finally, this module supports additional services which may be used by other modules, such as roles, logging, license verification, and so on.

The next important entity is a draft. Drafts are an essential part of MDM since master data needs to be synchronized over data producers (sources) and data consumers. Such synchronization frequently results in the creation of temporary intermediates. These need to pass various reconciliation procedures and be agreed upon in order to become fair copies. Drafts may also emerge as a result of conflicts that arise during the record consolidation process, after various data quality procedures, etc.

Drafts may concern individual items or the model. Drafts can have revisions: all of them are stored in the database in a serialized form.

Therefore, drafts need general support: operations such as creation, publication (transforming into a fair copy), and merge must be implemented. Draft is a module that enables all kinds of drafts in the platform.

Logging capabilities. The Unidata platform supports extensive and configurable logging capabilities. It is possible to record user actions such as record upsert, user login or logout, etc. It is also possible to select the logging level. Coarse-grained logging logs errors only, while fine-grained logging also covers search and browsing.

Storages and Adapters. For its operation, Unidata needs a number of storages, e.g. data storage, graph storage, match storage, and so on. In order to achieve flexibility and independence from a particular database vendor, a collection of adapters was implemented. For example, the platform can use either Neo4J or OrientDB for graph storage.

Match storage is a module that concerns data representation for performing data deduplication. There is a separate representation for both records and clusters that differs from the original data by, for example, omitted fields that do not participate in the matching. For the deduplication itself, matching engines such as Senzing, Elasticsearch or an RDBMS can be used.
Match and Match Extra. While match storage implements basic matching functionality, these modules implement deduplication for MDM entities: the data itself, drafts, business processes, etc. The Match Storage module does not possess any specifics regarding these entities, and therefore separate modules are needed. These modules also let the user define deduplication procedures and manage them. The Match module operates with rules, while Match Extra employs various machine learning approaches.

Workflow and Workflow Extra. In MDM, it is frequently necessary to implement various business processes that may involve various officials and span multiple departments (e.g. resolving a surname conflict in various personal documents). To run such workflows, a subset of the BPMN 2.0 standard that includes events, activities, and gateways is supported. For this, integration with several third-party engines such as Activiti BPMN and Camunda BPMN is implemented.

Workflow Extra extends this basic capability by enabling the implementation of various scenarios involving machine learning approaches. It also provides the ability to perform SLA enforcement for data operations.

Meta is a module that is responsible for metamodel (schema) management, such as creating and editing entities (attributes, references, etc). It provides a GUI that allows users of the MDM solution (data stewards) to work with the metamodel while automatically generating the corresponding API for data access.

Data Quality and Complex Data Quality. These modules are responsible for ensuring data quality — checking data for errors and performing data enrichment. While the Data Quality component works with a single record, Complex Data Quality may involve several records (e.g. checking aggregate values).

The core instruments are the enrichment and validation rules. The enrichment rule enables using attributes of other registries in order to generate new attributes of the target registry. The validation rule does not generate any data; instead, it generates an error if it is violated by the input. Each rule can have a number of ports which may be either incoming or outgoing. Rules take input from the incoming ports and produce results into the outgoing ones.

In the UI, a user can construct rules using a special UPath language (an XPath derivative) and a number of pre-defined functions such as string manipulation functions (case conversion, concatenation), a boolean function, and so on. It is possible to upload custom rules that are implemented in Java or Python.

Finally, rules can be grouped into sets and applied on a per-set basis.
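The following is a hedged sketch of what an uploaded custom validation rule could look like (hypothetical interface and port names; the platform's actual rule API is not shown in this paper): the rule reads a value from an incoming port and reports an error through an outgoing port.

    import java.util.Map;
    import java.util.Optional;

    // Hypothetical sketch of a port-based validation rule, as described above:
    // input arrives on incoming ports, results are produced on outgoing ports.
    // The interface and port names are illustrative and not the Unidata API.
    interface ValidationRule {
        // Returns an error message on the "error" port, or empty if the input is valid.
        Optional<String> apply(Map<String, Object> incomingPorts);
    }

    class YearOfBirthRule implements ValidationRule {
        @Override
        public Optional<String> apply(Map<String, Object> incomingPorts) {
            Object raw = incomingPorts.get("year_of_birth");    // incoming port
            if (raw == null) {
                return Optional.empty();                        // missing values are not this rule's concern
            }
            int year = Integer.parseInt(raw.toString());
            if (year < 1900 || year > 2022) {
                return Optional.of("year_of_birth out of range: " + year); // outgoing "error" port
            }
            return Optional.empty();
        }
    }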
Classifiers. This module employs classifiers for master data. Classifiers are tree-like data structures in which each node describes some entity and may contain various attributes (e.g. see [11]). They are frequently used in MDM domains, for example, to describe a hierarchy of product types. The Unidata platform enables interactions of master data with such classifiers (e.g. arranging arriving data according to a classifier) and supports various operations on the classifiers themselves, with versioning.

Data Catalog implements a comprehensive body of knowledge concerning all data of the organization. It includes provenance, current storage location, its domain, use, relation to other data, and so on. This module is of critical importance since an organization may have hundreds of source information systems and manual tracking of such information would be impossible.

5. Data Storage and Processing

5.1. Requirements

The specifics of the platform's application scope lead to the following requirements imposed on storage and processing of the data.

1. Tombstone deletes. Information should be deleted only when it is absolutely necessary. Real deletions should be performed only by an administrator, while regular users should not actually delete data. Instead, if a regular user attempts to perform a deletion, the data should be marked as deleted and the system should take this into account.
2. Versioning support should be pervasive. Users and administrators should be allowed to reconstruct previous versions of any object. At the same time, querying using validity periods, discussed earlier in Section 2.2, should be supported.
3. Provenance (traceability) should be provided for any operation. For example, if a bulk-loading operation inserted records into a database, it should be possible to reverse it by removing newly-inserted records while keeping the rest.

These points should be fulfilled for all data handled by the system, even for manually entered data.

5.2. Storage

In order to meet these requirements, an approach based on the following three logical tables was used. Each table represents an entity:

• Etalon is the metadata of the golden record itself.
• Origin is the metadata related to the system from which the record originates.
• Vistory (version history) is the validity period of an origin, which in turn may have revisions.

Tables describing these entities share the following attributes.

1. Id — a unique identifier of an object.
2. Shard — an identifier of the shard where the record resides. Each of these logical tables may be physically represented by a collection of horizontal partitions (shards).
3. Status — may be ACTIVE, INACTIVE, or MERGED. These are used to describe the status of the record and to support tombstone deletes. Thus, INACTIVE means that the record was deleted, while ACTIVE indicates that it is valid. The MERGED status indicates records that were used in the duplicate resolution process and no longer contain valid information.
4. Create_date and Created_by — when and by whom this object was created.

The relations between these tables are shown in Figure 5. The links with empty arrowheads denote "shared" attributes and full arrowheads show a PK-FK relationship.

Figure 5: Tables used for data storage in the Unidata platform

The etalon table additionally stores:

1. Name of the registry or lookup table it refers to.
2. Update_date and update_by attributes, which contain the date and the last user who updated this data, respectively.
3. Operation_id attribute, which contains the identifier of the operation during which this record was created. It is needed to support the provenance requirement.

Apart from the basic set of attributes, the origins table has the following additional ones:

1. Etalon_id — an identifier of the etalon to which this origin belongs.
2. Source_system — an identifier of the system where this record came from.
3. External_id — an identifier of the record in the system where this record came from.
4. Enrichment — a boolean flag which shows whether this origin is a result of an enrichment rule.
5. Similarly to etalon, attributes update_date and update_by that bear the same semantics, but pertaining to this particular origin.

Finally, the vistory table contains the data itself. Its important attributes (apart from the basic ones) are the following:

1. Origin_id — a reference to the origin whose vistory entry is contained in this record.
2. Revision — a version number, which is needed to ensure versioning, described in Section 5.1.
3. ValidFrom and validTo attributes — the validity period of the vistory entry.
4. Data_b — serialized data in the XML format.
5. Operation_id attribute, similar to the one in the etalon table. It can be used to, for example, find and cancel a modification action. However, in this case, cancelling will affect only this particular update, while in the etalon case it will cancel the creation of the whole record.

Note that the vistory table contains no attributes concerning updates, since for each update a new record is formed. Next, setting the status attribute to INACTIVE in the case of the vistory table makes it possible to mark some validity period as "deleted".
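A minimal sketch of these three logical tables as plain Java records, using the attribute lists above (types are illustrative; this is not the platform's persistence code):

    import java.time.Instant;
    import java.time.LocalDate;

    // Minimal sketch of the etalon -> origin -> vistory layout described above.
    // Only the attributes discussed in the text are shown; types are illustrative.
    enum Status { ACTIVE, INACTIVE, MERGED }

    record Etalon(String id, int shard, Status status,
                  Instant createDate, String createdBy,
                  String registryName,
                  Instant updateDate, String updatedBy,
                  String operationId) {}

    record Origin(String id, int shard, Status status,
                  Instant createDate, String createdBy,
                  String etalonId, String sourceSystem, String externalId,
                  boolean enrichment,
                  Instant updateDate, String updatedBy) {}

    record Vistory(String id, int shard, Status status,
                   Instant createDate, String createdBy,
                   String originId, int revision,
                   LocalDate validFrom, LocalDate validTo,
                   String dataB,            // serialized record data (XML)
                   String operationId) {}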
To illustrate the idea, in Table 1 we provide a vistory table fragment for the city name example described in Figure 3. There are three records and all of them belong to a single origin. Therefore, there are no data conflicts possible and the process of calculating the etalon data is rather straightforward. First, it is necessary to perform a cross-product of all time periods, and then to select the row that fits the necessary combination of creation and queried dates. The result of the period cross-product for the considered example is shown in Table 2. Thus, it is possible to find an answer for points 1–4.

Table 1
Vistory representation of the city name example from Figure 3

city_name          rev   validFrom    validTo      createDate
Saint-Petersburg   1     27.05.1703   31.12.9999   01.01.2018
Leningrad          2     26.01.1924   6.09.1991    01.06.2019
Petrograd          3     1.09.1914    26.01.1924   01.06.2020

Table 2
Validity periods calculation

city_name          rev   validFrom    validTo      createDate
Saint-Petersburg   1     27.05.1703   1.09.1914    01.01.2018
Petrograd          2     1.09.1914    26.01.1924   01.06.2020
Leningrad          3     26.01.1924   6.09.1991    01.06.2019
Saint-Petersburg   1     6.09.1991    31.12.9999   01.01.2018
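The selection over the two temporal dimensions can be sketched as follows. Instead of materializing the full period cross-product, this simplified illustration directly picks, among the vistory entries known at a given date, the most recently added one whose validity period covers the queried date (field names follow Table 1; this is not the platform's query code):

    import java.time.LocalDate;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Optional;

    // Simplified sketch of answering "what was the city called on date X, as known at date Y":
    // among the vistory entries already present in the system at the knowledge date,
    // take those whose validity period covers the queried date and prefer the most
    // recently added knowledge. Field names follow Table 1; not the platform's actual code.
    class BiTemporalLookupSketch {
        record Entry(String cityName, LocalDate validFrom, LocalDate validTo, LocalDate createDate) {}

        static Optional<String> nameAsOf(List<Entry> vistory, LocalDate queryDate, LocalDate knowledgeDate) {
            return vistory.stream()
                    .filter(e -> !e.createDate().isAfter(knowledgeDate))                 // known at that time
                    .filter(e -> !queryDate.isBefore(e.validFrom()) && !queryDate.isAfter(e.validTo()))
                    .max(Comparator.comparing(Entry::createDate))                        // latest added knowledge wins
                    .map(Entry::cityName);
        }

        public static void main(String[] args) {
            List<Entry> vistory = List.of(
                    new Entry("Saint-Petersburg", LocalDate.of(1703, 5, 27), LocalDate.of(9999, 12, 31), LocalDate.of(2018, 1, 1)),
                    new Entry("Leningrad",        LocalDate.of(1924, 1, 26), LocalDate.of(1991, 9, 6),   LocalDate.of(2019, 6, 1)),
                    new Entry("Petrograd",        LocalDate.of(1914, 9, 1),  LocalDate.of(1924, 1, 26),  LocalDate.of(2020, 6, 1)));

            LocalDate query = LocalDate.of(1921, 1, 1);
            System.out.println(nameAsOf(vistory, query, LocalDate.of(2019, 1, 1))); // Optional[Saint-Petersburg] (point 1)
            System.out.println(nameAsOf(vistory, query, LocalDate.of(2021, 1, 1))); // Optional[Petrograd] (point 2)
        }
    }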



entries INACTIVE since they will never be reachable for queries.

The same approach is used to represent not only records, but other entities as well, e.g., references and classification results. It is possible to create a golden record for a reference. Consider a case where a reference has two origin systems: in the first system there is a set1 of values and in the second a set2. Using the BVR or BVT algorithm, it is possible to create a golden record by, for example, intersecting them.

5.3. Query Processing: BVR and BVT algorithms

Each vistory entry has a date of creation and a validity period, which are used to construct golden records, as was demonstrated in the previous section. However, what if two origin systems have the same attribute and there is a data conflict, i.e., each system reports a different value? In other words, how is a SOR selected for this attribute?

For this, two special algorithms, the BVR (Best Value Record) and the BVT (Best Value of the Truth), were devised. The BVR algorithm is used to construct a golden record by resolving data conflicts for all attributes of an etalon using a set of weights, one for each origin system. The general idea of this algorithm is to pick the values of the source system that has the highest weight.

More formally, the BVR algorithm is as follows:

   1. For each origin, obtain its latest version, except the ones that have the MERGED status.
   2. For each source_system, select the latest version according to its creation_date.
   3. The golden record is created out of the record that pertains to the source_system with the maximum weight.

The BVT algorithm is used to construct the golden record when the system administrator wishes to form it on a per-attribute basis. The algorithm itself is as follows:

   1. Similarly to the BVR, obtain the latest version for each origin, except origins that have the MERGED status.
   2. For each attribute, sort the obtained versions according to the weights of the sources and the update_date.
   3. Compute the value of each attribute by iterating over the versions obtained on the previous step:
      a) if the value of the attribute is not null, then use it for etalon construction;
      b) otherwise, proceed to the next version.

Note that the BVT algorithm is meant to be more robust to null values and therefore handles them differently.

To illustrate both algorithms, let us consider the example presented in Table 3. In this table, we denote the entities selected on each step in bold. The first seven rows contain the data itself. Note that for presentation purposes we have joined all three tables that contain it.

Our first step (regardless of the selected algorithm) is to select the most recent vistories for each origin, which is done inside the DBMS. Rows 15–16 contain the answer.

On the second step, it is necessary to compute the validity periods, which is done in Java code (all subsequent computations are performed in Java code, too). Applying a cross-product to the validFrom and validTo attributes, we obtain three periods: (1989–2000), (2000–2005), (2005–9999). Rows 20–23 contain our data partitioned by validity periods.

One can note that there is a data conflict: rows 20 and 23 concern the same period and contain different values. The periods of rows 21 and 22 have no alternatives and therefore may be used as is.

In order to resolve the conflict, the BVR algorithm requires the weights of the source systems. Suppose that they are as follows: source1 = 50, source2 = 100. In this case, the record on row 32 will be selected with all its attributes.

The BVT algorithm additionally requires a set of attribute weights. In this example, we have weights for a single attribute, year_of_birth, which are as follows: source1 = 100, source2 = 50. These attribute weights override the source weights that act over the whole record. Therefore, the year_of_birth attribute will be set to 1991, while the name attribute will be set to John.

The BVT algorithm also follows the “null is not a value” rule: it never picks a null value, even if an attribute rule would dictate it. This is why we select Saint-Petersburg for the city attribute. At the same time, this rule is absent in the BVR case, as was demonstrated in row 31.

Note that we have computed data only for the period that had a data conflict, while the considered golden record spans multiple periods. The overall result of both algorithms, which includes data for all periods, is shown in Table 4.
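For concreteness, the period computation performed on the second step can be sketched in Java as follows. This is only a minimal illustration: the Vistory and Period records, the field names, and the exact dates are assumptions made for the sketch (the example above specifies only years), and the periods are derived by sorting interval boundaries rather than by a literal cross-product; the platform's actual code may differ.

import java.time.LocalDate;
import java.util.*;

// Minimal sketch: splitting vistory validity intervals into disjoint periods.
// Vistory and Period are illustrative records, not the platform's actual classes.
public class PeriodSplitter {

    record Vistory(String origin, LocalDate validFrom, LocalDate validTo) {}
    record Period(LocalDate from, LocalDate to) {}

    // Collect all interval boundaries and turn each pair of consecutive
    // boundaries into a period.
    static List<Period> split(List<Vistory> vistories) {
        TreeSet<LocalDate> bounds = new TreeSet<>();
        for (Vistory v : vistories) {
            bounds.add(v.validFrom());
            bounds.add(v.validTo());
        }
        List<LocalDate> sorted = new ArrayList<>(bounds);
        List<Period> periods = new ArrayList<>();
        for (int i = 0; i + 1 < sorted.size(); i++) {
            periods.add(new Period(sorted.get(i), sorted.get(i + 1)));
        }
        return periods;
    }

    public static void main(String[] args) {
        // The two actual vistories selected on the first step (rows 15-16 of Table 3);
        // months and days are placeholders, since the example specifies only years.
        List<Vistory> actual = List.of(
            new Vistory("origin1", LocalDate.of(2000, 1, 1), LocalDate.of(9999, 12, 31)),
            new Vistory("origin2", LocalDate.of(1989, 1, 1), LocalDate.of(2005, 12, 31)));
        // Prints 1989-2000, 2000-2005, 2005-9999.
        split(actual).forEach(p ->
            System.out.println(p.from().getYear() + "-" + p.to().getYear()));
    }
}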
       Table 3
       BVT and BVR illustration
1    Initial data
2    etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
3    etalon1      origin1      source1         John     Moscow                  1991              2000           9999      1   9.11.2021
4    etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           9999      2   10.11.2021
5    etalon1      origin2      source2         John                             1992              1989           2005      1   8.11.2021
6    etalon1      origin2      source2         John                             1993              1989           2005      2   11.11.2021
7    1. Selecting actual vistories
8    etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
9    etalon1      origin1      source1         John     Moscow                  1991              2000           9999      1   9.11.2021
10   etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           9999      2   10.11.2021
11   etalon1      origin2      source2         John                             1992              1989           2005      1   8.11.2021
12   etalon1      origin2      source2         John                             1993              1989           2005      2   11.11.2021
13   Result
14   etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
15   etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           9999      2   10.11.2021
16   etalon1      origin2      source2         John                             1993              1989           2005      2   11.11.2021
17   2. Calculation of the validity periods.
18   Using cross-product to obtain the following periods: (1989-2000), (2000-2005), (2005-9999). Result in the vistory form:
19   etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
20   etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           2005      2   10.11.2021
21   etalon1      origin1      source1         John     Saint-Petersburg        1991              2005           9999      2   10.11.2021
22   etalon1      origin2      source2         John                             1993              1989           2000      2   11.11.2021
23   etalon1      origin2      source2         John                             1993              2000           2005      2   11.11.2021
24   3. Calculation of the golden record.
25   Records participating in consolidation:
26   etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
27   etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           2005      2   10.11.2021
28   etalon1      origin2      source2         John                             1993              2000           2005      2   11.11.2021
29   Example 1: BVR. BVR settings: source1 = 50, source2 = 100.
30   etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
31   etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           2005      2   10.11.2021
32   etalon1      origin2      source2         John                             1993              2000           2005      2   11.11.2021
33   Example 2: BVT. BVT Settings: year_of_birth: source1 = 100, source2 = 50
34   etalon_id origin_id source_system name city                              year_of_birth validFrom validTo rev            create_date
35   etalon1      origin1      source1         John     Saint-Petersburg        1991              2000           2005      2   10.11.2021
36   etalon1      origin2      source2         John                             1993              2000           2005      2   11.11.2021
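To make the weight-based consolidation concrete, the sketch below resolves the conflicting 2000–2005 period (rows 27–28 of Table 3) using the weights from the example. The Entry record, the method names, and the map-based weight tables are illustrative assumptions rather than the platform's actual API.

import java.util.*;

// Illustrative sketch of BVR and BVT conflict resolution for a single validity
// period. Entry and the weight maps mirror the example above; they are not the
// platform's actual classes.
public class Consolidation {

    // A simplified vistory entry: its source system plus attribute values (nulls allowed).
    record Entry(String sourceSystem, Map<String, String> attributes) {}

    // BVR: take every attribute from the entry whose source system has the highest weight.
    static Map<String, String> bvr(List<Entry> entries, Map<String, Integer> sourceWeights) {
        return entries.stream()
            .max(Comparator.comparingInt(e -> sourceWeights.getOrDefault(e.sourceSystem(), 0)))
            .orElseThrow()
            .attributes();
    }

    // BVT: for each attribute, walk the entries in descending attribute weight and
    // take the first non-null value ("null is not a value").
    static Map<String, String> bvt(List<Entry> entries,
                                   Map<String, Map<String, Integer>> attributeWeights,
                                   Map<String, Integer> sourceWeights) {
        Map<String, String> etalon = new HashMap<>();
        Set<String> attrs = new HashSet<>();
        entries.forEach(e -> attrs.addAll(e.attributes().keySet()));
        for (String attr : attrs) {
            Map<String, Integer> weights = attributeWeights.getOrDefault(attr, sourceWeights);
            entries.stream()
                .sorted(Comparator.comparingInt(
                    (Entry e) -> weights.getOrDefault(e.sourceSystem(), 0)).reversed())
                .map(e -> e.attributes().get(attr))
                .filter(Objects::nonNull)
                .findFirst()
                .ifPresent(value -> etalon.put(attr, value));
        }
        return etalon;
    }

    public static void main(String[] args) {
        // The two conflicting entries of the 2000-2005 period (rows 27-28 of Table 3).
        Map<String, String> fromSource1 = new HashMap<>();
        fromSource1.put("name", "John");
        fromSource1.put("city", "Saint-Petersburg");
        fromSource1.put("year_of_birth", "1991");
        Map<String, String> fromSource2 = new HashMap<>();
        fromSource2.put("name", "John");
        fromSource2.put("city", null);
        fromSource2.put("year_of_birth", "1993");
        List<Entry> entries = List.of(
            new Entry("source1", fromSource1), new Entry("source2", fromSource2));

        Map<String, Integer> sourceWeights = Map.of("source1", 50, "source2", 100);
        Map<String, Map<String, Integer>> attributeWeights =
            Map.of("year_of_birth", Map.of("source1", 100, "source2", 50));

        // BVR: source2 wins as a whole, so year_of_birth = 1993 and city stays null.
        System.out.println(bvr(entries, sourceWeights));
        // BVT: year_of_birth = 1991 (attribute weight) and city = Saint-Petersburg (null skipped).
        System.out.println(bvt(entries, attributeWeights, sourceWeights));
    }
}

Run on this period, the sketch reproduces the corresponding BVR and BVT rows of Table 4: BVR keeps the whole source2 record, including its null city, while BVT takes year_of_birth from source1 and skips the null city.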



6. Beyond MDM

Despite recent significant technological advances, human approaches to handling information have not really changed. This is also true in the case of MDM systems, in which interaction scenarios have stayed the same. Users still have to think about data layouts and procedures, e.g., define registries and referential tables, set up data pipelines, and so on. In other words, people solve problems in an imperative way, specifying everything that needs to be done in order to obtain answers. This approach requires a lot of effort, which often comes in the form of duplicated work, since processes share a large degree of similarity in many organizations.

At the same time, humans think in terms of problems such as: “Why did sales drop in the last quarter?” or “How did the introduction of the new discount system impact profits?”. A declarative approach would suit these problems better, and thus there is a need for it.

There are two possible ways to achieve it: exhaustive standardization and machine learning. The former is very challenging, since there are too many details that would have to be reflected in the standards. Companies and public institutions exist all over the world, and each has to conform to various local and international regulations. Standardizing them all will either require a coordinated effort of multiple governmental entities or lead to immense labor costs required to pay the workers who will perform this standardization.
Table 4
BVT and BVR result comparison
                    BVR
                    etalon_id    name     city                year_of_birth       validFrom      validTo
                    etalon1      John                         1993                1989           2000
                    etalon1      John                         1993                2000           2005
                    etalon1      John     Saint-Petersburg    1991                2005           9999
                    BVT
                     etalon_id    name     city                year_of_birth       validFrom      validTo
                    etalon1      John                         1993                1989           2000
                    etalon1      John     Saint-Petersburg    1991                2000           2005
                    etalon1      John     Saint-Petersburg    1991                2005           9999


Machine learning, on the other hand, will incur smaller costs and will not depend on any external collaboration. While a full-fledged AI with a natural language interface is a very distant vision, machine learning has already been successfully adopted in individual components of MDM systems, for example, in semantic column type detection [12, 13, 14], database schema matching [15], duplicate detection [16, 17, 18], and various types of table autocompletion [19, 20].

Another promising direction is digital storytelling [21, 22, 23]: automatically extracting and presenting facts contained in the data in a human-friendly way. Employing such techniques will lower the qualification requirements and open analytics to a broader public.

There is also a novel class of so-called visual analytics systems [24, 25] that contain collaborative tools allowing users to employ various visualization primitives, machine learning models, and other objects. These are dragged onto a dashboard and connected to each other in order to construct pipelines. Thus, machine learning will be present not only inside such systems, but outside as well, i.e., users will be able to build and use custom machine learning models inside their decision-making pipelines.

7. Related Work

The established MDM market vendors [8], such as IBM, SAP, Informatica, and others, offer a wide range of products for the creation of all types of MDM systems.

However, their toolkits 1) were largely started as monolithic products, 2) are heavily oriented towards the vendors' infrastructure, and 3) are frequently proprietary software which is not open-sourced. While the monolithic approach greatly simplifies the architecture, it has a number of drawbacks, such as hindering extensibility and thus making open-sourcing largely useless. Vendor orientation is not necessarily a bad thing, but the need to cope with a zoo of systems requires modern MDM products to be flexible in terms of the DBMS, search engine, BPMN implementation, and so on. Not every customer is willing to add more dependencies, which may also entail additional expenses.

In contrast, the next generation of MDM toolkits, such as Egeria (https://github.com/odpi/egeria), Fuyuko (https://github.com/tmjeee/fuyuko), AtroCore MDM (https://github.com/atrocore/atrocore), and many others, offer open-sourced versions and actively attempt to implement a modular architecture.

The reasons for this are the changes in the MDM landscape and emerging requirements. The contemporary environment favors, if not requires, modular and open products. Cloud-ready systems have become mainstream, and these properties are a must for ensuring extensibility.

Finally, when aiming for the organization pull strategy, one must prefer the latter approach, since such applications require increased flexibility. The Unidata platform aims for this niche and is therefore modular, open-source, and extensible.

8. Conclusion

In this paper we have presented the Unidata platform, a software product line intended for the creation of various MDM solutions. We have described its architecture, use-cases, data storage, and query processing algorithms. Finally, we have shared our vision regarding the future of MDM systems.

Acknowledgments

We would like to thank Alexander Konstantinov and Roman Strekalovsky for their comments. We would also like to thank Anna Smirnova for her help with the preparation of the paper.
References

 [1] V. Khatri, C. V. Brown, Designing data governance, Commun. ACM 53 (2010) 148–152. doi:10.1145/1629175.1629210.
 [2] M. Jagals, E. Karger, F. Ahlemann, Already grown-up or still in puberty? A bibliometric review of 16 years of data governance research, Corporate Ownership & Control 19 (2021) 105–120.
 [3] O. B. Nielsen, et al., Why governing data is difficult: Findings from Danish local government, in: Smart Working, Living and Organising, Springer International Publishing, Cham, 2019, pp. 15–29.
 [4] DAMA International, DAMA-DMBOK: Data Management Body of Knowledge (2nd Edition), Technics Publications, LLC, Denville, NJ, USA, 2017.
 [5] R. W. Zmud, An Examination of ’Push-Pull’ Theory Applied to Process Innovation in Knowledge Work, Management Science 30 (1984) 727–738. URL: https://www.jstor.org/stable/2631752. Publisher: INFORMS.
 [6] M. Allen, D. Cervo, Multi-Domain Master Data Management: Advanced MDM and Data Governance in Practice, 1st ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2015.
 [7] Software Product Lines, Carnegie Mellon Software Engineering Institute Web Site, https://resources.sei.cmu.edu/library/asset-view.cfm?assetid=513819, 2022.
 [8] S. Walker, S. Parker, M. Hawker, D. Radhakrishnan, A. Dayley, Magic Quadrant for Master Data Management Solutions, Gartner, ID G00466922, https://www.gartner.com/en/documents/3995999/magic-quadrant-for-master-data-management-solutions, 27 January 2021.
 [9] D. Loshin, Master Data Management, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2009.
[10] A. White, The Five Vectors of Complexity That Define Your MDM Strategy, Gartner, ID G00276267, https://www.gartner.com/en/documents/3038017/the-five-vectors-of-complexity-that-define-your-mdm-stra, 27 April 2015.
[11] Global Product Classification (GPC), GS1 Web Site, https://www.gs1.org/standards/gpc, 2022.
[12] M. Hulsebos, et al., Sherlock: A deep learning approach to semantic data type detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, 2019, pp. 1500–1508.
[13] X. Deng, et al., TURL: Table understanding through representation learning, Proc. VLDB Endow. 14 (2020) 307–319.
[14] D. Zhang, et al., Sato: Contextual semantic type detection in tables, Proc. VLDB Endow. 13 (2020) 1835–1848.
[15] T. Sahay, A. Mehta, S. Jadon, Schema matching using machine learning, CoRR abs/1911.11543 (2019). arXiv:1911.11543.
[16] N. Barlaug, J. A. Gulla, Neural networks for entity matching: A survey, ACM Trans. Knowl. Discov. Data 15 (2021).
[17] W.-C. Tan, Deep data integration, in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD/PODS ’21, Association for Computing Machinery, New York, NY, USA, 2021, p. 2.
[18] Y. Li, et al., Deep entity matching: Challenges and opportunities, J. Data and Information Quality 13 (2021).
[19] S. Zhang, K. Balog, Web table extraction, retrieval and augmentation, in: B. Piwowarski, M. Chevalier, É. Gaussier, Y. Maarek, J. Nie, F. Scholer (Eds.), Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2019, Paris, France, July 21-25, 2019, ACM, 2019, pp. 1409–1410.
[20] S. Zhang, K. Balog, Web table extraction, retrieval, and augmentation: A survey, ACM Trans. Intell. Syst. Technol. 11 (2020).
[21] F. El Outa, et al., Towards a conceptual model for data narratives, in: Conceptual Modeling, Springer International Publishing, Cham, 2020, pp. 261–270.
[22] P. Vassiliadis, P. Marcel, S. Rizzi, Beyond roll-up’s and drill-down’s: An intentional analytics model to reinvent OLAP (long version), CoRR abs/1812.07854 (2018). arXiv:1812.07854.
[23] P. Vassiliadis, P. Marcel, S. Rizzi, Beyond roll-up’s and drill-down’s: An intentional analytics model to reinvent OLAP, Inf. Syst. 85 (2019) 68–91. doi:10.1016/j.is.2019.03.011.
[24] E. Wu, Systems for human data interaction (keynote), in: D. Mottin, et al. (Eds.), Proceedings of the 2nd Workshop on Search, Exploration, and Analysis in Heterogeneous Datastores (SEA-Data 2021@VLDB’21), 2021.
[25] Z. Shang, et al., Davos: A system for interactive data-driven decision making, Proc. VLDB Endow. 14 (2021) 2893–2905.