<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Using PROV-O to Represent Lineage in Statistical Processes: A Record Linkage Example</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Franck Cotton</string-name>
          <email>franck.cotton@insee.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillaume Duffes</string-name>
          <email>guillaume.duffes@insee.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Flavio Rizzolo</string-name>
          <email>flavio.rizzolo@statcan.gc.ca</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut national de la statistique et des études économiques</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institut national de la statistique et des études économiques</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Statistics Canada</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <permissions>
        <copyright-statement>Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</copyright-statement>
      </permissions>
      <abstract>
        <p>Provenance and lineage, i.e. information about the origins of data and its evolution throughout its life-cycle, are becoming more critical in the world of official statistics, which is under growing pressure to be more open and transparent, and whose statistical outputs are increasingly based on administrative data and other sources of varying quality. We focus here on provenance and lineage requirements in the context of record linkage, i.e. the activity of finding and linking records from different sources that refer to the same entity, which provides one of the most complex lineage use cases in statistical production. We define a generic lineage model with different levels of granularity and explore the use of PROV-O to capture its metadata in a machine-actionable standard.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Statistical offices need to find, acquire and integrate data from both traditional and new
data sources at an ever-increasing speed. In a world of fake news and alternate facts,
the demand for trusted data has never been higher. To be able to assess the quality of
statistical outputs, it is essential to understand the data lifecycle across the entire
statistical production process (best described by the Generic Statistical Business Process Model [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]).
      </p>
      <p>Provenance and lineage comprise information on processes, methods and data sources
and their relationships to data outputs. They provide the complete traceability of where
data has resided and what actions have been performed on the data over the course of
its life. With the advent of Big Data, cloud platforms and IoT, statistical offices have
started to use more distributed data processing approaches, which makes provenance
and lineage more critical than ever before.</p>
      <p>
        The PROV data model [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a well-established model for provenance metadata and
is mapped to RDF through the PROV ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since more and more statistical
offices use RDF for metadata, it is valuable to investigate the use of PROV-O for
capturing lineage and provenance metadata about statistical processes. To this end,
we apply PROV to describe lineage metadata about an activity at the core of statistical
production: record linkage. Record linkage depends on several capabilities with
complex lineage requirements, e.g. data cleansing, standardization, coding, integration,
entity matching, etc., which need to record traceability at multiple levels of granularity,
e.g. data set, variable, record, etc. In this paper, we explore the use of PROV-O for the
representation of the lineage metadata needed to document a record linkage process.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Data Matching and Record Linkage</title>
      <p>
        Data matching, or entity resolution, is the identification, matching and linking of
different instances/representations of the same real-world entity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Entities of interest
might be people, families, businesses, taxpayers, products, countries, etc. In any
complex environment in which information integration of heterogeneous sources plays a
crucial role, entities could multiply very rapidly as a result of differences in
representations, e.g. same business with different spelling in name, address, etc., entirely
different ids, or differences in states/versions/variants, e.g. preliminary, final, raw, imputed,
v2010, v2012, etc. Entity resolution applies to all types of data integration scenarios
and has received special attention in the statistical production domain for resolving
statistical units and linking their records, which is known as record linkage [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Deciding whether records match or not is often computationally expensive and
domain specific. One way of dealing with this problem is to have entirely different,
specialized approaches for each entity domain, i.e. people, businesses, data sets, metadata,
etc. This is the approach traditionally followed in record linkage. Another approach is
to divide the problem into a generic part (to iterate over the data sets, handle
constraints and merge/link records) and a set of domain-specific similarity functions (to
determine whether two records refer to the same entity) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The latter approach allows
for a separation of concerns between the data management algorithms and the actual
pair-wise domain-specific resolution functions. Both approaches are used extensively
depending on the application.
      </p>
      <sec id="sec-2-1">
        <title>Lineage requirements for Record Linkage</title>
        <p>Linking two records about the same entity brings about a number of provenance and
lineage requirements at different levels, e.g. between the records themselves, between the
variables from different datasets that are used for linking, and between the data sets produced at different
stages of the record linkage process, etc. These different types of lineage are to be
maintained for audit and reproducibility purposes. The description of the linkage process
should include lineage metadata as specified by the model in Figure 1.</p>
        <p>
          We will use here the Record Linkage Project Process Model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] as a reference to
describe the types of lineage required for record linkage. For an in-depth presentation
of all aspects of record linkage, please refer to [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>Fig. 1. Lineage model (UML class diagram “Lineage Model”). The model relates Unit Data Sets, Unit Data Records, Unit Data Points and Micro Variables to Dimensional Data Sets, Dimensional Data Points and Aggregate Variables, and defines the following lineage types:</p>
        <p>Data Set Aggregation Lineage: Unit Data Sets are aggregated into Dimensional Data Sets. We want to track the Unit Data Sets that contributed to the Dimensional Data Set.</p>
        <p>Data Point Aggregation Lineage: unit data points are aggregated into dimensional data points, e.g. cells in a data cube. Sometimes microdata values are missing and dimensional data points are edited or imputed. Sometimes values are adjusted or normalized. We want to track the original unit data points and the transformation applied.</p>
        <p>Variable Aggregation Lineage: micro variables are aggregated into aggregate variables. We want to track the variables that contributed to the aggregation and how the aggregation was computed.</p>
        <p>Data Set Lineage: a data set is derived from other data sets as a result of a merge or an integration. We want to track the original data sets and the transformation applied.</p>
        <p>Record Lineage: a data set includes records from other data sets, either (i) as-is or (ii) integrated via some record linkage process. We want to track (i) where the record came from or (ii) which records are its contributors and what integration was applied.</p>
        <p>Data Point Lineage: data points are changed by data cleansing, edit &amp; imputation, normalization, adjustments, etc. We want to track the original data point and the transformation applied.</p>
        <p>Variable Lineage: variables are derived from other variables. We want to track the original variable(s) that contributed to the derived variable and the transformation applied.</p>
        <p>The first preliminary step involving data manipulation with lineage requirements is
4.1 Standardize linkage variables. In this step, the source data sets are subset for
privacy protection to contain only the linkage variables. Variables commonly kept include
names, demographic variables (e.g. sex, date of birth), geographic variables (e.g. postal
codes) and unique identifiers (e.g. social or health insurance numbers, business
numbers). Values contained in the linkage variables may be combined or concatenated to
create linkage keys. Structure, format (e.g. data types) and codesets are standardized to
facilitate comparability and integration. The resulting data sets are the standardized data
sets.</p>
        <p>This step requires data set lineage between the source data sets and their respective
standardized data sets, as well as variable lineage between the linkage variables. The
latter will include information about the changes in formats and codesets in the cases
where the variables were transformed in the process of creating the standardized data
sets. The step is illustrated in Figure 2.</p>
        <p>Metadata required: standardized linkage variables, standard formats and codesets.</p>
        <p>Fig. 2. Subset variables and standardize. The Source Data Sets (the complete data sets to be linked) go through a “Subset variables and standardize” step that selects and standardizes the linkage variables to ensure comparability, producing the Standardized Data Sets (source data sets containing only the linkage variables, i.e. variable subsetting), with data set lineage and variable lineage between them.</p>
        <p>The next step is 4.3 Identify in-scope records for linkage. This step defines inclusion
and exclusion criteria to identify records from the standardized data sets that are eligible
for record linkage – records with incomplete or missing values may be considered
ineligible for linkage. In all cases, counts of eligible and ineligible records are produced
for each standardized data set. In some cases, a new collection of linkage-ready data
sets containing only eligible records is produced, which requires data set lineage
between them and the standardized data sets with all records. The step is illustrated in
Figure 3.</p>
        <p>Metadata required: inclusion and exclusion criteria, counts of eligible and ineligible
records.</p>
        <p>Fig. 3. Subset standardized data sets to in-scope records. Inclusion and exclusion criteria are defined to identify records eligible for record linkage; the “Subset records” step takes the Standardized Data Sets (containing only the linkage variables) to the Linkage-Ready Data Sets (standardized data sets containing only in-scope records), with data set lineage between them.</p>
        <p>From the linkage-ready data sets containing either all or only eligible records, the
actual record linkage process starts. The first linkage step is 5.1 Identify potential pairs.
This could be done by generating the cross product of the linkage-ready data sets, i.e.
compare all possible pairs of records, or by blocking, i.e. partitioning the linkage-ready
data sets into mutually exclusive and exhaustive blocks of records to reduce the number
of pairs to be compared, since comparison is restricted to pairs of records only within a
block.</p>
        <p>The goal of blocking is to reduce the number of pairs to compare by removing pairs
that are unlikely to produce true matches. Blocking keys are used to define the blocks
of similar pairs. For instance, a common blocking key for person units is postal code,
which is based on the assumption that two records with similar postal codes are
more likely to refer to the same person. This comparison is based on some similarity
function to be applied to the blocking keys.</p>
        <p>Computing the full cross-product, let alone creating a cross-product data set, is often
unfeasible, but computing the cross-product within blocks is not. This requires both
data set lineage from the linkage-ready data sets to each block, and also record lineage
from each pair of linkage-ready data set records to each resulting block record. The step
is illustrated in Figure 4.</p>
        <p>Metadata required: blocking keys, similarity function.</p>
        <p>Fig. 4. Identify potential pairs (blocking). This step identifies the set of pairs to consider as potential links: the Linkage-Ready Data Sets (standardized data sets containing only in-scope records) are partitioned into Blocks (potential pairs of records containing only the linkage variables, i.e. the cross-product within blocks), with data set lineage between them.</p>
        <p>Once the pairs of records are established, the candidate record pairs in the blocks need
to be fully compared to determine their overall similarity. This similarity is calculated
by comparing several variables beyond the blocking variables used to define the blocks.
For instance, if the blocking was based on postal code, then street name and last name
could be used now to compare pairs of records within the blocks.</p>
        <p>Record comparison could be done deterministically or probabilistically. In
deterministic record linkage, a pair of records is said to be a match if they agree on each element
within a collection of variables called the match key. For example, if the match key
consists of last name, street number and name, and year of birth, a comparison function
could define a match to occur only when names agree on all characters, the years of
birth are the same, and the street numbers are identical.</p>
        <p>Under the Fellegi–Sunter model for probabilistic linkage, pairs of records are
classified as match, possible match, or no match. This is determined by comparing the
likelihood ratio of having a match, namely R, against two thresholds called Upper and
Lower. If R is greater than or equal to Upper, then the pair is considered a match. If R
is smaller than or equal to Lower, then the pair is considered a no match. If R is
somewhere between Lower and Upper then the pair is considered a possible match and is
submitted to manual review for a final decision.</p>
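        <p>In symbols, writing R for the likelihood ratio computed for a given record pair, the decision rule just described is:</p>
        <preformat>
decision(R) = match           if R &gt;= Upper
              possible match  if Lower &lt; R &lt; Upper  (sent to manual review)
              no match        if R &lt;= Lower
        </preformat>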
        <p>Deterministic and probabilistic comparisons and validation are done in 5.2 Field and
record comparison, 5.3 Linkage rules, and 6 Assess quality. Record comparison relies
on field comparison, which is done using comparison functions or linkage rules.</p>
        <p>The result of this iterative process is a set of linkage keys. A linkage key is a code
created to identify the records related to the same entity in the source data sets. As such,
it does not contain any identifying information that may have been used to create the
links. The lineage involved in these steps includes a combination of record lineage and
variable lineage, since the block records that match need to be traced back to their
respective blocks and are assigned a linkage key, which functions as a new variable. This step
is illustrated in Figure 5.</p>
        <p>Metadata required:
─ Deterministic record linkage: comparison function, match key.
─ Probabilistic record linkage: likelihood ratio R, Lower, Upper, field agreement
weights (per linkage variable), agreement level weights (per record pair), record
similarity, agreement level (match, possible match, no match).
─ Both: linkage keys.</p>
        <p>Fig. 5. Compare records to find matches and produce linkage keys. Iterative record comparison and review is performed on the Candidate Pairs (potential pairs of records containing only the linkage variables, within blocks) to produce the Linkage Keys that identify records related to the same entity, with record lineage and variable lineage between them; values contained in the linkage variables may be combined or concatenated to create the linkage keys.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Using PROV-O</title>
      <sec id="sec-3-1">
        <title>Vocabularies used</title>
        <p>
          As the title of the paper suggests, we will be using the PROV vocabulary [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] to
represent the different entities and activities that we identified in the record linkage process.
PROV is expected to be of great utility for the representation of statistical processes
at various levels of detail, and it is actually one of the pillars of ongoing work
recently launched by the standardization activity of the UNECE ModernStats initiative
(https://statswiki.unece.org/display/hlgbas), which aims at building a Core Ontology for
the Official Statistics community (https://github.com/linked-statistics/COOS/).
        </p>
        <p>
          Apart from PROV, we use in the following examples a few well-known vocabularies
with their usual namespaces and prefixes, as well as elements from the Data Quality
Management Vocabulary [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] in the detailed models.
        </p>
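        <p>For reference, the examples below can be read with prefix declarations along the following lines. The rdfs, xsd and prov namespaces are standard; the dqm, coos and reclink namespace URIs shown here are assumptions made for illustration, not authoritative declarations:</p>
        <preformat>
@prefix rdfs:    &lt;http://www.w3.org/2000/01/rdf-schema#&gt; .
@prefix xsd:     &lt;http://www.w3.org/2001/XMLSchema#&gt; .
@prefix prov:    &lt;http://www.w3.org/ns/prov#&gt; .
# URI assumed from [8]
@prefix dqm:     &lt;http://semwebquality.org/dqmvocabulary/v1/dqm#&gt; .
# Illustrative URI; the COOS namespace is provisional
@prefix coos:    &lt;http://example.org/coos#&gt; .
# Illustrative namespace for the record linkage resources
@prefix reclink: &lt;http://example.org/reclink#&gt; .
        </preformat>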
      </sec>
      <sec id="sec-3-2">
        <title>A very simple case</title>
        <p>We start in this section from a simplified model of record linkage. Basically, we match
two datasets A and B and the record linkage produces three outputs: the set of records
that are matched, the set of records that are not matched, and the set of records that
could be matches but need additional verification (through human review for example).
Modelling this simplified view with PROV will allow us to grasp the principles of the
modelling approach and how to attach lineage metadata to the process.</p>
        <p>In PROV-O terms, we can represent the inputs and outputs of the record linkage
operation as instances of prov:Collection, composed of instances of
coos:StatisticalDataset. In a simple PROV representation, a “Match input datasets”
activity uses the input collection and generates the output collection. This output collection
is linked to the input collection by a prov:wasDerivedFrom property. This simple
model is represented in the following figure.</p>
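        <p>As an illustration, a minimal Turtle sketch of this simple model could look as follows (the resource names are illustrative, and the prefixes are those listed in Section 3.1):</p>
        <preformat>
# The two datasets to be matched, grouped into an input collection
reclink:datasetA a coos:StatisticalDataset .
reclink:datasetB a coos:StatisticalDataset .

reclink:inputDatasets a prov:Collection ;
    rdfs:label "Input datasets"@en ;
    prov:hadMember reclink:datasetA , reclink:datasetB .

# The record linkage operation seen as a single activity
reclink:matchInputDatasets a prov:Activity ;
    rdfs:label "Match input datasets"@en ;
    prov:used reclink:inputDatasets .

# The output collection: matched, unmatched and possibly-matched record sets
reclink:outputDatasets a prov:Collection ;
    rdfs:label "Output datasets"@en ;
    prov:wasGeneratedBy reclink:matchInputDatasets ;
    prov:wasDerivedFrom reclink:inputDatasets .
        </preformat>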
        <p>In the Statistics Canada model (Figure 1), the link between input and output data sets
should be described at a high level with rich data set lineage metadata. For now, we only
have a prov:wasDerivedFrom RDF predicate, which is clearly not fit for purpose
since RDF predicates cannot bear additional information. We thus have to use the
qualification mechanism described in the PROV-O specification. According to this
mechanism, the prov:wasDerivedFrom property can be qualified by a
prov:qualifiedDerivation property pointing from the output entity to a
prov:Derivation instance.</p>
        <p>It is recommended in PROV to define sub-classes of prov:Derivation for specific
use cases. Here, referring to the Statistics Canada model, we can define the
coos:DataSetDerivation class (note that the placement in the COOS namespace is
provisional). An instance of this class, “Derive outputs”, is then created, connecting the
output dataset to the inputs and to the activity. More lineage metadata can also be attached
to the “Derive outputs” derivation. The corresponding model is given in the following
figure.</p>
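        <p>A hedged Turtle sketch of this qualification, using the same illustrative identifiers, could be:</p>
        <preformat>
# Qualify the derivation of the outputs from the inputs
reclink:outputDatasets prov:qualifiedDerivation reclink:deriveOutputs .

# coos:DataSetDerivation is a (provisionally placed) sub-class of prov:Derivation
reclink:deriveOutputs a coos:DataSetDerivation ;
    rdfs:label "Derive outputs"@en ;
    prov:entity reclink:inputDatasets ;           # the entity the outputs derive from
    prov:hadActivity reclink:matchInputDatasets . # the activity that performed the derivation
        </preformat>
        <p>Any additional data set lineage metadata required by the Figure 1 model can then be attached to the reclink:deriveOutputs resource.</p>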
        <p>
          We can enrich the model by adding information on the agents involved in the matching
operation. We will keep the model simple and just suppose that the matching is performed
by Istat, the Italian statistical organization, using the RELAIS software [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The
software will be represented as an instance of the prov:SoftwareAgent class, and Istat
by an instance of the coos:Organization class (which is a sub-class of
prov:Organization). The matching activity is then connected to the RELAIS resource by a
prov:wasAssociatedWith property.
        </p>
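        <p>In Turtle, again as a sketch with illustrative identifiers, this could read (the prov:actedOnBehalfOf link between software and organization is an assumption, not stated above):</p>
        <preformat>
reclink:relais a prov:SoftwareAgent ;
    rdfs:label "RELAIS"@en ;
    prov:actedOnBehalfOf reclink:istat .  # assumed relation between software and organization

reclink:istat a coos:Organization ;       # coos:Organization is a sub-class of prov:Organization
    rdfs:label "Istat"@en .

reclink:matchInputDatasets prov:wasAssociatedWith reclink:relais .
        </preformat>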
        <p>It should be noted that the qualification mechanism can be applied to all the properties
of the basic model, which makes it possible to add information on how the inputs were used by
the activity or how the outputs were generated. The fully-qualified model is given in
the following figure.</p>
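        <p>A sketch of such fully-qualified statements (using blank nodes for the qualification instances) could be:</p>
        <preformat>
reclink:matchInputDatasets
    prov:qualifiedUsage [
        a prov:Usage ;
        prov:entity reclink:inputDatasets   # how the inputs were used can be detailed here
    ] ;
    prov:qualifiedAssociation [
        a prov:Association ;
        prov:agent reclink:relais           # the agent's role can be detailed here
    ] .

reclink:outputDatasets
    prov:qualifiedGeneration [
        a prov:Generation ;
        prov:activity reclink:matchInputDatasets
    ] .
        </preformat>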
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Digging deeper</title>
      <p>
        For some use cases, the high-level view described in the previous section is sufficient,
but if we want to describe more precisely the statistical process it is necessary to break
it down into more fine-grained steps. In this section, we will use PROV again to
describe the four diagrams introduced in the first part of this paper. In addition, we will
include statements conforming to the Data Quality Management Vocabulary [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. We
will call the four steps “Class 1” (Figure 2) to “Class 4” (Figure 5) in order to make
unambiguous references to the descriptions made above.
      </p>
      <p>Each step has been modelled as an RDF/Turtle fragment, and the four fragment files
can be found on GitHub (https://github.com/FranckCo/PROV-Process/blob/master/src/main/resources/data). For the sake of brevity, we will only deal with fragments 2
(Figure 3) and 4 (Figure 5) below.</p>
      <sec id="sec-4-1">
        <title>From standardized data sets to linkage-ready data sets</title>
        <p>The following Turtle fragment and the picture below correspond to the “Class 2”
(Figure 3) step.</p>
        <preformat>
reclink:standardizedDatasets a prov:Entity ;
    rdfs:label "Standardized datasets"@en .

reclink:subsetInScopeRecords a prov:Activity ;
    rdfs:label "Identify in-scope records"@en ;
    prov:wasInformedBy reclink:assessLinkageVariable .

reclink:assessLinkageVariable a prov:Activity ;
    rdfs:label "Assess linkage variables"@en .

reclink:inclusionExclusionCriteria a prov:Entity , dqm:MatchingValueRule ;
    dqm:assessment "true"^^xsd:boolean .
        </preformat>
        <p>The prov:Entity named reclink:standardizedDatasets is a (collection of)
subsets from the source datasets that only include the linkage variables, referred to as
“Standardized Data Sets” in the “Class 1” (Figure 2) diagram. The
reclink:assessLinkageVariable activity evaluates the quality and discriminatory power of the
linkage variables, which are expected to impact both the quality of the linkage and ultimately the
use of the matched data. This activity also generates the
reclink:inclusionExclusionCriteria entity, which establishes inclusion and exclusion criteria to identify
records from the source data sets that are eligible for record linkage. These criteria are
a specific form of multi-property requirements in which external data are used to
identify data requirement violations in an instance, which makes them instances of
dqm:MatchingValueRule.</p>
      </sec>
      <sec id="sec-4-1">
        <title>From candidate pairs to linked keys</title>
        <p>This last part of the record linkage operation corresponds to the “Class 4” (Figure 5)
step above. The Turtle code of this fragment is too long to be included below but can
be found on GitHub (https://github.com/FranckCo/PROV-Process/blob/master/src/main/resources/data/fragmentclass-4.ttl). We see in this fragment that the reclink:blocks PROV entity,
i.e. Blocks in the “Class 3” (Figure 4) diagram, is derived from
reclink:linkageReadyDatasets, i.e. Linkage-Ready Data Sets in the “Class 2” diagram (Figure
3), through the activity reclink:identifyPotentialPairs, i.e. Identify potential
pairs in the “Class 3” diagram (Figure 4). The reclink:compareRecords activity, i.e.
Compare records in the “Class 4” diagram (Figure 5), involves the comparison of string or
numeric representations of the paired records. The entity
reclink:comparisonOutcomes, generated by the reclink:compareRecords
activity, is used to make a linkage decision to determine whether each record pair is a
match or a non-match: either through a probabilistic method with the
reclink:makeLinkageDecisionsProbabilistic activity or a deterministic one
with the reclink:makeLinkageDecisionsDeterministic activity, i.e. Produce
linkage keys in the “Class 4” diagram (Figure 5).</p>
        <p>In a deterministic linkage, this decision is based on a sequence of logical conditions
(i.e. the reclink:logicalConditions entities) following a rule-based approach.
The logical conditions state roughly that values of a given tested property must always
obtain a certain state under the condition that values of another property obtain a certain
state (the condition). They can therefore also be regarded as instances of
dqm:ConditionRule. In a probabilistic linkage, the entity reclink:linkagePairsWeight is
calculated for each record pair and compared to two thresholds in order to make a
decision. These weights are computed and used, in effect, as data quality scores, which
makes them instances of dqm:DataQualityScore.</p>
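        <p>A condensed Turtle sketch consistent with this description (not a verbatim excerpt of the published fragment) is:</p>
        <preformat>
reclink:comparisonOutcomes a prov:Entity ;
    prov:wasGeneratedBy reclink:compareRecords .

# Deterministic decision: rule-based logical conditions
reclink:logicalConditions a prov:Entity , dqm:ConditionRule .
reclink:makeLinkageDecisionsDeterministic a prov:Activity ;
    prov:used reclink:comparisonOutcomes , reclink:logicalConditions .

# Probabilistic decision: per-pair weights compared to thresholds
reclink:linkagePairsWeight a prov:Entity , dqm:DataQualityScore .
reclink:makeLinkageDecisionsProbabilistic a prov:Activity ;
    prov:used reclink:comparisonOutcomes , reclink:linkagePairsWeight .
        </preformat>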
        <p>The final output is a linkage key file, i.e. Linkage keys in the “Class 4” diagram
(Figure 5): reclink:linkageKeys is an anonymized file that contains only the
unique identifiers necessary for identifying the records related to the same entity in the
source data sets (reclink:sourceDatasets).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We showed in this paper that PROV-O can be used to represent lineage of statistical
processes at different levels of detail, from the high-level view where a complex
operation like record linkage is considered as a single activity, to much more fine-grained
representations. An aspect to be addressed in future work (but which is quite easy to
grasp conceptually) is how a dataset can be viewed as a collection of records and
variables that can be modelled as PROV entities, so that the operations described in this paper
at the dataset level can be directly extended to the record or variable (or block) levels.</p>
      <p>
        The precise nature and content of the lineage metadata itself, e.g. similarity
functions, codesets, likelihood ratios, weights, in the record linkage example, can also be
represented in RDF to be attached to PROV constructs. One solution to do that could
be to use a flexible mechanism, like the one described in the SDMX information model
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. An ongoing initiative [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is currently attempting to express this part of the SDMX
model in OWL, which would make it possible to express in RDF any SDMX Metadata Structure
Definition and Metadata Set.
      </p>
      <p>Metadata is always more powerful when used in an active way to actually implement
(and not only document) the statistical process: this is the paradigm of "active metadata"
which has been developed in the statistical community in recent years. We can actually
change perspectives and use provenance metadata to specify the process, with
automatic process generation from this formal specification: that is where
machine-actionability becomes really interesting.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>The</given-names>
            <surname>Generic Statistical Business Process Model</surname>
          </string-name>
          (GSBPM), https://statswiki.unece.org/display/GSBPM/GSBPM+v5.1, Version 5.1,
          <string-name>
            <surname>January</surname>
          </string-name>
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>PROV-DM: The PROV Data Model</surname>
          </string-name>
          , https://www.w3.org/TR/prov-dm/,
          <source>W3C Recommendation 30 April</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>PROV-O: The PROV Ontology</surname>
          </string-name>
          , https://www.w3.org/TR/prov-o/,
          <source>W3C Recommendation 30 April</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Christen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Data Matching: Concepts and Techniques for Record Linkage</article-title>
          , Entity Resolution, and Duplicate Detection.
          <source>Springer 2012. ISBN 978-3-642-31163-5.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Herzog</surname>
            ,
            <given-names>T. N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scheuren</surname>
            ,
            <given-names>F. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          :
          <article-title>Data quality and record linkage techniques</article-title>
          .
          <source>Springer 2007, ISBN 978-0-387-69502-0</source>
          ,
          <string-name>
            <surname>pp. I-XIII</surname>
          </string-name>
          ,
          <fpage>1</fpage>
          -
          <lpage>227</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Whang</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Molina</surname>
          </string-name>
          , H.:
          <article-title>Developments in Generic Entity Resolution</article-title>
          ,
          <source>IEEE Data Engineering Bulletin</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>59</lpage>
          , Sept.
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Sanmartin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Trudeau</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>The Record Linkage Project Process Model</article-title>
          . In: UNECE Workshop on Implementing Standards for Statistical Modernisation,
          <year>Sept 2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>The</given-names>
            <surname>Data Quality Management Vocabulary Specification</surname>
          </string-name>
          , http://semwebquality.org/dqmvocabulary/v1/dqm, V
          <volume>1</volume>
          .00,
          <string-name>
            <surname>Release</surname>
          </string-name>
          09-
          <fpage>10</fpage>
          -
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>RELAIS (REcord Linkage At IStat</surname>
          </string-name>
          ), https://www.istat.it/en/methods-and-tools/methodsand-it-tools/process/processing-tools/relais,
          <source>Last edit: 20 March</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>SDMX</given-names>
            <surname>Information</surname>
          </string-name>
          <article-title>Model: UML Conceptual Design, version 2.1, Chapter 7: Metadata Structure Definition</article-title>
          and Metadata Set, p.
          <fpage>75</fpage>
          -
          <lpage>91</lpage>
          ,
          <year>July 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>SDMX</surname>
          </string-name>
          <article-title>Metadata: An RDF vocabulary for representing the SDMX metadata model</article-title>
          , https://linked-statistics.github.io/SDMX-Metadata/sdmx-metadata.html,
          <source>W3C Draft Community Group Report 09 April</source>
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>