<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using rules for assessing and improving data quality: A case study for the Norwegian State of Estate report</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ling Shi</string-name>
          <email>ling.shi@statsbygg.no</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dumitru Roman</string-name>
          <email>dumitru.roman@sintef.no</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SINTEF</institution>
          ,
          <addr-line>Pb. 124 Blindern, 0314 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Statsbygg</institution>
          ,
          <addr-line>Pb. 8106 Dep, 0032 Oslo</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Norwegian State of Estate (SoE) report service - a service providing information about central government properties in Norway - is a result of integrating cross-domain government data originating from the Norwegian cadastral system, Business Entity Register, Building Accessibility Register and Statsbygg's property management system. This paper presents a rule-based approach to assess and improve the quality of the data upon which the SoE service is built. The approach develops a set of rules to specify a common data schema, rules for data quality assessment, and three dedicated measurement metrics for data integration. Application scenarios of the approach in identifying data inconsistencies in the sources are exemplified with strategies to improve data quality.</p>
      </abstract>
      <kwd-group>
        <kwd>Rule-based approach</kwd>
        <kwd>Data quality assessment</kwd>
        <kwd>Data integration</kwd>
        <kwd>Report service</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A State of Estate (SoE) report<fn id="fn1"><label>1</label><p>An example of such a State of Estate report from the UK government can be found at https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/200448/SOFTE2012_final.pdf.</p></fn> provides a complete list of state-owned real estates<fn id="fn2"><label>2</label><p>Real estates can also be called real properties, properties, or cadastral parcels if the properties are registered in the national cadastral system.</p></fn>
and is a key input to the decision-making process of the government and other
stakeholders, with the aim of increasing the effectiveness of public resource allocation. The SoE
report in Norway is published every four years as an attachment<fn id="fn3"><label>3</label><p>https://www.regjeringen.no/contentassets/f4346335264c4f8495bc559482428908/no/sved/stateigedom.pdf</p></fn> to the proposed parliamentary
resolution No. 1 by Statsbygg<fn id="fn4"><label>4</label><p>http://www.statsbygg.no/Om-Statsbygg/About-Statsbygg/</p></fn> on behalf of the Ministry of Local
Government and Modernization<fn id="fn5"><label>5</label><p>https://www.regjeringen.no/en/dep/kmd/id504/</p></fn>. The current reporting process is manual, static and
error-prone, and the report is already outdated when it is published; it therefore does not
properly support the decision-making process. A new SoE report
generation process, realized as a reporting service, was introduced in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] to carry out the reporting task more effectively. It aims to provide users with a dynamic
and up-to-date report, including data visualization features, to better support the users’
decision-making process.
      </p>
      <p>The new SoE reporting service reuses existing government data, from both open
and proprietary sources, and integrates them so that they can serve as the basis for the
SoE service. The data sources include the Norwegian cadastral
system<sup>6</sup>, Business Entity Register<sup>7</sup>, Building Accessibility Register<sup>8</sup>, and Statsbygg's
property management system. Although the data are collected from the most authoritative
government agencies, they are not fully consistent with each other, and this
inconsistency is one of the main challenges in creating the SoE service. Our focus and
contribution in this paper is a rule-based approach that develops a set of rules
to assess and improve the data quality. A rule-based approach is suitable in this
context: it is quick to implement and easy to document and understand.</p>
      <p>The rest of this paper is structured as follows. Section 2 describes the SoE report
service case focusing on the value proposition. Section 3 presents the rule-based
approach for data quality assessment and improvement. Section 4 summarizes the paper
and outlines possibilities for further work.</p>
    </sec>
    <sec id="sec-2">
      <title>Norwegian SoE value proposition</title>
      <p>The State of Estate (SoE) service is a reporting service for state-owned properties in
Norway. The customers of the service include:
        <list list-type="bullet">
          <list-item><p>Ministry of Local Government and Modernization (KMD);</p></list-item>
          <list-item><p>Property owners in the public sector;</p></list-item>
          <list-item><p>Public audience including the media;</p></list-item>
          <list-item><p>Real estate development companies.</p></list-item>
        </list>
      </p>
      <p>
        The SoE service allows property owners in the public sector to carry out quality
assessment [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] of the data on their own real estates. It should also provide reporting and
visualization functions for state-owned properties to the above-mentioned customer
groups.
      </p>
      <p>The value proposition canvas<fn id="fn9"><label>9</label><p>https://strategyzer.com/canvas/value-proposition-canvas</p></fn> for property owners in the public sector is shown as
an example in Fig. 1 and Fig. 2. The property owners’ pains and gains are listed in
the customer segment profile. Improved data quality, reliability, completeness and
accessibility are the main gains, set against the pains of static reports, manual data
collection and quality control, and missing records. The value proposition map describes the
SoE report service and its Gain Creators and Pain Relievers, including data quality
requirements for improved quality and completeness of the report and a reduced number
of missing buildings.</p>
      <fn-group>
        <fn id="fn6"><label>6</label><p>http://www.kartverket.no/en/Land-Registry-and-Cadestre/</p></fn>
        <fn id="fn7"><label>7</label><p>https://www.brreg.no/home/</p></fn>
        <fn id="fn8"><label>8</label><p>https://byggforalle.no/uu/sok.html?&amp;locale=en</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-3">
      <title>Rule-based data quality assessment and improvement</title>
      <p>
        In order to meet the data quality requirements illustrated in the value proposition
canvas, we established a rule-based approach to assess and improve the quality
first of the source data and then of the integrated result data. Data quality rules are
contextual [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]; the rules developed in this section are therefore valid within the
context of the SoE report service, although the general method to categorize and define
the rules can be reused in other contexts. The following sub-sections cover rules
to specify a common data schema (Sub-section 3.1), rules for data quality assessment
(Sub-section 3.2), measurement metrics for data integration
(Sub-section 3.3), and strategies for improving data quality (Sub-section 3.4).
      </p>
      <sec id="sec-3-1">
        <title>Rules to specify a common data schema</title>
        <p>Data inconsistency and redundancy are well-known challenges in cross-domain data
integration. For example, the cadastral ownership information and building
information are registered both in the cadastral system and Statsbygg’s property
management system with different updating status; a property owner’s organization number
and name are registered both in the cadastral system and Business Entity Register but
the cadastral system and the Business Entity Register are not synchronized. This
sub-section proceeds in steps: first deciding the master source system for each
domain involved, then defining rules to specify a common data schema, and finally
defining the integration keys.</p>
        <p>As a first step we make a decision on which source system is the master for each
domain or sub-domain involved in the data integration. The government
organizational structure reflects the domain responsibility for government data. Both the Business
Entity Register and the Cadastral system are national registers and provide data with
relatively high quality; these two systems are therefore defined as the master, or
primary, data sources for the organization domain and the cadastral domain, respectively.
Statsbygg’s property management system is defined as a supplementary data source
for the cadastral domain. The Building Accessibility Register is defined as a
supplementary data source for the cadastral building sub-domain.</p>
        <p>Although each source system has its own data schema, no common data
schema is available for this data integration process. The next step is to define rules and
exceptions to build a common data schema at the class and attribute levels.</p>
        <p>Rules to specify a common data schema at the class level. This type of rule
decides which source system is the master for a specified class. For example, the
“Organization” class from the Business Entity Register and the “Building” class from the
cadastral system are the master classes with nationally unique identifiers. However,
there are also exceptions because of special business rules in practice. For
example, buildings smaller than 15 square meters are not required to be registered in the
cadastral register, and neither are embassy buildings in foreign countries. A
supplementary unique identifier for the “Building” class is needed to handle these exceptional
buildings, which have no nationally unique identifier from the cadastral system.</p>
        <p>Rules to specify a common data schema at the attribute level. These rules decide
which source system is the master for specific attributes. For example, the
Building Accessibility Register is the master for the accessibility attributes of a
building, even though the “Building” class in the cadastral system is defined as the master class.</p>
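        <p>As a minimal sketch, the class-level and attribute-level rules above could be encoded as two lookup tables. The source-system identifiers, the attribute name "accessibility" and the function name master_source are illustrative assumptions and are not taken from the SoE implementation; only the master roles themselves come from the text.</p>
        <preformat preformat-type="code">
```python
# Illustrative sketch: every identifier below is an assumed name.

# Class-level rules: which source system is the master for each class.
CLASS_MASTER = {
    "Organization": "business_entity_register",
    "Building": "cadastral_system",
}

# Attribute-level rules: exceptions that override the class-level master.
ATTRIBUTE_MASTER = {
    ("Building", "accessibility"): "building_accessibility_register",
}


def master_source(class_name, attribute=None):
    """Return the master source system for a class or one of its attributes."""
    # An attribute-level rule, when present, takes precedence over the
    # class-level rule; otherwise the class-level master applies.
    return ATTRIBUTE_MASTER.get((class_name, attribute), CLASS_MASTER[class_name])
```
        </preformat>
        <p>Keeping the attribute-level table separate means that further exceptions can be added without changing the class-level defaults.</p>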
        <p>The last step in this sub-section defines rules to specify the attributes that can be used
to connect heterogeneous data sources (integration keys). The integration keys are
normally the unique identifiers of the master classes. For example, the organization
number of a real estate owner is an integration key that connects the cadastral system to the
Business Entity Register. There are exceptions: for instance, a supplementary
unique identifier is needed to cover buildings smaller than 15 square meters. Both the
primary and supplementary unique identifiers are used in the integration to return a
complete building list for the SoE report service.</p>
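        <p>The use of primary and supplementary identifiers as integration keys can be sketched as follows. This is an assumption-laden illustration: the field names cadastral_building_number and supplementary_id, and the choice to let master values override supplementary values, are hypothetical and not prescribed by the text.</p>
        <preformat preformat-type="code">
```python
# Illustrative sketch: field names and merge policy are assumptions.

def building_key(record):
    """Prefer the national cadastral building number; else use a supplementary id."""
    national = record.get("cadastral_building_number")
    if national:
        return ("national", national)
    # Exception buildings (e.g. very small or abroad) have no national number.
    return ("supplementary", record["supplementary_id"])


def integrate_buildings(master_records, supplementary_records):
    """Merge building records from both sources into one dict keyed by building_key."""
    merged = {}
    for record in supplementary_records:
        merged[building_key(record)] = dict(record)
    for record in master_records:
        key = building_key(record)
        # Values from the master source override supplementary values.
        merged[key] = {**merged.get(key, {}), **record}
    return merged
```
        </preformat>
        <p>Tagging each key with its kind keeps national and supplementary identifiers from colliding if the two numbering schemes happen to overlap.</p>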
      </sec>
      <sec id="sec-3-2">
        <title>Rules for data quality assessment</title>
        <p>The result data is the product of integrating data from multiple sources. Both the source and
result data should be screened for potential syntactic and semantic errors using data
quality rules generated from existing domain models or expert knowledge. Examples
of different types of data quality rules include:
          <list list-type="bullet">
            <list-item><p>Mandatory: The property owner is mandatory for a property ownership record. The rule is broken when the property owner is missing.</p></list-item>
            <list-item><p>Data type: The area field of a building should be numeric. The rule is broken when the area field is set to the text “N/A”.</p></list-item>
            <list-item><p>Data length: A municipality number should be four digits. The rule is broken when a municipality number consists of three digits.</p></list-item>
            <list-item><p>Uniqueness: The cadastral building number should be unique. The rule is broken when one cadastral building number is registered on more than one building in Statsbygg’s property management system.</p></list-item>
            <list-item><p>Cardinality: A cadastral parcel is located in a municipality. The rule is broken if the municipality field is missing.</p></list-item>
            <list-item><p>Data domain and range: The valid values of cadastral parcel ownership types in this case should be either owned or leased. Any other value breaks the rule.</p></list-item>
          </list>
        </p>
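        <p>The rule types listed above can be illustrated with simple record checks. The sketch below assumes records are dictionaries with hypothetical field names (owner, area, municipality_number, ownership_type, cadastral_building_number) and that municipality numbers are stored as strings; none of these details are given in the text.</p>
        <preformat preformat-type="code">
```python
# Illustrative sketch: record layout and field names are assumptions.
from collections import Counter

VALID_OWNERSHIP_TYPES = {"owned", "leased"}


def check_record(record):
    """Return the names of the data quality rules broken by one record."""
    broken = []
    # Mandatory: the property owner must be present.
    if not record.get("owner"):
        broken.append("mandatory")
    # Data type: the area must be numeric (text such as "N/A" or a missing value fails).
    if not isinstance(record.get("area"), (int, float)):
        broken.append("data type")
    municipality = record.get("municipality_number")
    if municipality is None:
        # Cardinality: a parcel must be located in a municipality.
        broken.append("cardinality")
    elif not (municipality.isdigit() and len(municipality) == 4):
        # Data length: a municipality number has exactly four digits.
        broken.append("data length")
    # Data domain and range: only the two allowed ownership types are valid.
    if record.get("ownership_type") not in VALID_OWNERSHIP_TYPES:
        broken.append("data domain and range")
    return broken


def duplicated_building_numbers(records):
    """Uniqueness: return building numbers registered on more than one record."""
    counts = Counter(r["cadastral_building_number"] for r in records)
    # Every counted number occurs at least once, so a count other than 1 is a duplicate.
    return sorted(number for number, count in counts.items() if count != 1)
```
        </preformat>
        <p>Uniqueness is checked across the whole record set, whereas the other rules can be evaluated one record at a time.</p>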
      </sec>
      <sec id="sec-3-3">
        <title>Measurement metrics for data integration</title>
        <p>In addition to the above rules, Table 1 shows three measurement metrics that are
dedicated to identifying quality problems in the data integration and measuring the quality of
the integration result.</p>
        <p>The integration keys are attributes used to connect heterogeneous data sources, and
they are currently registered manually in the referring systems (systems that refer to
the key attributes). The values of the integration keys are not updated automatically
after registration. For example, the organization number is registered as an identifier
for property owners when an ownership record is created in the cadastral system;
it then remains static in the cadastral system and does not follow changes or
deletions in the Business Entity Register. Some of the integration keys are not mandatory
fields in the referring systems. Here are two examples of Key Value Quality
metrics: 1) the percentage of outdated organization numbers for the property owners in
the cadastral system; 2) the percentage of missing or outdated cadastral building
numbers in Statsbygg’s property management system.</p>
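        <p>One possible reading of the Key Value Quality metric is sketched below. It assumes that a key is "missing" when it is empty and "outdated" when it is present in the referring system but absent from the master source; the function name and the returned dictionary layout are hypothetical.</p>
        <preformat preformat-type="code">
```python
# Illustrative sketch: the definitions of "missing" and "outdated" are assumptions.

def key_value_quality(referring_keys, master_keys):
    """Return the percentages of missing and outdated integration keys.

    referring_keys: key values as registered in the referring system.
    master_keys: set of key values currently valid in the master source.
    """
    total = len(referring_keys)
    if total == 0:
        return {"missing_pct": 0.0, "outdated_pct": 0.0}
    # Missing: the (non-mandatory) key field was left empty.
    missing = sum(1 for key in referring_keys if not key)
    # Outdated: the key was registered but no longer exists in the master source.
    outdated = sum(1 for key in referring_keys if key and key not in master_keys)
    return {
        "missing_pct": 100 * missing / total,
        "outdated_pct": 100 * outdated / total,
    }
```
        </preformat>
        <p>Reporting the two percentages separately distinguishes keys that were never registered from keys that were registered but have since gone stale.</p>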
        <p>The Integration Quality metric measures the percentage of correct integrations in the
integration result. Even when a key value exists in the master source system,
it may still refer to the wrong data item. The cadastral ownership data contain
both the property owner’s organization number and name. The organization number is
used as an integration key to integrate the cadastral ownership data with the Business
Entity Register. We then identify deviations between the organization names
from the two data sources to measure the correctness of the integration result.</p>
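        <p>The name comparison described above can be sketched as follows. The sketch assumes that an integration counts as correct when the two owner names are equal after case and whitespace normalization; this equality criterion, the function names and the input layout are illustrative assumptions, and the organization numbers in any sample data would be invented.</p>
        <preformat preformat-type="code">
```python
# Illustrative sketch: the name-equality criterion is an assumption.

def normalize(name):
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.casefold().split())


def integration_quality(ownership_rows, register_names):
    """Return (percentage of correct integrations, list of name mismatches).

    ownership_rows: (organization_number, owner_name) pairs from the cadastral system.
    register_names: organization_number mapped to the name in the master register.
    """
    # Only rows whose key exists in the master register can be integrated at all.
    joined = [(number, name) for number, name in ownership_rows if number in register_names]
    mismatches = [
        (number, name, register_names[number])
        for number, name in joined
        if normalize(name) != normalize(register_names[number])
    ]
    if not joined:
        return 0.0, mismatches
    correct_pct = 100 * (len(joined) - len(mismatches)) / len(joined)
    return correct_pct, mismatches
```
        </preformat>
        <p>The returned mismatch list gives the responsible staff the concrete records to inspect, while the percentage summarizes the overall quality of the integration.</p>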
        <p>The Non-matched rows metric returns the rows from one system that cannot
be integrated with another system. This metric is especially useful when a
supplementary data source holds more up-to-date information on some data items
than the primary data source.</p>
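        <p>The Non-matched rows metric corresponds to what is commonly called an anti-join. A minimal sketch, assuming rows are dictionaries that share a hypothetical key field name:</p>
        <preformat preformat-type="code">
```python
# Illustrative sketch: rows are assumed to be dicts sharing one key field.

def non_matched_rows(rows, other_rows, key):
    """Return the rows whose key value has no counterpart in other_rows."""
    other_keys = {row[key] for row in other_rows}
    return [row for row in rows if row[key] not in other_keys]
```
        </preformat>
        <p>The metric is directional: calling the function with the two systems swapped returns the rows that are missing on the other side.</p>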
      </sec>
      <sec id="sec-3-4">
        <title>Strategies for improving data quality</title>
        <p>
          Examples of measurement metrics and corresponding quality improvement strategies
are presented in Table 2. The result of data integration [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for the SoE service is
provisioned as RDF/Linked Data through a Linked Data generation process supported by
DataGraft<fn id="fn10"><label>10</label><p>https://datagraft.io/</p></fn> [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ][
          <xref ref-type="bibr" rid="ref6">6</xref>
          ][
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and is available via the proDataMarket marketplace<fn id="fn11"><label>11</label><p>https://prodatamarket.eu/</p></fn>. SPARQL
queries have been used as the underlying mechanism to assess the data quality. The
query results help the responsible staff assess and improve the data quality in the
source systems by following the suggested strategy for quality improvement. The
updated source data, now of better quality, are then reloaded into the integration
process to produce an updated result with improved quality.
        </p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr>
                <th>SPARQL query to identify</th>
                <th>Type of measurement metric</th>
                <th>Possible reasons for mismatch</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>The owner name difference between the cadastral system and the Business Entity Register<xref ref-type="table-fn" rid="tfn12">12</xref></td>
                <td>Integration Quality</td>
                <td>
                  <p>Delayed or missing updates of owner names in the cadastral system.</p>
                </td>
              </tr>
              <tr>
                <td>The state-owned properties that are missing in the previous SoE report<xref ref-type="table-fn" rid="tfn13">13</xref></td>
                <td>Non-matched rows</td>
                <td>
                  <p>The properties were acquired after the previous report was made.</p>
                  <p>The properties were left out of the previous SoE report by oversight.</p>
                </td>
              </tr>
              <tr>
                <td>The state-owned properties from the previous SoE report that are missing in the resulting SoE report<xref ref-type="table-fn" rid="tfn14">14</xref></td>
                <td>Non-matched rows</td>
                <td>
                  <p>The properties were sold to a non-central government organization after the previous report was made.</p>
                  <p>The properties are abroad.</p>
                  <p>The owner has undergone an organizational change and the owner’s organization number is no longer valid in the Business Entity Register.</p>
                  <p>An ownership change between organizations in the public sector is not always officially registered in the cadastral system.</p>
                  <p>The owner’s organization is not officially registered as a central government organization in the Business Entity Register.</p>
                </td>
              </tr>
            </tbody>
          </table>
          <table-wrap-foot>
            <fn id="tfn12"><label>12</label><p>https://datagraft.io/prodatamarket_publisher/queries/soe-query1-the-owner-name-difference</p></fn>
            <fn id="tfn13"><label>13</label><p>https://datagraft.io/prodatamarket_publisher/queries/soe-query2-missing-soe-records</p></fn>
            <fn id="tfn14"><label>14</label><p>https://datagraft.io/prodatamarket_publisher/queries/soe-query3-missing-result-soe-records</p></fn>
          </table-wrap-foot>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Summary and outlook</title>
      <p>This paper introduced the State of Estate report service together with its value
proposition and a rule-based approach to address data quality issues. The report service is
the result of integrating cross-domain data from multiple sources: the cadastral
system, Business Entity Register, Building Accessibility Register and Statsbygg’s
property management system. A set of rules was developed to meet the data quality
requirements of the SoE report service, including rules to specify a common data schema,
rules for data quality assessment and measurement metrics for data integration.
Strategies for improving data quality were also presented. The rule-based approach is quick
to implement and easily understood by both domain experts and data engineers.</p>
      <p>As further work, the identified rules should be transformed into executable rules where
possible, so that they can be applied directly in semantic reasoning to automate the
quality assessment process. The suggested quality improvement strategies can also be
partly or fully automated to increase efficiency.</p>
      <p>Acknowledgements. The work in this paper is partly supported by the EC-funded
project proDataMarket (Grant number: 644497).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pettersen</surname>
            ,
            <given-names>B. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Østhassel</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khorramhonarnama</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berre</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015</year>
          , August).
          <article-title>Norwegian State of Estate: A Reporting Service for the State-Owned Properties in Norway</article-title>
          .
          <source>In International Symposium on Rules and Rule Markup Languages for the Semantic Web</source>
          (pp.
          <fpage>456</fpage>
          -
          <lpage>464</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Pipino</surname>
            ,
            <given-names>L. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>Y. W.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>R. Y.</given-names>
          </string-name>
          (
          <year>2002</year>
          ).
          <article-title>Data quality assessment</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>45</volume>
          (
          <issue>4</issue>
          ),
          <fpage>211</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>R. J.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Discovering data quality rules</article-title>
          .
          <source>Proceedings of the VLDB Endowment</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1166</fpage>
          -
          <lpage>1177</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Halevy</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ordille</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2006</year>
          , September).
          <article-title>Data integration: the teenage years</article-title>
          .
          <source>In Proceedings of the 32nd international conference on Very large data bases</source>
          (pp.
          <fpage>9</fpage>
          -
          <lpage>16</lpage>
          ). VLDB Endowment.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Putlier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sukhobok</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elvesaeter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berre</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zarev</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moynihan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berlocher</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <article-title>DataGraft: One-Stop-Shop for Open Data Management</article-title>
          . To appear in the
          <source>Semantic Web Journal (SWJ) - Interoperability, Usability, Applicability</source>
          (published and printed by IOS Press, ISSN: 1570-0844),
          <year>2017</year>
          , DOI: 10.3233/SW-170263.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Putlier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sukhobok</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elvesaeter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berre</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ye</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Petkov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>DataGraft: Simplifying Open Data Publishing</article-title>
          .
          <source>ESWC (Satellite Events)</source>
          <year>2016</year>
          :
          <fpage>101</fpage>
          -
          <lpage>106</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Roman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikolov</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Putlier</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elvesaeter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petkov</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <article-title>DataGraft: A Platform for Open Data Publishing</article-title>
          .
          <source>In the Joint Proceedings of the 4th International Workshop on Linked Media and the 3rd Developers Hackshop</source>
          (LIME/SemDev@ESWC
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>