                                 A Prototype ISO/IEC 11179 Metadata Registry

                    Gramm Richardson                                                           Elli Schwarz
                U.S. Department of Defense                                                   SRA International
                 gpricha@tycho.ncsc.mil                                                  eliezer_schwarz@sra.com

Abstract—Different systems across the government, as well as in the private sector, use different country names or country codes to represent the notion of a “country” within a particular problem domain. These systems may choose to represent countries using a particular standard for country names and country codes. Often these systems find themselves interacting with other systems that use another standard for country representation. This makes it difficult to compare and link country-related data in a consistent fashion. We describe our work on the Constellation system, which uses the ISO/IEC 11179 metadata standard to register the various country code sets in a common metamodel. This facilitates management, querying, updating, and mapping of the elements within the code sets.

   Keywords: metadata, country codes, ontology

                       I.    INTRODUCTION

    There exist numerous international and national standards for country and country code representations. Some are designed to represent countries within a certain domain, such as the ITU-T E.164 [1] codes to represent telephone dialing codes for countries, or the ICAO [2] codes to represent country prefixes for airplane tail numbers. Other codes are attempts at international or national standardization, such as ISO 3166 [3] codes and NGA Geopolitical Codes [4]. Each of these standards has its own terminology and criteria for inclusion in its list.

    Unfortunately, there is no unambiguous, standard definition of the term “country” [5]. Many country code sets contain entries for entities that might not be thought of as countries in the common usage of the word. A code set may consider a semi-autonomous or dependent entity to be a country in its own right, or it may include non-country placeholders such as “reserved” or “unknown”. Some code sets may list a region or entity for practical, political, or diplomatic considerations, notwithstanding the entity’s precise legal status.

    To further complicate matters, these country lists are not static. Dependent territories may become independent, civil wars may split countries, two countries can unify, or a country may simply decide to change its official name. To keep up with changing realities, many of these code sets or standards organizations publish updates to their lists from time to time. This adds a chronological dimension to the maintenance of country code sets.

    All of the above factors make it necessary to maintain these code sets together in one registry that can facilitate the management, querying, and updating of these code sets. This registry can also provide a framework for tackling the challenge of mapping entities from one code set to another.

    The rest of this paper describes the Constellation metadata registry system, which uses the ISO/IEC 11179-3 Edition 3 registry metamodel standard [6] to register and map country code sets. We describe in more detail the nuances of common country code management challenges, discuss our approach to designing a country code registry using an OWL ontology based on the ISO/IEC 11179 metamodel, and explain how we handle updates. We also describe the algorithm we use to match countries across code sets.

                II.   COUNTRY CODE MANAGEMENT CHALLENGES

    The complex nature of country data poses several challenges for its management in a registry:

    • A country/geopolitical entity may have an official name and several alternate names, and some of these names may be in multiple languages.
    • In some country code standards, there may be multiple code formats for each country. For example, in ISO 3166-1, each country has trigraphs, digraphs, and numeric codes, whereas other standards may have only one code format per country.
    • One country may have multiple codes in one format, such as in the ICAO Nationality Marks code set. In that code set, South African aircraft can bear the nationality marks “ZS”, “ZT”, or “ZU”.
    • Multiple countries in a single code set may share the same code, such as in ITU-T E.164, where 25 countries share the country dialing code “1”.
    • A geopolitical entity may be a dependency of another country, like a state, territory, province, or outlying area. In ISO 3166, these entities are listed in a separate code set for dependencies, ISO 3166-2. The code set ISO 3166-1 is used for what it considers to be “top-level” (usually independent) countries. In ITU-T E.164, the dependency may be explicitly written out as part of the country name in parentheses, as in the case of “Greenland (Denmark)”. In other code sets, the administrator is ignored.
    • Some code sets may have entries for regions (such as Europe or Asia) or transnational groups (such as EU, UN, or NATO) which are not traditionally thought of as countries.
    • Code sets change over time. New versions of code sets might be released, and updates to individual entities in the code set, like code or name changes or even spelling corrections, might be issued.

    Using an ontology can be the first step toward managing some of the above complexities. The UN FAO (Food and Agriculture Organization) ontology [7] illustrates one approach to adding some degree of structure to the attributes of a country or region. It provides an OWL ontology with properties such as fao:nameOfficial and fao:nameShort for the different forms of a country name (with a language tag to indicate the language of the name), fao:validSince and fao:validUntil for valid dates for a particular country, and fao:isAdministeredBy to represent the administering country. It also provides many other properties of importance to countries, such as fao:sharesBorderWith, fao:predecessorOf, and fao:memberOf.

    Additionally, SKOS [8] can be used to provide some level of abstraction to the concept of a country and its name and code representations. Using the SKOS vocabulary in OWL provides the skos:Concept class, and instances of this class can represent countries, with properties such as skos:prefLabel to represent the preferred name, and skos:altLabel to represent other names (with language tags on the literal to represent the language of the name). SKOS Mapping Properties such as skos:closeMatch and skos:broadMatch can be used on these country instances to map similar countries or country relationships. SKOS Documentation Properties such as skos:note or skos:changeNote can be used to further describe a country and changes to a country.

    Methods of supplying the country code for a SKOS country concept have also been proposed in [9]. One possibility mentioned there is adding new properties for the different types of codes (iso3166:twoLetterCode or iso3166:numericalCode), or using a skos:prefLabel with a special private language tag to indicate the code type (such as using the skos:prefLabel property with “FR”@x-notation-twoletter as the literal).

    SKOS-XL [10] has been proposed to further extend SKOS. It provides a class skosxl:Label to further abstract the notion of a name from the country it represents, so the name can have its own properties independent of the country itself. Thus, a date or other provenance information pertaining to the name can be accommodated [11]. The Library of Congress proposed an additional ontology, MADS/RDF [12], which builds on SKOS but provides additional classes and properties designed to model geographic and other kinds of names, as well as thesauri and other controlled value lists. The Library of Congress MARC [13] codes use the MADS/RDF ontology to represent its list of geographic areas.

    Using these ontologies is a good start toward registering country code metadata in a way that manages many of the complexities listed above. However, we cannot expect that each country code set we want to register will provide its data in this fashion. Some existing code sets are provided as CSV files, with columns mapping country names to country codes, without any schema at all. Many other code sets are available only as tabular data embedded in web pages or text documents that we converted to CSV. Therefore, it is important that we allow any vocabulary or data format to be used in each particular code set, and rely on our own internal metamodel to accommodate all of these diverse data models in a uniform fashion.

    Furthermore, it is important that whatever internal metamodel we use not be proprietary, and be able to handle updates to the data without losing the data contained in earlier versions. Using a standard metamodel would enable a more widespread use and understanding of our system, and would also enable it to be used by other kinds of data besides country codes, to facilitate integration with a wider range of problem domains. Maintaining a version history of the data would be of great use if the system were to integrate with other systems that contain data from an earlier point in time. To accommodate all these issues, we chose to develop the Constellation system using the ISO/IEC 11179 metamodel standard [6] to register our country code metadata. This standard, with some of our own minor extensions, enables us to build a system that can not only register countries, codes, and mappings among these countries, but also handle different versions of the various code sets and updates.

   III.   IMPLEMENTING THE ISO/IEC 11179 METAMODEL IN OWL FOR CONSTELLATION

    The goals of the Constellation country code metadata registry are to represent the metadata using a consistent terminology, provide a uniform way of querying the data, manage updates without disrupting previous versions of the data, and facilitate storing relationships between data elements.

    The ISO/IEC 11179 metamodel describes a variety of classes, attributes, and associations between classes useful for representing metadata about country objects. In Constellation, we implemented these classes and attributes in an OWL ontology. We represent the set of all countries in a code set as an instance of the Conceptual_Domain class, and the set of country codes in that code set as a Value_Domain. Each Value_Domain can represent one country code format (e.g., digraph or numeric). In most code sets we registered, there is only one code format for each country, so there would be one Value_Domain. In other code sets, for example ISO 3166-1, there are three code formats for each country: the trigraph, digraph, and numeric codes. Each of these formats would be a separate Value_Domain within the Conceptual_Domain for ISO 3166-1. The Value_Domain is made up of a set of Permissible_Values that contain the code (known as the “permitted value”) for a country.

    Each country entry is modeled as a Value_Meaning within a Conceptual_Domain. The Conceptual_Domain is thus made up of a set of Value_Meanings. Each country can have several names (official names or other forms of the name), in multiple languages. In order to separate the concept of “country” from that of its name, we use the 11179 Designation class to represent a label or name for a country Value_Meaning. This Designation contains a “sign” property containing the actual country name, and a language identifier property to represent the language used for that name. We use a Designation_Context to describe the “acceptability” of a Designation within the context of a Conceptual_Domain. The acceptability ratings are described in ISO/IEC 11179 as being
on a scale of: preferred, admitted, deprecated, obsolete, and superseded. Only one Designation per language is “preferred” in a given Context; we use “admitted” to represent the other forms of the name.

    Value_Meanings and Permissible_Values each contain a property for begin_date and optional end_date. This is used to represent the time period when the code set considers that value to be part of its official list. Instances of these classes without an end_date are considered to be the latest valid entry. We extended the 11179 standard to add these date fields to the Designation_Context as well. If a code set has several versions (such as when new countries are added, names or codes change, etc.) we can represent this with multiple instances of the class, each with a different date range. A diagram depicting an example of some instances of these classes can be found in Fig. 1.

    The 11179 standard also provides a way to depict relationships among concepts. We use this feature to represent relationships among countries, such as when an entity is part of another country or is administered by another country. We also use this feature to represent relationships among countries that are likely to be close matches (e.g., the country named “United States” in the different code sets). These matches can be generated manually or by machine. Constellation’s semi-automated country matching algorithm [14] suggests matches based on the similarity of the names of countries in different code sets. The suggestions are then evaluated by a person who marks them as either correct or incorrect. These human judgments are recorded as rules that are used when automatically aligning entities in different code sets. We explain our approach to storing these relationships in more detail later.

    The Constellation system can thus be used to keep track of countries, country names, country codes, relationships among countries, and different versions of all of these pieces of information. This system has been successfully applied to over 15 different code sets, and it is easy to add additional ones. Table I shows some of the code sets we’ve used along with a brief description of how each code set is used.

                 TABLE I.     CODE SETS REGISTERED IN CONSTELLATION

 Code Set                                   Description

 International Organizations
 International Civil Aviation Organization  Aircraft nationality marks based on the Chicago Convention on
                                            International Civil Aviation, as reported to ICAO by national
                                            administrations. Used as the prefix of an aircraft tail number.
 International Olympic Committee            Codes identifying the National Olympic Committees/National Teams
                                            participating in the Olympics
 ISO 3166-1, ISO 3166-2                     Entities which are members of the UN or one of its specialized
                                            agencies and parties to the Statute of the International Court of
                                            Justice, or registered by the UN Statistics Division. Part 2 of the
                                            standard includes dependencies of the entities in Part 1.
 UN FAO Geopolitical Ontology               AGROVOC, FAOSTAT, FAOTERM - code sets used for agricultural
                                            statistics and projects purposes
 UN M.49 Area Codes                         Used by the United Nations for statistical purposes

 U.S. Government
 Census Schedule C                          Used by the US Census Bureau as well as the Army Corps of Engineers
 Treasury International Capital Reporting   Designations identifying countries in data files on international
                                            portfolio capital movements reported to the US Treasury Department
                                            via the Treasury International Capital reporting system.
 GSA Geographic Locator Codes               Used by US federal agencies for reporting data to the Federal Real
                                            Property Profile.
 NGA Geopolitical Codes (and dependencies)  Codes for political entities in the NGA GEOnet Names Server
                                            (Formerly FIPS 10-4).
 ITU-T E.164                                Recommendation that defines structure for telephone numbers,
                                            including country dialing codes
 ITU-T E.212                                Defines the code used in the Mobile Country Code portion of an IMSI
                                            (International Mobile Subscriber Identifier)
 International Union of Railways            Standard numerical country coding for use in railway traffic. Used
                                            as the owner’s code (3rd and 4th position) of a 12-digit wagon
                                            identification number.

    In order to facilitate the easy ingestion of data of all types, we have two main ingestion workflows: ingesting CSV files and ingesting RDF files. For CSV, we require some basic columns such as country name (with separate columns for preferred names and other languages), columns for dates, and columns for country codes. Each column header needs to be one of several that we have pre-defined. In order to ensure that all data is ingested into the system in a uniform fashion, we first convert the CSV into a general-purpose RDF format suited for easy conversion to our OWL representation of the 11179 format. We also take RDF country data in any format (such as UN FAO data, Library of Congress MARC codes, and country currency data, each of which uses a different ontology) and convert that to the general-purpose RDF format using SPARQL 1.1 scripts custom written for each of these RDF ontologies. Once this data is in the general-purpose RDF format, it is then ingested using another SPARQL 1.1 script to convert the general-purpose RDF to RDF conforming to our OWL implementation of the 11179 metamodel.
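
    The first stage of the CSV workflow can be illustrated with a minimal Python sketch. It is not Constellation's code: the column headers (preferred_name, alt_name, code, begin_date, end_date) and the "gp:" predicate names are hypothetical stand-ins, since the text does not list the pre-defined headers or the vocabulary of the general-purpose RDF format. Triples are plain tuples rather than statements in an RDF store.

```python
import csv
import io

# Hypothetical column headers; the actual pre-defined header names are not
# given in the text.
REQUIRED = {"preferred_name", "code", "begin_date"}


def csv_to_triples(csv_text, code_set):
    """Convert CSV rows into (subject, predicate, object) tuples.

    The "gp:" predicates stand in for an intermediate general-purpose
    vocabulary and are illustrative only.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = REQUIRED - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    triples = []
    for i, row in enumerate(reader, start=1):
        subject = f"{code_set}/entry{i}"
        triples.append((subject, "gp:codeSet", code_set))
        triples.append((subject, "gp:preferredName", row["preferred_name"]))
        triples.append((subject, "gp:code", row["code"]))
        triples.append((subject, "gp:beginDate", row["begin_date"]))
        # Optional columns are emitted only when present and non-empty.
        if row.get("alt_name"):
            triples.append((subject, "gp:altName", row["alt_name"]))
        if row.get("end_date"):
            triples.append((subject, "gp:endDate", row["end_date"]))
    return triples


sample = (
    "preferred_name,alt_name,code,begin_date\n"
    "United States,United States of America,US,1980-01-01\n"
    "France,,FR,1980-01-01\n"
)
triples = csv_to_triples(sample, "iso3166-1")
```

    Rejecting files whose headers are not recognized keeps the later conversion step uniform, which is the point of routing every source through one intermediate format.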
[Figure 1 content, as recoverable from the diagram:
   us_vm1: Value_Meaning         begin_date = "1980-01-01"
   us_des1: Designation          sign = "United States", language = "en"
   us_pv1: Permissible_Value     permitted_value = "US", begin_date = "1980-01-01"
   digraph_vd: Value_Domain      label = "digraph"
   Also shown: label = "ISO 3166-1"; acceptability = "preferred", begin_date = "1980-01-01"]

Figure 1. UML object diagram showing an example of Constellation’s use of the ISO/IEC 11179 metamodel, edited for clarity
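
    The classes described in Section III can be mirrored in a small Python sketch. This is an illustrative simplification, not the OWL ontology itself: the nesting of Designations and Permissible_Values inside a Value_Meaning, the value_domain string field, and the two helper methods are assumptions made for brevity. The sample values come from Fig. 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DesignationContext:
    acceptability: str  # e.g. "preferred", "admitted", "deprecated"
    begin_date: str
    end_date: Optional[str] = None  # None means still in effect


@dataclass
class Designation:
    sign: str      # the name itself
    language: str  # language identifier, e.g. "en"
    contexts: List[DesignationContext] = field(default_factory=list)


@dataclass
class PermissibleValue:
    permitted_value: str  # the code, e.g. "US"
    value_domain: str     # label of the owning Value_Domain, e.g. "digraph"
    begin_date: str
    end_date: Optional[str] = None


@dataclass
class ValueMeaning:
    begin_date: str
    end_date: Optional[str] = None
    designations: List[Designation] = field(default_factory=list)
    permissible_values: List[PermissibleValue] = field(default_factory=list)

    def preferred_name(self, language: str) -> Optional[str]:
        """Return the currently preferred name in the given language."""
        for des in self.designations:
            if des.language != language:
                continue
            for ctx in des.contexts:
                if ctx.acceptability == "preferred" and ctx.end_date is None:
                    return des.sign
        return None

    def current_codes(self) -> List[Tuple[str, str]]:
        """Return (value domain, code) pairs that have no end_date."""
        return [
            (pv.value_domain, pv.permitted_value)
            for pv in self.permissible_values
            if pv.end_date is None
        ]


# The instances shown in Fig. 1.
us_vm1 = ValueMeaning(
    begin_date="1980-01-01",
    designations=[
        Designation(
            "United States", "en",
            [DesignationContext("preferred", "1980-01-01")],
        )
    ],
    permissible_values=[PermissibleValue("US", "digraph", "1980-01-01")],
)
```

    Returning codes as a list of pairs rather than a dictionary keyed by format allows one country to carry several codes in the same format, as in the ICAO nationality marks.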

    Updates to the country code sets are performed in a purely additive fashion. No statements are actually removed from the RDF store when performing update operations on country, country code, or country name data. Each of these entities may be updated separately, allowing for incremental updating of code sets. In the case of ISO 3166-1, updates are issued on an irregular basis every few months as update newsletters. The last full version of ISO 3166-1 was published in 2006, and keeping that code set current requires implementing the updates described in the newsletters. These newsletters might correct a spelling mistake in a name, change one numeric code to another, add a new country, or describe other changes. As stored in the Constellation metadata registry, country entities, codes, and country names each have begin_dates and optional end_dates associated with them. In the case of country names, the dates are associated with the acceptability of the name's usage in a particular Designation_Context. If a code set removes an entry, it is not actually deleted from our database, but it is marked with an end_date reflecting the date this entry was removed from the code set. Any data that has an end_date is not considered part of the current set of values but part of an earlier version of the code set.

    This use of dates on Designation_Contexts is an extension to the ISO/IEC 11179 metamodel being used in Constellation. With this extension we can record a country name change in a particular standard. For example, Libya in ISO 3166-1 has changed its name. In 2006, the country was identified in ISO 3166-1 by its official long-form English name, “the Socialist People's Libyan Arab Jamahiriya”, in addition to a short form of the name. Following that country’s civil war in 2011, the ISO 3166 Maintenance Agency issued an update to the country’s name in a November 2011 newsletter, which removed the long-form English name from the entry for the country.

    To reflect this change in Constellation, an end_date value is added to the Designation_Context relating the former name to the code set. A new Designation_Context reflecting the name’s new status (in this case, “deprecated”) is added and given a begin_date. The RDF statements express the fact that a given country name ceased to be accepted and began to be deprecated on a particular date. If, rather than simply being removed, the name was changed, new statements would be added to relate the new name to the existing country and describe its usage acceptability, context, and the dates when it was used. Fig. 2 shows an RDF diagram using date fields to deprecate the old long-form name of Libya.

         V.    COUNTRY MATCHING AND RELATIONS IN ISO/IEC 11179

    When choosing a metamodel, there are many ways to model the relationships between countries across code sets. Our first approach was a country-centric approach, where we would define a unique URI for each country. Constellation’s semi-automated country matching algorithm [14] was used to determine which countries were the same or similar across code sets. That URI would be used in all code sets as the Value_Meaning representing the notional country.

    However, that approach proved problematic for many reasons. First and foremost, two different code sets may not have the same complement of values, so a given URI might not have statements in each code set. Additionally, we don’t know that each standard refers to the exact same country, even if the same name is used. For example, one code set may have an entry for United States, which would include all states and dependent territories. Another code set may have separate entries for the United States, excluding territories, and separate entries for each of the territories. A code set may even include the territories as part of its definition of United States yet still have separate entries for some of these territories. For these reasons, having one URI for United States that would be shared across code sets clearly would not be appropriate, since each code set may have a slightly different interpretation of what is indicated by the country name.
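
    The semi-automated matching step mentioned at the start of this section (name-based suggestions, followed by human judgments stored as rules) can be sketched as follows. The similarity measure of the actual algorithm [14] is not described here, so this sketch substitutes Python's standard difflib.SequenceMatcher; the threshold, the rule format, and the URIs are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name, drop commas, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").split())


def similarity(a, b):
    """Name similarity in [0, 1]; a stand-in for the real measure."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suggest_matches(set_a, set_b, rules, threshold=0.8):
    """Suggest cross-code-set matches.

    set_a, set_b: dicts mapping each code set's own URI to a name.
    rules: dict mapping (uri_a, uri_b) to True (judged correct) or
           False (judged incorrect) from earlier human review.
    """
    suggestions = []
    for uri_a, name_a in set_a.items():
        for uri_b, name_b in set_b.items():
            pair = (uri_a, uri_b)
            if pair in rules:
                # A recorded human judgment overrides name similarity.
                if rules[pair]:
                    suggestions.append((uri_a, uri_b, "confirmed"))
                continue
            if similarity(name_a, name_b) >= threshold:
                suggestions.append((uri_a, uri_b, "needs review"))
    return suggestions


# Hypothetical URIs; each code set keeps its own identifiers.
iso = {"iso:united_states": "United States", "iso:myanmar": "Myanmar"}
usg = {"usg:united_states": "United States", "usg:burma": "Burma"}
rules = {("iso:myanmar", "usg:burma"): True}

matches = suggest_matches(iso, usg, rules)
```

    Keeping the human judgments as explicit rules lets a pair with dissimilar names be matched, and lets a rejected pair stay rejected when the matcher is run again.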
[Figure 2 content, as recoverable from the diagram: http://example.org/des_libya (rdf:type :Designation), labeled “the Socialist People's Libyan Arab Jamahiriya”, is linked through :Designation_Context-scope to two :Designation_Context nodes, each also linked through :Designation_Context-scope to http://example.org/iso3166-1 (rdf:type :Conceptual_Domain and :Context). One context has :acceptability “preferred” with dates “2006-11-20” and “2011-11-08” (:end_date); the other has :acceptability “deprecated” and :begin_date “2011-11-08”.]

Figure 2. RDF diagram showing how Constellation handles deprecated country names
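
    The additive update shown in Fig. 2 can be sketched with a list of triples standing in for the RDF store. This is a minimal illustration under stated assumptions: the context identifiers (ctx1, ctx2), the ":sign" predicate spelling, the direction of the :Designation_Context-scope edges, and both helper functions are hypothetical. The URIs, dates, and acceptability values are those of Fig. 2.

```python
DES = "http://example.org/des_libya"
SCOPE = "http://example.org/iso3166-1"

# State before the November 2011 update. "ctx1" is a hypothetical identifier.
store = [
    (DES, "rdf:type", ":Designation"),
    (DES, ":sign", "the Socialist People's Libyan Arab Jamahiriya"),
    ("ctx1", "rdf:type", ":Designation_Context"),
    ("ctx1", ":Designation_Context-scope", DES),
    ("ctx1", ":Designation_Context-scope", SCOPE),
    ("ctx1", ":acceptability", "preferred"),
    ("ctx1", ":begin_date", "2006-11-20"),
]


def deprecate_name(store, designation, scope, old_ctx, new_ctx, date):
    """Close the old context and open a deprecated one; only appends."""
    store.append((old_ctx, ":end_date", date))
    store.extend([
        (new_ctx, "rdf:type", ":Designation_Context"),
        (new_ctx, ":Designation_Context-scope", designation),
        (new_ctx, ":Designation_Context-scope", scope),
        (new_ctx, ":acceptability", "deprecated"),
        (new_ctx, ":begin_date", date),
    ])


def acceptability_on(store, designation, day):
    """Return the acceptability in effect on an ISO-format date string."""
    contexts = [s for s, p, o in store
                if p == ":Designation_Context-scope" and o == designation]
    for ctx in contexts:
        props = {p: o for s, p, o in store if s == ctx}
        begin = props.get(":begin_date")
        end = props.get(":end_date")
        # ISO dates compare correctly as strings.
        if begin is not None and begin <= day and (end is None or day < end):
            return props.get(":acceptability")
    return None


original = list(store)
deprecate_name(store, DES, SCOPE, "ctx1", "ctx2", "2011-11-08")
```

    Because nothing is removed, a query for any past date still returns the status the name had at that time, while a query for a date after the update returns “deprecated”.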

    Another example of this problem is that in some standards the country China includes Hong Kong and Macau, whereas in other standards each one has its own disjoint representation. If we had one URI for China, there would be ambiguity as to what is meant by that URI: is it the URI of all of China and its dependencies, or of just mainland China? Another example is Sudan and South Sudan. One code set might have a separate entry for South Sudan (which recently became independent from Sudan), as well as for Sudan itself. However, another code set may contain one entry for Sudan, meaning both Sudan and South Sudan. This may be based on the different dates of the code sets, if one code set wasn’t yet updated after South Sudan’s independence, or the code set may not recognize South Sudan’s independence.

    Another issue with using a unique URI for each country is that two code sets may use completely different names for the same country. The reason that different names are used in a given code set may be politically motivated. The country identified in the international ISO 3166 standard as “Myanmar” is referred to by the name “Burma” in official U.S. Government documents. The entity identified as “Taiwan, Province of China” in ISO 3166 is called “Chinese Taipei” by the International Olympic Committee. Although these entries have different names, technically they are referring to the same entity.

    In all of the above cases, it is debatable whether it makes sense to use the same URI for the notional country across all code sets. Since each code set has its own idea of what an entry actually refers to, it is very difficult to determine if two code sets are using a country name in exactly the same way [15]. Therefore, we decided that each code set would use its own set of URIs (unique Value_Meanings) for its own values. Instead of relying on a common URI to map countries from one code set to another, we use 11179 Relations, which provide a way to link countries across code sets. For the names of the relationships, we use the SKOS vocabulary terms where appropriate (such as skos:closeMatch or skos:broadMatch). Use of skos:exactMatch and owl:sameAs was avoided for the same reasons we chose not to use the same URI. The 11179 standard doesn’t provide date properties for these Relations, but we can add these fields to keep track of versions just as we did for countries above.

      VI.     QUERYING CHALLENGES USING THE ISO/IEC 11179 METAMODEL

    The generic nature of the 11179 metamodel adds a great deal of complexity and abstraction to the representation of the data. This poses a challenge for querying, since even a simple query getting all country codes for a given country name can involve traversing a large amount of RDF, resulting in a lengthy and difficult-to-read SPARQL query. The 11179 Relations which we used to link related concepts to each other also add a great deal of complexity and extra statements. This is because the 11179 relations model is best suited to scale to ternary, quaternary, and higher-order relations, but it adds overhead when dealing with simpler binary relations, as will be explained below.

    We attempted to provide shortcuts in the data we ingested, but this resulted in losing some of the benefits of 11179, particularly when it came to updates. We were able to simplify querying using shortcuts such as adding an rdfs:label directly to a Value_Meaning, instead of using Designations with a “sign” property, eliminating an extra statement traversal. However, this did not allow for dates to be provided for the label itself. Eliminating Designation_Context and adding alternate name forms directly in the Designation posed a similar problem for managing the acceptability ratings. Since we don’t want to actually delete any data from our system, in order to keep previous versions of data we needed these abstractions of Designation and Designation_Context, so we can maintain dates and acceptability ratings on the Value_Meaning and Designation_Context objects independently.

    We experienced similar problems using shortcuts for 11179 Relations. In the 11179 metamodel, traversing the graph from one Concept to another Concept related by a Relation requires stepping through three intermediate objects rather than just a single predicate. We attempted to add convenience predicates (such as skos:broader) for these Relations to provide only one statement linking the two Concepts. As a result of this simplification, the SPARQL queries using the convenience predicates were much shorter and easier to read, but the convenience predicates lacked much of the descriptive power of the 11179 Relations. Fig. 3 shows a simple example of the way that relationships are represented in the 11179 metamodel, compared to how they are represented in SKOS.

    Due to our issues with shortcuts, we determined that they

 via the 11179 metamodel. Bottom - broader and narrower relations represented in SKOS.

    We are currently experimenting with applying this research to automated compliance challenges. The 11179 metamodel is useful for registering the metadata related to system policies and rules. We can then track changes to these rules, and relationships between different rules, in the same way we track changes and relationships in country code data. The Constellation registry, using the 11179 metamodel, can thus be used to address these challenges across a variety of metadata.
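
    To make the traversal cost discussed in Section VI concrete, the following Python sketch compares a reified relation, where two concepts are connected through three intermediate nodes, with a single convenience predicate. The node names (ex:link1, ex:end1, ex:end2) and the ex: predicates are hypothetical and are not the ontology's actual terms; the China and Hong Kong example follows the case described in Section V, where some standards treat Hong Kong as part of China.

```python
def objects(graph, subject, predicate):
    """All objects of triples matching (subject, predicate, ?)."""
    return [o for s, p, o in graph if s == subject and p == predicate]


def subjects(graph, predicate, obj):
    """All subjects of triples matching (?, predicate, obj)."""
    return [s for s, p, o in graph if p == predicate and o == obj]


# Reified form: concept -> end -> link -> end -> concept.
reified = [
    ("ex:link1", "ex:relation", "ex:broader_narrower"),
    ("ex:link1", "ex:has_end", "ex:end1"),
    ("ex:link1", "ex:has_end", "ex:end2"),
    ("ex:end1", "ex:role", "narrower"),
    ("ex:end1", "ex:concept", "ex:hong_kong"),
    ("ex:end2", "ex:role", "broader"),
    ("ex:end2", "ex:concept", "ex:china"),
]

# Convenience form: one statement linking the two concepts.
shortcut = [("ex:hong_kong", "skos:broader", "ex:china")]


def broader_via_relation(graph, concept):
    """Find broader concepts by walking the three intermediate nodes."""
    results = []
    for near_end in subjects(graph, "ex:concept", concept):
        if "narrower" not in objects(graph, near_end, "ex:role"):
            continue
        for link in subjects(graph, "ex:has_end", near_end):
            for far_end in objects(graph, link, "ex:has_end"):
                if far_end == near_end:
                    continue
                if "broader" in objects(graph, far_end, "ex:role"):
                    results.extend(objects(graph, far_end, "ex:concept"))
    return results


via_relation = broader_via_relation(reified, "ex:hong_kong")
via_shortcut = objects(shortcut, "ex:hong_kong", "skos:broader")
```

    Both lookups give the same answer, but only the reified form has nodes (the link and its ends) to which dates or other descriptive properties could be attached, which is the trade-off described above.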