<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identitas: A Better Way To Be Meaningless</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nizal Alshammry</string-name>
          <email>N.K.E.Alshammry2@newcastle.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Phillip Lord</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computing Science, Northern Borders University</institution>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computing Science</institution>
          ,
          <addr-line>Newcastle University NE1 7RU</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>It is often recommended that identifiers for ontology terms should be semantics-free or meaningless. In practice, ontology developers tend to use numeric identifiers, starting at 1 and working upwards. Here we describe a number of significant flaws in this scheme, and the alternatives to it which we have implemented in our library, identitas. Software is available from https://github.com/phillord/identitas.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>During the years in which ontologies have become a standard
part of the biomedical chain, a set of standard practices has built
up to enable their good management. These include
the addition of standardised metadata about each ontology term,
such as labels, definitions and editorial status.</p>
      <p>
        One key piece of metadata is the identifier. For most ontological
technologies this takes the form of an IRI (Internationalized Resource
Identifier), or something that is convertible into one. Much has been
written about the nature of identifiers and how they should be chosen.
The perceived wisdom is that identifiers should be semantics-free or
meaningless. The key aim here is to enable persistence of access to a
term [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]; an identifier which is based on some semantics associated
with the term may need to be changed when that aspect changes,
even if the change does not reflect a change in the ontological
semantics.
      </p>
      <p>
        As an example, OBO Foundry principles [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] provide guidelines
for identifiers; these include management principles (“The
IDspace / prefix must be registered with the OBO library in advance.”),
syntactic constraints (“The URI should be constructed from a base
URI, a prefix that is unique within the Foundry (e.g. GO, CHEBI,
CL) and a local identifier (e.g. 0000001).”), and a strong
commitment to semantics-free IDs (“The local identifier should
not consist of labels or mnemonics meaningful to humans.”). No
specific advice is given on the form of the local identifier; however,
in practice OBO identifiers use numeric IDs, 8 numerals long,
which increase approximately monotonically.
      </p>
      <p>While semantics-free identifiers have their advantages, there
are distinct disadvantages as well, especially for humans. They
are poorly mnemonic, hard to differentiate from each other and
relatively difficult to read. For this reason, many
bioinformatics databases provide both semantics-free accession
numbers (which are essentially the same thing as an identifier in
ontology terminology), and an identifier (which is rather like a
compressed, syntactically predictable label). It is also interesting
to note that, in software development, programmers emphasise
the importance of semantically meaningful identifiers, and use
other techniques to manage change.</p>
      <p>In this paper, we ask whether it is possible to overcome these
and some related issues with monotonic, numeric identifiers while
remaining semantics-free. We describe our solutions, along with the
identitas library which implements these.</p>
      <p>
        Racing: One unusual aspect of ontological identifiers is that
they are usually monotonically increasing. This causes a significant
race condition if two developers are building a single ontology in
parallel. If both attempt to add a new term, each must coin a
new identifier, which must be unique. This is impossible to achieve
without some degree of co-ordination. One typical strategy is for
developers to pre-coordinate by using a
pre-allocation scheme. For example, one developer is allocated
the IDs from 1 to 1000, another 1001 to 2000, and
so on. This approach is effective; however, it requires developers
to manage the ID space accurately, and also reduces the overall
ID space, since preallocated IDs cannot be used elsewhere. Another
approach is to co-ordinate just-in-time; for example, the URIGen [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]
server enables this approach in Protégé. Projects such as EFO
(Experimental Factor Ontology) and SWO (Software Ontology) use
this to manage their namespace. A final approach is to use temporary
IDs, and then allocate final IDs at a single, co-ordinated point in
the development process; URIGen also does this to enable off-line
working.
      </p>
      <p>We propose a much simpler approach, which is to use
random IDs as permanent identifiers and not just as temporary ones. While randomness
does not a priori completely remove the potential race condition,
given a large enough identifier space the chance of collision can be
reduced far enough to provide world (or universe) uniqueness. This approach is
commonly used, with random UUIDs (Universally Unique Identifiers)
being perhaps the most common example.</p>
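      <p>As a minimal sketch of this idea, the following Python fragment lets two developers coin identifiers independently, with no shared counter. It assumes Python's standard <monospace>secrets</monospace> module; the function name <monospace>coin_identifier</monospace> and the 64-bit width are illustrative assumptions and are not taken from the identitas API.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import secrets

ID_BITS = 64  # assumed width, chosen so that collisions are very unlikely


def coin_identifier(bits: int = ID_BITS) -> int:
    """Return a random non-negative integer below 2**bits."""
    return secrets.randbits(bits)


# Two developers working in parallel coin identifiers without co-ordinating.
developer_a = {coin_identifier() for _ in range(1000)}
developer_b = {coin_identifier() for _ in range(1000)}

# Any overlap would be a collision between the two developers.
collisions = developer_a & developer_b
print(len(collisions))
```
]]></preformat>
      <p>No pre-allocated ranges or identifier server are needed here; the only shared decision is the width of the identifier space.</p>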
      <p>
        Pronouncing: The use of randomness raises a secondary issue.
These identifiers are likely to be relatively long, exacerbating the
problems of memorability and pronounceability. One solution to
this problem is simply not to show the identifiers to humans. With
tools like Protégé this is possible, of course, because it has a view
which may be different from the underlying model. With text
file formats, including OBO format, the various OWL serialisations
or the Tawny-OWL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] programmatic representation, this is rather
harder (although the latter does provide a mechanism for achieving
this). It is also difficult to do this for programmers developing tools
like Protégé, who are themselves using general tools such as IDEs,
debuggers and version control systems.
      </p>
      <p>
        We have considered using a dictionary-based approach, to replace
numeric identifiers with English words. However, this approach
raises the probability of selecting a word which is inappropriate or
unfortunate – consider the Sonic Hedgehog gene, mutations of which
cause holoprosencephaly in humans. Instead, we are investigating
a solution in the form of the proquint [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This is a library built
to encode numbers as a set of strings of alternating consonants
and vowels. Each consonant provides four bits of information, each
vowel only two bits, as shown in Figure 1. Thus, sixteen bits can be
represented using five letters (3 consonants, 2 vowels).
      </p>
      <p>Four bits as a consonant (hexadecimal value, then letter):
0 = b, 1 = d, 2 = f, 3 = g, 4 = h, 5 = j, 6 = k, 7 = l,
8 = m, 9 = n, A = p, B = r, C = s, D = t, E = v, F = z.</p>
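      <p>The packing of sixteen bits into five letters can be sketched as follows. This is an illustrative Python fragment, not the identitas code; the function name <monospace>quint</monospace> is hypothetical, and the vowel alphabet (a, i, o, u for the values 0 to 3) is taken from the proquint proposal rather than from the consonant table above.</p>
      <preformat preformat-type="code"><![CDATA[
```python
CONSONANTS = "bdfghjklmnprstvz"  # four bits each, values 0..15
VOWELS = "aiou"                  # two bits each, values 0..3 (proquint proposal)


def quint(word: int) -> str:
    """Encode one 16-bit word as consonant-vowel-consonant-vowel-consonant."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("expected an unsigned 16-bit value")
    return (
        CONSONANTS[(word >> 12) & 0xF]    # bits 15..12
        + VOWELS[(word >> 10) & 0x3]      # bits 11..10
        + CONSONANTS[(word >> 6) & 0xF]   # bits 9..6
        + VOWELS[(word >> 4) & 0x3]       # bits 5..4
        + CONSONANTS[word & 0xF]          # bits 3..0
    )


print(quint(0))       # babab
print(quint(0xFFFF))  # zuzuz
```
]]></preformat>
      <p>The three consonants carry 12 bits and the two vowels carry 4 bits, which together account for the 16 bits of the word.</p>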
      <p>For example, a numeric identifier 10 associated with some term in
a given ontology would be translated to babab-babap, and 11 would
be translated to babab-babar, by the proquint function; both
are quite readable, spellable and pronounceable strings. In practice,
if used to represent random numbers, the proquints would rarely
be so close in alphabetic space. Note that each proquint maps directly
to a single number, so the two can be freely converted in either direction,
and that proquints are alphabetically ordered. Mappings for integer
values are shown in Figure 2.</p>
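      <p>The two-way mapping described here can be sketched in Python as below. The function names <monospace>to_proquint</monospace> and <monospace>from_proquint</monospace> are hypothetical, values are treated as unsigned, and the vowel alphabet a, i, o, u is assumed from the proquint proposal; this is an illustration and not the identitas implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
CONSONANTS = "bdfghjklmnprstvz"  # four bits per letter
VOWELS = "aiou"                  # two bits per letter


def to_proquint(value: int, words: int = 2) -> str:
    """Encode an unsigned integer as `words` hyphen-separated five-letter groups."""
    if not 0 <= value < 1 << (16 * words):
        raise ValueError("value does not fit in the requested number of words")
    groups = []
    for i in reversed(range(words)):  # most significant 16-bit word first
        w = (value >> (16 * i)) & 0xFFFF
        groups.append(
            CONSONANTS[w >> 12] + VOWELS[(w >> 10) & 3]
            + CONSONANTS[(w >> 6) & 15] + VOWELS[(w >> 4) & 3]
            + CONSONANTS[w & 15]
        )
    return "-".join(groups)


def from_proquint(text: str) -> int:
    """Decode a proquint back to the unsigned integer it represents."""
    value = 0
    for letter in text.replace("-", ""):
        if letter in CONSONANTS:
            value = (value << 4) | CONSONANTS.index(letter)
        elif letter in VOWELS:
            value = (value << 2) | VOWELS.index(letter)
        else:
            raise ValueError(f"not a proquint letter: {letter!r}")
    return value


print(to_proquint(10))               # babab-babap
print(from_proquint("babab-babar"))  # 11
```
]]></preformat>
      <p>Because both alphabets are listed in alphabetical order, sorting proquints of equal length as strings gives the same order as sorting the unsigned numbers they encode.</p>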
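      <table-wrap>
        <label>Fig. 2</label>
        <caption>
          <p>Integer to proquint (string).</p>
        </caption>
        <table>
          <thead>
            <tr><th>Integer</th><th>Equivalent string</th></tr>
          </thead>
          <tbody>
            <tr><td>0</td><td>"babab-babab"</td></tr>
            <tr><td>1</td><td>"babab-babad"</td></tr>
            <tr><td>2</td><td>"babab-babaf"</td></tr>
            <tr><td>3</td><td>"babab-babag"</td></tr>
            <tr><td>4</td><td>"babab-babah"</td></tr>
            <tr><td>5</td><td>"babab-babaj"</td></tr>
            <tr><td>Integer/MIN_VALUE</td><td>"mabab-babab"</td></tr>
            <tr><td>Integer/MAX_VALUE</td><td>"luzuz-zuzuz"</td></tr>
          </tbody>
        </table>
      </table-wrap>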
      <p>In a simple extension to the original algorithm, we have also
provided conversions from the Java short and long data types, which
provide either less typing or a larger identifier space respectively; conversions
are shown in Figure 3.</p>
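      <table-wrap>
        <label>Fig. 3</label>
        <table>
          <thead>
            <tr><th>Type</th><th>Value</th><th>Equivalent string</th></tr>
          </thead>
          <tbody>
            <tr><td>Short</td><td>0</td><td>"babab"</td></tr>
            <tr><td>Short</td><td>1</td><td>"babad"</td></tr>
            <tr><td>Short</td><td>2</td><td>"babaf"</td></tr>
            <tr><td>Long</td><td>0</td><td>"babab-babab-babab-babab"</td></tr>
            <tr><td>Long</td><td>1</td><td>"babab-babab-babab-babad"</td></tr>
            <tr><td>Long</td><td>Long/MIN_VALUE</td><td>"mabab-babab-babab-babab"</td></tr>
            <tr><td>Long</td><td>Long/MAX_VALUE</td><td>"luzuz-zuzuz-zuzuz-zuzuz"</td></tr>
          </tbody>
        </table>
      </table-wrap>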
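      <p>The Java types involved are signed, and the strings listed for Long/MIN_VALUE and Long/MAX_VALUE are consistent with encoding the two's-complement bit pattern of the value. The Python sketch below makes that assumption explicit. The function name <monospace>signed_to_proquint</monospace> and the <monospace>JAVA_BITS</monospace> table are hypothetical, and the vowel alphabet a, i, o, u is assumed from the proquint proposal.</p>
      <preformat preformat-type="code"><![CDATA[
```python
CONSONANTS = "bdfghjklmnprstvz"  # four bits per letter
VOWELS = "aiou"                  # two bits per letter

# Bit widths of the Java primitive types discussed in the text.
JAVA_BITS = {"short": 16, "int": 32, "long": 64}


def signed_to_proquint(value: int, java_type: str) -> str:
    """Encode a signed value through its two's-complement bit pattern."""
    bits = JAVA_BITS[java_type]
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the range of a Java {java_type}")
    pattern = value & ((1 << bits) - 1)  # reinterpret the value as unsigned
    groups = []
    for shift in range(bits - 16, -1, -16):  # most significant word first
        w = (pattern >> shift) & 0xFFFF
        groups.append(
            CONSONANTS[w >> 12] + VOWELS[(w >> 10) & 3]
            + CONSONANTS[(w >> 6) & 15] + VOWELS[(w >> 4) & 3]
            + CONSONANTS[w & 15]
        )
    return "-".join(groups)


print(signed_to_proquint(2, "short"))
print(signed_to_proquint(-(2 ** 63), "long"))
```
]]></preformat>
      <p>One consequence of this assumption is that negative values sort alphabetically after positive ones, since their top bit is set.</p>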
      <p>
        We note that the short range, at 2<sup>16</sup> numbers, is large enough for
most ontologies currently in operation. However, it is far too small
when combined with randomness: due to the birthday problem, it
is very likely to result in collisions even for small ontologies [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The long range, meanwhile, at 2<sup>64</sup> numbers, is likely to cope with
all ontological applications where the identifiers are allocated as a
result of human action; it has half the bit-length of a UUID (which
has a 2<sup>128</sup> range).
      </p>
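      <p>The size of this effect can be estimated with the standard birthday approximation, p ≈ 1 − exp(−n(n − 1) / 2N), for n random identifiers drawn from a space of N values. The Python sketch below applies it to the three ranges; the function name <monospace>collision_probability</monospace> and the ontology size of 300 terms are illustrative assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def collision_probability(n_ids: int, bits: int) -> float:
    """Approximate chance that at least two of n_ids random identifiers collide."""
    space = 2.0 ** bits
    # Birthday approximation: p ~ 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-n_ids * (n_ids - 1) / (2.0 * space))


# An assumed small ontology of 300 terms, in each identifier space.
for bits in (16, 64, 128):
    print(bits, collision_probability(300, bits))
```
]]></preformat>
      <p>Under this approximation, 300 random identifiers in the short range already collide with a probability of roughly one half, whereas a million identifiers in the long range collide with a probability below one in ten million.</p>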
      <p>Checking: We note that monotonic numeric identifiers suffer from a
final problem. As well as being unmnemonic, if a numeric ID is
mistyped or misread, it is very likely that the incorrect ID is still actually
a valid one; for instance, OBI:0001440 (“all pairs design”) and
OBI:0001404 (“genetic characteristics information”) are IDs which
differ only by the transposition of two adjacent digits.</p>
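      <p>
        A solution to this problem is well understood: the use of a
checksum. For the identitas library, we use the Damm algorithm [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
This algorithm is designed to operate on numbers, but it will also work on
proquints, as they can be converted to numbers. Examples of
valid and invalid numbers are shown in Figure 4.
      </p>
      <table-wrap>
        <label>Fig. 4</label>
        <table>
          <thead>
            <tr><th>Number</th><th>Validity</th></tr>
          </thead>
          <tbody>
            <tr><td>5724</td><td>valid</td></tr>
            <tr><td>231</td><td>invalid</td></tr>
            <tr><td>0</td><td>valid</td></tr>
            <tr><td>222</td><td>invalid</td></tr>
          </tbody>
        </table>
      </table-wrap>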
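      <p>A minimal Python sketch of Damm validation follows. It uses the base-10 quasigroup table commonly published for the algorithm, which we assume (but have not confirmed) matches the one used in identitas; the function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Commonly published order-10 totally anti-symmetric quasigroup (assumed table).
DAMM_TABLE = (
    (0, 3, 1, 7, 5, 9, 8, 6, 4, 2),
    (7, 0, 9, 2, 1, 5, 4, 8, 6, 3),
    (4, 2, 0, 6, 8, 7, 1, 3, 5, 9),
    (1, 7, 5, 0, 9, 8, 3, 4, 2, 6),
    (6, 1, 2, 3, 0, 4, 5, 9, 7, 8),
    (3, 6, 7, 4, 2, 0, 9, 5, 8, 1),
    (5, 8, 6, 9, 7, 2, 0, 1, 3, 4),
    (8, 9, 4, 5, 3, 6, 2, 0, 1, 7),
    (9, 4, 3, 8, 6, 1, 7, 2, 0, 5),
    (2, 5, 8, 1, 4, 3, 6, 7, 9, 0),
)


def damm_interim(number: int) -> int:
    """Fold the decimal digits of a non-negative number through the table."""
    interim = 0
    for digit in str(number):
        interim = DAMM_TABLE[interim][int(digit)]
    return interim


def is_valid(number: int) -> bool:
    """A number that already carries its check digit folds to zero."""
    return damm_interim(number) == 0


def with_check_digit(number: int) -> int:
    """Append the check digit, which equals the interim value."""
    return number * 10 + damm_interim(number)


for candidate in (5724, 231, 0, 222):
    print(candidate, "valid" if is_valid(candidate) else "invalid")
```
]]></preformat>
      <p>With this table, the local identifier 1440 would carry the check digit 6, giving 14406, and the transposed form 14046 is rejected; this is the kind of error that separates the two OBI identifiers discussed above.</p>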
      <p>Of course, the Damm algorithm incorporates a check digit and so
reduces the total space of valid identifiers, in this case by an
order of magnitude, which has implications when it is combined with
randomness. Under these circumstances, the larger numeric spaces
(int or long) are likely to be necessary.</p>
      <p>
        In this paper we present a critique of current semantics-free ontology
identifiers; monotonically increasing numbers have a number of
significant usability flaws which make them unsuitable as a default
option, and we present a series of alternatives. We have provided an
implementation of these alternatives, which can be freely combined.
We are now starting to integrate these into ontology development
environments such as Tawny-OWL [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and will later provide an
implementation for Protégé. This form of identifier space could
significantly improve the management of ontologies at very little
cost.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          Birthday problem - Wikipedia. https://en.wikipedia.org/wiki/Birthday_problem,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          EBI. URIGen.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Lord</surname>
          </string-name>
          .
          <article-title>The Semantic Web takes Wing: Programming Ontologies with Tawny-OWL</article-title>
          .
          <source>OWLED 2013</source>
          ,
          <year>March 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>James</given-names>
            <surname>Malone</surname>
          </string-name>
          , Robert Stevens, Simon Jupp, Tom Hancocks,
          <string-name>
            <given-names>Helen</given-names>
            <surname>Parkinson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Cath</given-names>
            <surname>Brooksbank</surname>
          </string-name>
          .
          <article-title>Ten simple rules for selecting a bio-ontology</article-title>
          . http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004743
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. Michael</given-names>
            <surname>Damm</surname>
          </string-name>
          .
          <article-title>Totally anti-symmetric quasigroups for all orders n ≠ 2, 6</article-title>
          . Discrete Mathematics,
          <volume>307</volume>
          (
          <issue>6</issue>
          ):
          <fpage>715</fpage>
          -
          <lpage>729</lpage>
          ,
          <year>Mar 2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          OBO Foundry Consortium.
          <article-title>OBO Foundry Principles</article-title>
          . http://obofoundry.org/wiki/index.php/OBO_Foundry_Principles,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Daniel Shawcross</given-names>
            <surname>Wilkerson</surname>
          </string-name>
          .
          <article-title>A proposal for proquints: Identifiers that are readable, spellable, and pronounceable</article-title>
          .
          <source>CoRR, abs/0901.4016</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>