=Paper= {{Paper |id=Vol-2807/paperJ |storemode=property |title=A BioSequence Ontology from Molecular Structure |pdfUrl=https://ceur-ws.org/Vol-2807/paperJ.pdf |volume=Vol-2807 |authors=Jona Thai,Michael Grüninger |dblpUrl=https://dblp.org/rec/conf/icbo/ThaiG20 }} ==A BioSequence Ontology from Molecular Structure== https://ceur-ws.org/Vol-2807/paperJ.pdf
                   A BioSequence Ontology from Molecular
                                 Structure
                                               Jona THAI a , Michael GRÜNINGER a
                  a Department of Mechanical and Industrial Engineering, University of Toronto, Ontario,

                                                            Canada M5S 3G8

                              Abstract. Gene sequences are a focal point of modern biological research, with
                              applications ranging from diagnostics to gene-driven drug design. An ontology’s
                              automated reasoning capability and traceable logic are sure to be an asset to these
                              efforts. However, current biomedical ontologies fail to achieve this potential. This
                              may partly be due to a lack of formal axiomatization of the underlying molecular
                              structure and mereology of gene sequences, despite an otherwise richly defined
                              vocabulary. In this paper, we propose a new BioSequence Ontology with explicit
                              axiomatization of the underlying path graph structure.

                              Keywords. mereology, sequence ontology, genes, molecules




                  1. Introduction

                  If we passed an arbitrary gene sequence as a query into a biomedical ontology, would
                  it be able to reason about its ancestry and recognize an open reading frame? Setting
                  aside the possibility of recognizing a well-known or pre-programmed sequence, most
                  biomedical ontologies simply do not have the axiomatization that is necessary and
                  sufficient to reason about concepts like parthood, betweenness or even directionality.
                  Moreover, gene sequences are curious things. They are simultaneously a molecule,
                  abstract information and (to a computer at least) a string of letters. Even if we choose
                  to focus on the molecular aspect of a gene sequence, it can be either linear or circular,
                  and the two cases differ substantially in how betweenness is understood. Hence, the
                  BioSequence Ontology proposed in this paper has three goals: (1) explicitly axiomatize
                  the mereotopology of gene sequences; (2) design a gene sequence ontology based on
                  molecular structure; and (3) provide a representation for both linear and circular gene
                  sequences.


                  2. Motivating Scenarios

                  What is a gene? It is commonly agreed to be a sequence of nucleotides in DNA or RNA
                  that encodes the synthesis of a gene product, including but not limited to RNA or protein.
                  To break down this definition, we have to look into the definitions of its key terms. For
                  instance, consider the word "sequence". Formally, a sequence is a collection of elements,
                  or members, in which repetitions are allowed and order matters. Hence, a sequence can
                  be built of sub-sequences, which are further built of sub-sub-sequences, and so forth.
                  This is important because it sheds light on which kinds of molecules are structurally
                  plausible and which are not.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Motivating Scenario 1: Sequence of Sequences

          (i)   ATG CAA TGG GGA AAT ACC AGG TCC GAA CTT ATT GAG GTA AGA CAG ATT TAA
                A TGC AAT GGG GAA ATA CCA GGT CCG AAC TTA TTG AGG TAA GAC AGA TTT AA
                AT GCA ATG GGG AAA TAC CAG GTC CGA ACT TAT TGA GGT AAG ACA GAT TTA A
          (ii)  A - T - G - C (four connected nucleotides)
          (iii) Lattice of connected subsequences: ATGC; ATG, TGC; AT, TG, GC; A, T, G, C
          (iv)  The same lattice with abstract labels: j; h, i; e, f, g; a, b, c, d

                              Figure 1. Mereology of gene sequences.


A DNA molecule is understood to have a double-helix, double-stranded structure. However,
these strands are complementary and essentially hold the same information, so it is
of interest to focus on the structure of a single strand. RNA also happens to be
single-stranded. Figure 1(i) shows an arbitrary sequence split into triplets (codons).
Figure 1(ii) then zooms into the connection between four nucleotides. As displayed in
Figure 1(iii), this set of nucleotides can be split into two codons, depending on how the
sequence is read, with "TG" as the overlapping pair of elements. However, "A" and "C" are
never considered as a pair of elements, as "TG" lies between them. In a classical
mereology, "AC" would be considered a possible pair. This shows that gene sequence
mereology differs subtly from classical mereology due to its emphasis on a connected
substructure.
Another word of interest is "encode". In this context, encoding is achieved through a
series of processes, namely transcription, splicing, and translation, to build a product,
namely RNA or protein. These processes are explored in more depth below.
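The reading frames of Figure 1(i) and the connected parts of Figure 1(iii) can be sketched in a few lines of Python. This is only an illustration under the assumption that a strand is represented as a plain string; the function names `reading_frames` and `connected_parts` are hypothetical and not part of the ontology.

```python
def reading_frames(seq: str) -> list[list[str]]:
    """Split seq into its three forward reading frames, as in Figure 1(i)."""
    frames = []
    for offset in range(3):
        # Leading fragment (0, 1 or 2 letters) before the first full triplet.
        head = [seq[:offset]] if offset else []
        # Triplets from the offset onward; the last chunk may be shorter than 3.
        body = [seq[i:i + 3] for i in range(offset, len(seq), 3)]
        frames.append(head + body)
    return frames


def connected_parts(seq: str) -> set[str]:
    """All contiguous (connected) subsequences of seq, collected as strings."""
    n = len(seq)
    return {seq[i:j] for i in range(n) for j in range(i + 1, n + 1)}


for frame in reading_frames("ATGCAATGG"):
    print(" ".join(frame))

parts = connected_parts("ATGC")
print("TG" in parts, "AC" in parts)  # True False
```

Because only contiguous slices are generated, "AC" never appears as a part of "ATGC", which is the difference from classical mereology noted above.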
Motivating Scenario 2: Splicing
Splicing is a process that converts precursor messenger RNA (pre-mRNA) into mature mRNA
through the removal of introns (non-coding regions) and the joining together of exons
(coding regions). It is a fundamental process in making proteins that takes place after
transcription and before translation.
Motivating Scenario 3: Transcription
In a nutshell, transcription is the process of copying DNA to RNA. This is achieved with
RNA polymerase (essentially an RNA maker) binding to a section of the DNA strand called
the promoter. Since DNA has a double-helix structure (two strands), the RNA is produced by
generating a strand complementary to the template DNA strand.
Motivating Scenario 4: Translation
Post-splicing, translation is the process of building an amino acid chain from the
transcribed RNA. This is achieved via tRNA attaching to codons, beginning at the start
codon, and assigning the corresponding amino acid.
                                      Figure 2. Splicing [12]

                             Figure 3. Transcription & Translation [5]

Coincidentally, processing the gene sequence as above results in an amino acid chain
or RNA, which is also a sequence. In other words, these processes can essentially be
interpreted as sequence manipulation to build other sequences.
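To make the "sequence manipulation" reading concrete, the three processes can be sketched as string operations. This is a simplified illustration, not a biological model: the template strand, the intron interval and the function names are invented for the example, and `CODON_TABLE` is a deliberately partial excerpt of the standard genetic code.

```python
# DNA template base -> complementary RNA base.
COMPLEMENT = {"A": "U", "T": "A", "G": "C", "C": "G"}

# Partial excerpt of the standard genetic code (only what the demo needs).
CODON_TABLE = {"AUG": "Met", "AUU": "Ile", "UUA": "Leu", "UAA": "Stop"}


def transcribe(template_3_to_5: str) -> str:
    """RNA complementary to a DNA template written 3'->5'; the result reads 5'->3'."""
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def splice(pre_mrna: str, introns: list[tuple[int, int]]) -> str:
    """Remove half-open [start, end) intron intervals and join the remaining exons."""
    exons, cursor = [], 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[cursor:start])
        cursor = end
    exons.append(pre_mrna[cursor:])
    return "".join(exons)


def translate(mrna: str) -> list[str]:
    """Read codons from the first AUG until a stop codon or the end of the string."""
    start = mrna.find("AUG")
    if start < 0:
        return []
    chain = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE.get(mrna[i:i + 3], "?")  # "?" marks codons missing from the excerpt
        if residue == "Stop":
            break
        chain.append(residue)
    return chain


pre_mrna = transcribe("TACGGGGTAAAATATT")  # invented template strand
mature = splice(pre_mrna, [(3, 7)])        # invented intron at positions 3..6
print(pre_mrna, mature, translate(mature))
```

Each step takes a sequence and returns a sequence, which is the property the ontology relies on.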

2.1. Semantic Requirements/Intended Models

Based on the above motivating scenarios, we can conclude that genes can be interpreted
and operated on as sequences. However, in reality, genes are also physical molecules
with chemical structures that are not necessarily sequence-like. In fact, the spatial
aspects of a molecule are akin to a graph, as studied in chemical graph theory.
This raises the question: how can we preserve the intuitive semantics of a sequence
while remaining true to its molecular properties, maximizing inference capabilities and
minimizing complexity? This motivates the following competency questions.
    1. What type of overlap is present between the ATP6 and ATP8 genes?
       An overlapping gene exists when an expressible nucleotide sequence of one gene
       overlaps with an expressible nucleotide sequence of another gene. This occurs
       naturally and allows a sequence to contribute to more than one gene product. This
       competency question tests the ontology’s ability to reason about parthood and
       directionality within a gene sequence.
    2. What do the DNA codons ATT and TTA code for?
       The DNA codon ATT codes for isoleucine whereas TTA codes for leucine.
       This competency question is designed to test the ontology’s ability to reason
       about directionality and sequence convexity (betweenness).
    3. What are the introns/exons in the given sequence?
       Introns are non-coding sequences of DNA that are removed via splicing. Exons
       are coding sequences of DNA that provide instructions for making a product
       (protein). This competency question tests the ontology’s ability to reason about
       and classify key components of gene structure with regard to function/process.
    4. What are the conserved sequences present in the histone H1 protein, given the
       amino acid sequence?
       Histone H1 is one of the five main histone protein families, which are components
       of chromatin in eukaryotic cells. It is simultaneously one of the most conserved
       and most variable histones across species. Conserved sequences are identical or
       similar sequences in nucleic acids (DNA and RNA) or proteins across species.
    5. What are conservative replacements for the conserved sequence in the gene coding
       for the H1 histone protein?
       A conservative replacement in an amino acid chain involves swapping an amino
       acid with another that has similar biochemical properties, e.g. glycine and alanine.
       This competency question tests the ontology’s ability to reason about molecular
       structure.
    6. Is this gene sequence circular?
       A circular gene sequence is essentially a sequence where the 5′ and 3′ ends have
       been joined together to form a loop. Examples of naturally occurring circular
       sequences are eccDNA and plasmids. This competency question tests the ontology’s
       ability to reason about betweenness and to correlate molecular structure with gene
       sequence.
    7. How are the 5′ UTR and the start codon related?
       The 5′ UTR (untranslated region) is the region at the 5′ end of an mRNA, that is,
       the end carrying a phosphate group attached to the 5′ carbon of the ribose ring,
       and it precedes the start of the open reading frame. The start codon is the first
       codon of a messenger RNA (mRNA) transcript translated by a ribosome during
       translation. The 5′ UTR is directly upstream of the start codon.
Based on these competency questions, we can construct ontological commitments to
drive the design of the ontology.
    1. The ontology must represent the properties of functional groups, connections
       between functional groups, connections between functional groups and sequences,
       components of sequences, and connections between sequences.
    2. Functional groups, molecules, and chains of molecules (sequences) are the
       primitives of the ontology. Functional groups are the lowest level of abstraction
       in this ontology.
    3. Axioms in the ontology are fully interpretable in the Molecular Structure
       Ontology (MoST) [3], as all sequences are molecules.
    4. The ontology must represent the 5′ and 3′ relations: the 5′ end of a sequence is
       the beginning of the open reading frame and the 3′ end is the end of the open
       reading frame. This introduces the concept of directionality into the ontology.
    5. The ontology must represent the "between" relation. Understanding what a gene
       sequence does requires understanding the concept of an open reading frame
       (between start and stop codons). Sequences often have overlapping reading
       frames so that a nucleotide sequence can code for more than one protein.


3. Existing Work

3.1. Evaluation with respect to Semantic Requirements

The idea of representing gene sequences through an ontology is not a novel one. In fact,
three notable biomedical ontologies are dedicated to this purpose: the Gene Ontology
(GO) [1], the Sequence Ontology (SO) [4] and the Molecular Sequence Ontology (MSO) [2].
The GO was a pioneer in creating an extensive, consistent vocabulary on gene sequences
and is widely used, particularly in annotation. Terms within the ontology are defined
in terms of each other, forming a structure akin to a directed acyclic graph in
mathematics. This ensures consistency in labeling.
However, the GO had no explicit mathematical definitions to represent parthood in a
gene sequence. This was addressed in the SO, which recognized the need for a properly
defined mereology to increase automated reasoning capacity. This topic was first
raised in a paper by Hoehndorf et al. [10], which explicitly stated the difference
between the molecular, abstract and syntactic aspects of a gene sequence and provided a
set of axioms. However, this axiomatization was disputed by the original creators of the
Sequence Ontology in a later paper [9]. Mungall then released another paper [8], which
recognized the deep similarity between time intervals and gene sequences, as both have
an underlying connected substructure. However, it gave no explicit axiomatization
of the mereology and ordering.
Development of the Molecular Sequence Ontology is fairly recent, and it is meant to be
the molecular counterpart to the Sequence Ontology’s genome annotation [2].

3.2. Relationship to Upper Ontologies

One may wonder why the MSO cannot simply be added as a feature to the SO. This is
perhaps largely due to the alignment of the SO and MSO with the Basic Formal Ontology
(BFO). In BFO, concepts are classified as continuant or occurrent. This works well in
many cases, but here it would be rather limiting to classify a gene sequence as only
either abstract information (sequence) or physical entity (molecule).
As highlighted earlier, we are committed to accurately representing gene sequences as
what they are: sequences and molecules. To achieve that, we need to build upon an
ontology of molecular chemistry, with rigorous definitions of molecular structure and
parthood. A candidate we considered was Chemical Entities of Biological Interest
(ChEBI) [6]; however, it did not quite fulfill our requirements for a mathematically
rigorous definition of parthood relations. After consideration, the Molecular Structure
Ontology (MoST) [3] best fit our requirements for expressivity and axioms about
structural properties. Its ability to represent functional groups in chemical graph
theory grants us the luxury of setting that as the BioSequence Ontology’s lowest level
of abstraction. This will improve the readability and usability of the BioSequence
Ontology. Coincidentally, MoST is independent of any upper ontology. As such, the
BioSequence Ontology is also foundationless, the effect of which will be studied.


4. The BioSequence Ontology

4.1. Overview

In this section we discuss the signature and design of the BioSequence Ontology
based on the requirements developed in the previous sections. This ontology is
axiomatized in Common Logic syntax and made available in a repository on COLORE.

4.2. MoST

What is DNA? It is a sequence of nucleotides, and hence it is also a molecule. To
accurately represent a molecule, we need to build upon a knowledge base of atoms and
bonds. To achieve this, we build upon the existing Molecular Structure Ontology (MoST).
It is a foundationless ontology that represents molecules as molecular graphs.
Since the focus of the BioSequence Ontology is the semantics of gene sequences, the
lowest level of abstraction is the functional group. Atoms and bond types
are beyond our scope, and this information is inherited from MoST. Hence, most
or all entities within the BioSequence Ontology can be represented as a skeleton, or
connected graph of atoms. This is achieved with the mol and tether relations.

Definition 1
  ∀x (mol(x, x))                                                                          (1)
  ∀x∀y (mol(x, y) ⊃ mol(y, x))                                                            (2)
  ∀g1∀g2∀b (tether(g1, g2, b) ≡ (group(g1) ∧ group(g2) ∧ bond(b) ∧ (g1 ≠ g2)
      ∧ ∃a1∃a2 (atom(a1) ∧ atom(a2) ∧ mol(a1, g1) ∧ mol(a2, g2)
      ∧ mol(a1, b) ∧ mol(a2, b) ∧ ¬mol(b, g1) ∧ ¬mol(b, g2))))                            (3)
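The tether condition in Definition 1 can be checked on a toy structure. The sketch below rests on a loud assumption: `mol(x, y)` is read as "one of x, y is contained in the other", which is reflexive and symmetric as axioms (1) and (2) require but is only one possible interpretation of the MoST relation. The `Entity` class, the helper constructors and the atom labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str          # "atom", "bond" or "group"
    atoms: frozenset   # identifiers of the atoms the entity contains


def atom(name):
    return Entity("atom", frozenset([name]))


def bond(a, b):
    return Entity("bond", frozenset([a, b]))


def group(*names):
    return Entity("group", frozenset(names))


def mol(x, y):
    """ASSUMED reading of mol: one entity is contained in the other."""
    return x.atoms <= y.atoms or y.atoms <= x.atoms


def tether(g1, g2, b, atoms):
    """Axiom (3): b joins an atom of g1 to an atom of g2 and lies in neither group."""
    if not (g1.kind == "group" and g2.kind == "group" and b.kind == "bond" and g1 != g2):
        return False
    if mol(b, g1) or mol(b, g2):
        return False
    return any(
        mol(a1, g1) and mol(a2, g2) and mol(a1, b) and mol(a2, b)
        for a1 in atoms for a2 in atoms
    )


# Hypothetical atom labels for a sugar group and a phosphate group.
atoms = [atom(n) for n in ("C3", "C4", "C5", "P", "O")]
sugar = group("C3", "C4", "C5")
phosphate = group("P", "O")
link = bond("C5", "O")     # crosses from the sugar to the phosphate
inner = bond("C3", "C4")   # lies entirely inside the sugar

print(tether(sugar, phosphate, link, atoms))   # True
print(tether(sugar, phosphate, inner, atoms))  # False
```

The second call fails because the bond is internal to one group, which is what the two negated `mol` conjuncts in axiom (3) rule out.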

4.3. Signature

With the molecular chemistry background knowledge in hand, the natural next step in
building the ontology is to define the primitives, or basic building blocks, of the
ontology. Every entity in the ontology will be defined in terms of these primitives.
Intuitively, these primitives are skeleton, sequence, nucleotide, functional group,
triplet, coding sequence, protein and codon.

Definition 2
  ∀x nucleotide(x) ⊃ skeleton(x)                                                          (4)
  ∀x sequence(x) ≡ skeleton(x)                                                            (5)
  ∀x nucleotide(x) ≡ ∃s1∃s2∃s3∃b1∃b2 skeleton(x) ∧ nucleobase(s1) ∧ sugar(s2)
      ∧ phosphoric_acid(s3) ∧ mol(s1, x) ∧ mol(s2, x) ∧ mol(s3, x)
      ∧ tether(s1, s2, b1) ∧ tether(s2, s3, b2)                                           (6)
  ∀x sequence(x) ≡ ∃n1∃n2∃n3∃b1∃b2 nucleotide(n1) ∧ nucleotide(n2) ∧ nucleotide(n3)
      ∧ tether(n1, n2, b1) ∧ tether(n2, n3, b2)                                           (7)
  ∀x triplet(x) ≡ ∃n1∃n2∃n3∃b1∃b2 nucleotide(n1) ∧ nucleotide(n2) ∧ nucleotide(n3)
      ∧ tether(n1, n2, b1) ∧ tether(n2, n3, b2)                                           (8)
  ∀c codon(c) ≡ (∃x, y, z) nucleotide(x) ∧ nucleotide(y) ∧ nucleotide(z)
      ∧ x ≠ y ∧ x ≠ z ∧ y ≠ z ∧ part(x, c) ∧ part(y, c) ∧ part(z, c)
      ∧ ∃y1∃z1 amino_acid(y1) ∧ rf(z1) ∧ triplet(c) ∧ mol(c, z1) ∧ codes_for(c, y1)
      ∧ (∀w) (nucleotide(w) ∧ part(w, c) ⊃ ((w = x) ∨ (w = y) ∨ (w = z)))                 (9)
  ∀x cds(x) ≡ ∃y protein(y) ∧ sequence(x) ∧ translates_to(x, y)                          (10)
  ∀x rf(x) ≡ ∃y∃z nucleotide(y) ∧ nucleotide(z) ∧ sequence(x) ∧ 5′(y, x) ∧ 3′(z, x)      (11)
  ∀x orf(x) ≡ ∃y∃z rf(x) ∧ protein(y) ∧ RNA(z)
      ∧ (transcribes_to(x, z) ∨ translates_to(z, y))                                     (12)

Next, relationships between the entities need to be defined. The relationship
predicates are between, overlap and tether.

Definition 3
  ∀x∀y tandem_overlap(x, y) ≡ sequence(x) ∧ sequence(y) ∧ x ≠ y
      ∧ ∃n1∃n2 nucleotide(n1) ∧ nucleotide(n2)
      ∧ 3′(n1, x) ∧ 5′(n2, y) ∧ overlap(n1, n2)                                          (13)
  ∀x∀y convergent_overlap(x, y) ≡ sequence(x) ∧ sequence(y) ∧ x ≠ y
      ∧ ∃n1∃n2 nucleotide(n1) ∧ nucleotide(n2)
      ∧ 3′(n1, x) ∧ 3′(n2, y) ∧ overlap(n1, n2)                                          (14)
  ∀x∀y divergent_overlap(x, y) ≡ sequence(x) ∧ sequence(y) ∧ x ≠ y
      ∧ ∃n1∃n2 nucleotide(n1) ∧ nucleotide(n2)
      ∧ 5′(n1, x) ∧ 5′(n2, y) ∧ overlap(n1, n2)                                          (15)
  ∀x∀y in_phase(x, y) ≡ sequence(x) ∧ sequence(y) ∧ (rf(x) = rf(y))                      (16)
  ∀x∀y out_phase(x, y) ≡ sequence(x) ∧ sequence(y) ∧ (rf(x) ≠ rf(y))                     (17)
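The three overlap types in Definition 3 can be illustrated on genomic intervals. The sketch makes several assumptions that are not in the axioms: a sequence is an inclusive coordinate interval with a strand, an end "overlaps" when it falls inside the other interval, tandem overlaps are taken to lie on the same strand and convergent or divergent ones on opposite strands (the usual biological usage), and two sequences are "in phase" when their 5′ ends agree modulo 3. `Gene`, `classify_overlap` and the coordinates are hypothetical.

```python
from typing import NamedTuple, Optional


class Gene(NamedTuple):
    start: int   # inclusive coordinate
    end: int     # inclusive coordinate, start <= end
    strand: str  # "+" or "-"

    @property
    def five_prime(self) -> int:
        return self.start if self.strand == "+" else self.end

    @property
    def three_prime(self) -> int:
        return self.end if self.strand == "+" else self.start


def inside(pos: int, g: Gene) -> bool:
    return g.start <= pos <= g.end


def classify_overlap(x: Gene, y: Gene) -> Optional[str]:
    """Return 'tandem', 'convergent', 'divergent', 'other', or None if disjoint."""
    if x.end < y.start or y.end < x.start:
        return None
    same_strand = x.strand == y.strand
    if not same_strand and inside(x.three_prime, y) and inside(y.three_prime, x):
        return "convergent"   # both 3' ends lie in the shared region
    if not same_strand and inside(x.five_prime, y) and inside(y.five_prime, x):
        return "divergent"    # both 5' ends lie in the shared region
    if same_strand and (
        (inside(x.three_prime, y) and inside(y.five_prime, x))
        or (inside(y.three_prime, x) and inside(x.five_prime, y))
    ):
        return "tandem"       # the 3' end of one meets the 5' end of the other
    return "other"


def in_phase(x: Gene, y: Gene) -> bool:
    """ASSUMED reading of rf(x) = rf(y): same strand and 5' ends congruent modulo 3."""
    return x.strand == y.strand and (x.five_prime - y.five_prime) % 3 == 0


print(classify_overlap(Gene(1, 10, "+"), Gene(8, 20, "+")))  # tandem
print(classify_overlap(Gene(1, 10, "+"), Gene(8, 20, "-")))  # convergent
print(classify_overlap(Gene(1, 10, "-"), Gene(8, 20, "+")))  # divergent
```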

4.4. Mereology

A major goal of the BioSequence Ontology is to explicitly axiomatize the parthood
relations between entities of the ontology. The most distinctive feature of the mereology
of gene sequences is that they are composed of convex intervals; in other words, the
mereological sum of sequence A and sequence B exists only if A and B are already
connected through a bond. We achieve this by modeling gene sequences after path graphs,
defined below [7].


Definition 4 Let H = ⟨V, E⟩ be a simple graph.
     H is a path iff there exists a sequence x₁, ..., xₙ such that (xᵢ, xᵢ₊₁) ∈ E.
     H is connected iff for any two vertices x, y ∈ V, there exists an induced subgraph
that is a path containing x, y.
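Definition 4 can be tested mechanically on small graphs. The sketch below assumes a finite simple undirected graph given as a vertex list and an edge list with each edge listed once; it uses the standard characterization of a path as a connected graph with |V| − 1 edges and maximum degree 2. The helper names are hypothetical.

```python
from collections import deque


def is_connected(vertices, edges) -> bool:
    """Breadth-first search from an arbitrary vertex; the empty graph counts as not connected here."""
    vertices = set(vertices)
    if not vertices:
        return False
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(vertices))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u] - seen:
            seen.add(w)
            queue.append(w)
    return seen == vertices


def is_path(vertices, edges) -> bool:
    """Connected, |V| - 1 edges, and no vertex of degree above 2."""
    vertices, edges = set(vertices), list(edges)
    degree = {v: 0 for v in vertices}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return (
        is_connected(vertices, edges)
        and len(edges) == len(vertices) - 1
        and all(d <= 2 for d in degree.values())
    )


def induced(edges, keep):
    """Vertex set and edge list of the subgraph induced by keep."""
    keep = set(keep)
    return keep, [(u, v) for u, v in edges if u in keep and v in keep]


# The four nucleotides A-T-G-C of Figure 1, numbered 0..3.
V, E = [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)]
print(is_path(V, E))                     # True
print(is_connected(*induced(E, {1, 2})))  # True:  "TG" is connected
print(is_connected(*induced(E, {0, 3})))  # False: "A" and "C" alone are not
```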


     In a nutshell, the following definitions illustrate the chain-like connected-substructure
of a sequence.


Definition 5 A partial ordering P = ⟨V, ≤⟩ is properly chain semimodular iff¹

     1. P is atom-height, that is, the cardinality of all maximal chains in P is equal to the
        cardinality of the set of atoms in P;
     2. for each x ∈ V, ⟨U^P[x], ≤⟩ is an upper semimodular lattice:

         (a) any two elements y, z have a least upper bound and a greatest lower bound
             in U^P[x];
         (b) if z covers the greatest lower bound of z and y, then the least upper bound of
             z and y covers y;

     3. for each x ∈ V, ⟨U^P[x], ≤⟩ ≅ m × n.

M^proper_chain_semimodular denotes the class of properly chain semimodular partial orderings.


     The following theorem verifies the mereotopology by establishing a bijection
between our definition of parthood and the models of our ontology for gene sequences.
We obtain the axioms of our ontology from T_cisco, an ontology for the
mereotopology of connected substructures [7].


Theorem 1 Let T_cico be the extension of T_em_mereology ∪ T_ub_mereology with the sentences:

  (∀u, x) ppart(u, x) ⊃ (∃y) atom(y) ∧ part(y, x)                                        (18)
  (∀x, y) covers(x, y) ⊃ (∃z) atom(z) ∧ ppart(z, x) ∧ ¬part(z, y)                        (19)
  (∀x, y, z, u) covers(x, y) ∧ atom(z) ∧ ppart(z, x) ∧ ¬part(z, y)
      ∧ atom(u) ∧ ppart(u, x) ∧ ¬part(u, y) ⊃ (z = u)                                    (20)
  (∀x, a, b) part(x, a) ∧ part(x, b) ⊃
      (∃z) part(x, z) ∧ (∀u) (part(z, u) ≡ (part(a, u) ∧ part(b, u)))                    (21)
  (∀x, a, b) part(x, a) ∧ part(x, b) ⊃
      (∃z) part(x, z) ∧ (∀u) (part(u, z) ≡ (part(u, a) ∧ part(u, b)))                    (22)
  (∀p, x, y) atom(p) ∧ part(x, y) ∧ ¬part(p, y) ⊃
      (∃z) part(x, z) ∧ part(p, z) ∧ part(y, z) ∧ covers(z, y)                           (23)
  (∀u, x) ppart(u, x) ⊃ (∃y, z) covers(x, y) ∧ covers(x, z) ∧ y ≠ z                      (24)
  (∀u, x, y, z, w) ppart(u, x) ∧ covers(x, y) ∧ covers(x, z) ∧ covers(x, w)
      ⊃ (y = z ∨ y = w ∨ z = w)                                                          (25)
  (∀x, y, z, w) covers(y, x) ∧ covers(z, x) ∧ covers(w, x) ⊃ (y = z ∨ y = w ∨ z = w)     (26)
  (∀x, y, z, w) covers(y, x) ∧ covers(z, x) ∧ y ≠ z ∧ overlaps(w, x)
      ⊃ (∃u) atom(u) ∧ ¬part(u, w) ∧ ¬part(u, x)                                         (27)

There exists a bijection ϕ : Mod(T_cico) → M^proper_chain_semimodular such that
(x, y) ∈ part^M iff x ∈ L^P[y].

   ¹ We use the following notation: U^P[x] = {y : x ≤ y} and U^P[X] = ⋃_{x∈X} U^P[x].
It is important to note that this is a nonclassical mereology, since sums do not exist
for every pair of underlapping elements. This is distinct from all earlier approaches to
parthood that have been taken in biomedical ontologies [11].
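The non-classical behaviour can be observed on the intervals of a four-element path. The sketch below assumes that connected parts are modelled as index intervals (i, j) ordered by containment; it counts covers in the way axioms (24) to (26) constrain them and shows a pair of parts whose least upper bound contains atoms that neither part has, so no mereological sum exists. All function names are hypothetical.

```python
def intervals(n):
    """All index intervals (i, j) with 0 <= i <= j < n: the connected parts of a path."""
    return [(i, j) for i in range(n) for j in range(i, n)]


def part(x, y):
    return y[0] <= x[0] and x[1] <= y[1]


def ppart(x, y):
    return part(x, y) and x != y


def covers(x, y, elems):
    """x covers y: y is a proper part of x with nothing strictly in between."""
    return ppart(y, x) and not any(ppart(y, z) and ppart(z, x) for z in elems)


def atoms_of(x):
    return set(range(x[0], x[1] + 1))


def least_upper_bound(x, y):
    """The smallest interval containing both x and y (their hull)."""
    return (min(x[0], y[0]), max(x[1], y[1]))


def has_sum(x, y):
    """A sum exists only if the hull adds no atoms beyond those of x and y."""
    return atoms_of(least_upper_bound(x, y)) == atoms_of(x) | atoms_of(y)


elems = intervals(4)  # the parts of A-T-G-C
covered = {x: [y for y in elems if covers(x, y, elems)] for x in elems}
covering = {x: [y for y in elems if covers(y, x, elems)] for x in elems}

print(len(elems))                # 10
print(has_sum((0, 1), (2, 3)))   # True:  "AT" and "GC" are adjacent
print(has_sum((0, 0), (3, 3)))   # False: "A" and "C" have "TG" between them
```

In this model every non-atomic interval covers exactly two parts and every interval is covered by at most two, which is consistent with the cardinality constraints in axioms (24) to (26).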

4.5. Directionality

A DNA molecule is understood to be composed of two strands that are complementary to
each other yet contain the same information. They are essentially inverses of each other.
Moreover, as mentioned in Competency Question 2, ATT codes for isoleucine whereas
TTA codes for leucine. This illustrates why explicit rules for representing
directionality are essential. Because the topic is nuanced, we must first define the
related vocabulary. The following definitions specify the start (5′) and stop (3′) ends
of a sequence, and the notions of upstream and downstream:
  ∀x∀y nucleotide(x) ∧ sequence(y) ∧ 5′(x, y) ≡ ∀z phosphoric_acid(z) ∧ mol(z, x) ∧ end(z, x)
                                                                                             (28)
  ∀x∀y nucleotide(x) ∧ sequence(y) ∧ 3′(x, y) ≡ ∀z hydroxyl(z) ∧ mol(z, x) ∧ end(z, x)       (29)
  ∀x∀y sequence(x) ∧ sequence(y) ∧ downstream(x, y) ≡
      ∃n1∃n2∃b nucleotide(n1) ∧ nucleotide(n2) ∧ bond(b) ∧ 5′(n1, x) ∧ (n2, y) ∧ tether(x, y, b)
                                                                                             (30)
  ∀x∀y sequence(x) ∧ sequence(y) ∧ upstream(y, x) ≡
      ∃n1∃n2∃b nucleotide(n1) ∧ nucleotide(n2) ∧ bond(b) ∧ 5′(n1, x) ∧ (n2, y) ∧ tether(x, y, b)
                                                                                             (31)
However, interpreting these definitions requires a concept of betweenness. Circular
betweenness differs slightly from non-circular betweenness, as the beginning and end of
the sequence are not as explicit.
Definition 6 Non-Circular Betweenness
  ∀a∀b∀c between(a, b, c) ⊃ between(c, b, a)                                             (32)
  ∀a∀b∀c∀d between(a, b, d) ∧ between(b, c, d) ⊃ between(a, b, c)                        (33)
  ∀a∀b∀c∀d between(a, b, c) ∧ between(b, c, d) ∧ (b ≠ c) ⊃ between(a, b, d)              (34)
  ∀a∀b∀c∀d between(a, b, d) ∧ between(a, c, d) ⊃ between(a, b, c) ∨ between(a, c, b)     (35)
  ∀a∀b∀c∀d between(a, b, c) ∧ between(a, b, d) ∧ (a ≠ b)
      ⊃ between(a, c, d) ∨ between(a, d, c)                                              (36)

Definition 7 Circular Betweenness
  ∀x∀y∀z C(x, y, z) ⊃ ¬C(z, y, x)                                                        (37)
  ∀x∀y∀z C(x, y, z) ⊃ C(y, z, x)                                                         (38)
  ∀x∀y∀z∀w C(x, y, z) ∧ C(x, z, w) ⊃ C(x, y, w)                                          (39)
  ∀x∀y∀z∀u∀v C(x, y, z) ∧ C(x, u, v) ⊃ C(x, u, y) ∨ C(x, y, u) ∨ (u = y)                 (40)
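Both betweenness relations can be checked by brute force on small intended models. The sketch assumes that positions on a linear strand are integers with `between(a, b, c)` meaning a ≤ b ≤ c or c ≤ b ≤ a, and that positions on a circular strand are residues modulo n with `C(x, y, z)` meaning that y is met strictly before z when travelling from x in the 5′ to 3′ direction. The final disjunct of the fourth circular axiom is taken to be u = y, the case in which u and y coincide. These interpretations are choices made for illustration.

```python
from itertools import product


def between(a, b, c):
    """Linear betweenness on integer positions (non-strict)."""
    return a <= b <= c or c <= b <= a


def circular(x, y, z, n):
    """Cyclic order on positions modulo n: from x, y comes strictly before z."""
    return 0 < (y - x) % n < (z - x) % n


def implies(p, q):
    return (not p) or q


def linear_axioms_hold(domain):
    B = between
    return all(
        implies(B(a, b, c), B(c, b, a))
        and implies(B(a, b, d) and B(b, c, d), B(a, b, c))
        and implies(B(a, b, c) and B(b, c, d) and b != c, B(a, b, d))
        and implies(B(a, b, d) and B(a, c, d), B(a, b, c) or B(a, c, b))
        and implies(B(a, b, c) and B(a, b, d) and a != b, B(a, c, d) or B(a, d, c))
        for a, b, c, d in product(domain, repeat=4)
    )


def circular_axioms_hold(n):
    def C(x, y, z):
        return circular(x, y, z, n)

    return all(
        implies(C(x, y, z), not C(z, y, x))
        and implies(C(x, y, z), C(y, z, x))
        and implies(C(x, y, z) and C(x, z, w), C(x, y, w))
        and implies(C(x, y, z) and C(x, u, v), C(x, u, y) or C(x, y, u) or u == y)
        for x, y, z, w, u, v in product(range(n), repeat=6)
    )


print(linear_axioms_hold(range(4)), circular_axioms_hold(5))
print(between(3, 0, 1), circular(3, 0, 1, 4))  # False True
```

The last line shows the difference between the two settings: position 0 does not lie between 3 and 1 on a linear strand of four nucleotides, but it does on a circular one, where the order wraps around.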


5. Evaluation

With the proposal of these axioms, questions inevitably arise. Namely, are these the
right models, and why are these the right models? The former, also known as
verification, is shown in Theorem 1. Theorem 1 formally demonstrates that the mereology
of our models of gene sequences is equivalent to that of a partial linear ordering, or
of the connected subgraphs of a path graph as described in Definition 4. Validation, or
why these are the right models, can be achieved by successfully answering the competency
questions posed in Section 2.1. To achieve this, we rephrase and encode the
competency questions as first-order logic statements. Each statement is then set as the
goal in an automated theorem prover, e.g. Prover9.
    1. What type of overlap is present between the ATP6 and ATP8 genes? Naturally, ATP6
       and ATP8 must first be defined ontologically. ATP6 and ATP8 nucleotide sequence
       data will be processed via a parsing script and then defined as terms within
       our ontology. For example,

         ∀x ATP6(x) ≡ ∃a AAT(a) ∧ mol(a, x) ∧ ∃c∃b1∃t∃b2 CTG(c) ∧ TTC(t) ∧ bond(b1)
           ∧ bond(b2) ∧ mol(c, x) ∧ mol(t, x) ∧ tether(a, c, b1) ∧ tether(c, t, b2) ∧ ...    (41)

       ATP6 is 681 base pairs in length, so for reasons of space we do not
       include the full definition here. A similar definition is provided for ATP8. With
       classes for ATP6 and ATP8 defined, they can be used for logical inference.
       The competency question asks whether there exists some sequence of nucleotides
       that is part of ATP6, and a separate sequence of nucleotides that is part
       of ATP8, such that the two overlap as defined in the previous sections.
       In first-order logic, this is expressed as:

           ∀x∀y ATP6(x) ∧ ATP8(y) ⊃ ∃s1∃s2 sequence(s1) ∧ sequence(s2)
                 ∧ part(s1, x) ∧ part(s2, y) ∧ overlap(s1, s2)                     (42)

    2. What do the DNA codons ATT and TTA code for?
       Again, data regarding ATT and TTA will be interpreted as relations within the
       ontology. For instance,

         ∀x ATT(x) ≡ ∃a∃t1∃t2∃b1∃b2 adenine(a) ∧ thymine(t1) ∧ thymine(t2) ∧ bond(b1)
           ∧ bond(b2) ∧ codon(x) ∧ mol(a, x) ∧ mol(t1, x) ∧ mol(t2, x)
           ∧ tether(a, t1, b1) ∧ tether(t1, t2, b2)                                (43)

       Then, the question can be rewritten as:

         ∀x∀y ATT(x) ∧ TTA(y) ⊃ ∃z protein(z) ∧ codes_for(x, z) ∧ codes_for(y, z)   (44)

    3. What are the introns/exons in the given sequence? Introns and exons are
       nucleotide sequences that are removed and not removed, respectively, during RNA
       splicing. Processes such as splicing, transcription and translation will be further
       defined in a follow-up ontology (the BioSequence Process Ontology).

         ∀x sequence(x) ⊃ ∃i∃e intron(i) ∧ exon(e) ∧ part(i, x) ∨ part(e, x)       (45)

    4. What are the conserved sequences present in the histone H1 protein, given the
       amino acid sequence? As in the above examples, histoneh1(x) will be defined as
       a specific class within the ontology. The question can then be rewritten as:

         ∀x histoneh1(x) ⊃ ∃s conserved_sequence(s) ∧ part(s, x)                   (46)

    5. What are conservative replacements for the conserved sequence in the gene coding
       for the H1 histone protein? Conserved sequences are sequences
       in nucleic acids that are similar or identical across species. In other words, does
       there exist a sequence that serves the same role as the conserved sequence in the
       H1 histone protein? This can be rewritten as:

         ∀s∀h conserved_sequence(s) ∧ histoneh1(h) ∧ part(s, x)
           ⊃ ∃y conserved_sequence(y) ∧ s ≠ y ∧ (part(y, x) ≠ part(s, x))          (47)

    6. Is this gene sequence circular? In other words, this asks whether the starting and
       ending sequences somehow overlap or connect. This can be phrased as:

         ∀x sequence(x) ⊃ ∃n1∃n2 5′(n1, x) ∧ 3′(n2, x) ∧ mol(n1, n2)               (48)

    7. How are the 5′ UTR and the start codon related? In other words, do the 5′ UTR of a
       sequence and the start codon overlap somehow?

         ∀x start_codon(x) ⊃ ∃y 5′UTR(y) ∧ mol(y, x)                               (49)
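The parsing step mentioned for the first competency question could look roughly as follows. This is a minimal sketch, not the script used by the authors: it assumes the input is a DNA string whose length is a multiple of three, names each codon class after its letters as in (41), and emits the definition in the notation of this paper rather than in any prover's input syntax. The function `class_axiom` and the class name `Toy` are hypothetical.

```python
def class_axiom(name: str, seq: str) -> str:
    """Build a class definition in the style of (41) from a nucleotide string."""
    if len(seq) < 3 or len(seq) % 3:
        raise ValueError("sequence length must be a positive multiple of 3")
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    cvars = [f"c{i + 1}" for i in range(len(codons))]       # one variable per codon
    bvars = [f"b{i + 1}" for i in range(len(codons) - 1)]   # one bond per adjacent pair
    quantifiers = "".join(f"∃{v}" for v in cvars + bvars)
    conjuncts = [f"{codon}({v}) ∧ mol({v}, x)" for codon, v in zip(codons, cvars)]
    conjuncts += [f"bond({b})" for b in bvars]
    conjuncts += [
        f"tether({cvars[i]}, {cvars[i + 1]}, {b})" for i, b in enumerate(bvars)
    ]
    return f"∀x {name}(x) ≡ {quantifiers} " + " ∧ ".join(conjuncts)


# The three codons shown in (41), used here as a toy input rather than real gene data.
print(class_axiom("Toy", "AATCTGTTC"))
```

Generating the definitions mechanically keeps the codon order and the tether chain consistent with the input sequence, which matters once the sequences run to hundreds of base pairs.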


6. Summary

We began this paper by claiming that the mereology of gene sequences differs from
classical mereology due to its convex nature, which has not been explicitly addressed
by other biomedical ontologies. We then provided mathematically rigorous definitions
to represent this, and verified them with a theorem. This enables our ontology to
represent and reason about structural properties such as directionality, betweenness and
parthood while maintaining the semantics of a sequence. The explicit axiomatization of
circular and linear betweenness also provides a representation for circular gene
sequences, which is lacking in earlier approaches. This will be beneficial in
applications such as gene-driven drug design. Reasoning on molecular chemistry is
achieved by building upon the Molecular Structure Ontology (MoST), which is
coincidentally unaligned with an upper ontology. This was a conscious design decision to
maximize expressivity and to simultaneously represent gene sequences as a physical
molecule and an abstract information entity within the same ontology. Aligning with
specific ontologies based on design necessity instead of an arbitrary upper ontology
proved beneficial in this case, and could serve as a case study for other niche
ontologies that do not fit perfectly into a larger framework. For future work, we will
explore developing a BioSequence Process Ontology to formalize definitions for processes
such as transcription, translation and splicing.


References

 [1] Ashburner, M., Ball, C.A., Blake, J.A., et al.: Gene ontology: tool for the unification of biology. The
     Gene Ontology Consortium. (2000)
 [2] Bada, M., Eilbeck, K.: Efforts toward a more consistent and interoperable sequence ontology. In: ICBO.
     (2012)
 [3] Chui, C., Gruninger, M.: A Molecular Structure Ontology for Medicinal Chemistry. In: Proc. of the 10th
     Int. Conference on Formal Ontologies in Information Systems (FOIS2016), IOS Press (2016) 317–330
 [4] Eilbeck, K., Lewis, S.E., Mungall, C.J., et al.: The Sequence Ontology: a tool for the unification of
     genome annotations. (2005)
 [5] Khan Academy: Transcription & translation (2020) [Online; accessed April 27, 2020].
 [6] Degtyarenko, K., de Matos, P., et al.: ChEBI: a database and ontology for chemical entities of
     biological interest. Nucleic Acids Research (2007)
 [7] Gruninger, M., Chui, C., et al.: A mereology of connected structures. In: FOIS. (2020)
 [8] Mungall, C.J.: Formalization of Genome Interval Relations. bioRxiv (2014)
 [9] Mungall, C.J., Batchelor, C., Eilbeck, K.: Evolution of the Sequence Ontology terms and relationships.
     J Biomed Inform 1 (2011) 87–93
[10] Hoehndorf, R., Kelso, J., Herre, H.: The ontology of biological sequences. BMC Bioinformatics 10 (2009)
[11] Schulz, S., Kumar, A., Bittner, T.: Biomedical ontologies: What part-of is and isn’t. 39 (2006)
[12] Wikipedia, the free encyclopedia: Splicing (2020) [Online; accessed April 27, 2020].