<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barry Smith</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University at Buffalo</institution>
          ,
          <addr-line>Buffalo, NY 14051</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>What follows is a comment on the 2022 ICBO paper “What AlphaFold teaches us about deep learning with prior knowledge” by Jobst Landgrebe. It seeks to throw light on the sense in which the prior knowledge used by AlphaFold is to be understood in ontological terms.</p>
      </abstract>
      <kwd-group>
        <kwd>AlphaFold</kwd>
        <kwd>AI</kwd>
        <kwd>ontology</kwd>
        <kwd>Protein Ontology</kwd>
        <kwd>UniProtKB</kwd>
        <kwd>protein structure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction </title>
      <p>
        The AI models developed in the life sciences
have a much lower predictive power than the
models developed in domains such as engineering
or physics. Why is this so? In a paper for this
conference, Jobst Landgrebe analyzes AlphaFold
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], one of the few examples of applying AI to
biology that is predictively successful. Landgrebe
shows that this success turns on the fact that
AlphaFold is able to use prior knowledge about
protein folding that has already been assembled
through experimental efforts invested in the
decoding of protein sequences.
      </p>
      <p>For a cluster of such decoded sequences,
AlphaFold can be applied to identify certain
patterns in each protein homologous to the cluster,
which then allow it to make highly successful
predictions about its structure. This is remarkable
given that, before AlphaFold, a very low degree
of success had been attained in making protein
folding predictions.</p>
      <p>
        As Landgrebe explains, the prior knowledge
about protein folding ingested as input into the
AlphaFold machine learning algorithm takes two
forms: 1) as protein structure data (CIF files); and
2) as knowledge about protein homology groups.
These are the decisive factors that enable the
predictive success of the algorithm, which takes
only the protein’s amino acid sequence as input
and produces the heavy atom angle information as output.
Like other prediction algorithms in the field, the
ability of the AlphaFold model to predict protein
structure can be applied only for proteins
homologous to those with established structures.
This ability depends on a part of the implicit
model capturing the relationship between these
known folding structures and sequence clusters.
This is why the model succeeds; but also why it
fails to predict structures for those molecules
which are not homologous to proteins for which
the structures have already been determined using
classical protein crystallography. The model
therefore cannot create new protein folding
knowledge – this must still be obtained from
experiments, which can take several years per
protein or fail altogether after unsuccessful efforts
(as is often the case, for example, for
transmembrane domains of proteins). On the other
hand, AlphaFold stands out as compared to other
structure prediction algorithms because it
achieves high accuracy even for sequences with
fewer homologous sequences [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. The role of ontology </title>
      <p>There are two meanings of the term
“ontology”: (i) as a branch of classical
metaphysics dealing with the fundamental
structure of the world, and (ii) as a scientific
discipline that has developed over the last 30 years,
and which deals with organizing data and
information about the world in a structured form
to enable various sorts of data exploitation, for
example in what is called ‘data science’. A
significant fraction of the work carried out today
under the label of ontology in sense (ii) is
influenced by our understanding of ontology in
the more traditional sense (i), above all in its use
of the distinction between universals, organized in
taxonomical hierarchies, and their respective
instances – typically entities such as cells and
molecules, which exist in time and space.</p>
      <p>
        When Landgrebe claims that the prior
knowledge that was used by AlphaFold is a form
of ontology, he is referring to the CIF files, each
of which represents the structure of the protein it
describes. In what sense, then, is a CIF file an
ontology, or a part of an ontology? This is a deep
question, which takes us to the very foundations
of sense (ii) ontology, namely to the distinction
between universals and instances. The terms
protein, amino acid chain, histone, chordin, and so
forth, are unquestionably ontological – they are all
terms from the Protein Ontology [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. But so also,
and for the same reason, is the term:
      </p>
      <p>PR:P06733-2: alpha-enolase isoform hMBP-1
(human)</p>
      <p>whose position in the PRO hierarchy is illustrated
in Figure 1. This term is defined in PRO as:</p>
      <p>An alpha-enolase (human) that is a translation
product of some mRNA whose exon structure and
start site selection renders it capable of giving rise
to a protein with the amino acid sequence
represented by UniProtKB:P06733-2.</p>
      <p>The UniProt sequence mentioned here is an example
sequence for a certain class of molecules (briefly:
molecules having the same translation start site and
exon structure, where ‘same’ means ‘belonging to the
same class’).</p>
      <p>Why, now, do we regard PRO as an ontology,
and UniProtKB as a database? There are a number
of answers to this question. Most importantly, as
Figure 1 makes clear, the content of PRO is
organized in terms of hierarchies of
representations of universals of greater and lesser
generality; something which is not the case for the
content of UniProtKB.</p>
      <p>Indeed, there is a sense in which UniProtKB
comprises instance data as its content: almost all
of the protein sequences it provides are derived
through translation of the coding sequences
(CDS) submitted to public nucleic acid databases
on the basis of analysis of biological samples, i.e.
of instances. In this sense each sequence is, given
its provenance, itself a specific instance (it is the
sequence of a corresponding specific sample).</p>
      <p>Yet at the same time each such sequence is
found in (is the sequence of) many trillions of
molecules. It is for this reason that UniProtKB is
useful to biological and biomedical research. Each
UniProt sequence in fact represents a universal
with these many corresponding instances.
UniProtKB, it is true, lacks an explicit hierarchy
for these universals, though one could infer an
implicit hierarchy from the information in the
entry. (We know, for example, that all information
about the sequences is derived from the indicated
gene.) UniProt, as contrasted with PRO and other
ontologies, also lacks explicit definitions –
though, again, these are implied. PRO is explicit
in its representation of the molecules themselves
and, for those cases that derive from
UniProtKB, makes explicit those implied hierarchies and
definitions. On the other hand, all the entries in
UniProtKB (and in practically all other putative
databases maintained by biological scientists)
consist of representations of universals. Each of
the protein sequences contained in UniProtKB,
for example, almost certainly exists in some
trillions of instances.</p>
      <p>
        In the same way, each of the mmCIF files in
the Protein Data Bank (PDB) represents a protein
structure that, again, almost certainly exists in
trillions of instances [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The collection of mmCIF
files is already structured into protein homology
families, and as UniProt and PDB develop we can
expect that more and more of the hierarchical
ontology structure incorporated into PRO will
become explicit in these resources, too. We can
also expect that more and more of this ontological
knowledge – which means knowledge that is
organized in such a way as to make explicit the
relations between universals – as it is made
available in computable form, will in the future
help to drive progress in applying AI to the life
sciences.
      </p>
      <p>
        Why, then, are there not already more
predictive models in biology? Because organisms
are complex systems and it is only certain aspects
of such systems that can be modelled using
mathematics [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Due to evolutionary pressure
and the high costs of evolutionary change when it
occurs in biological systems, nature has conserved
protein homology families to a high degree. That
is a pattern of regularity in a complex system that
is amenable to mathematical modelling. But many
other aspects of biological systems are not
conserved. The task of applied mathematics in
biology is to find the patterns of regularity that can
be modelled using implicit models such as those
exploited by AlphaFold.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements </title>
      <p>Thanks are due to Jobst Landgrebe, Darren
Natale and Asiyah Lin for helpful comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Landgrebe</surname>
          </string-name>
          ,
          <article-title>What AlphaFold teaches us about deep learning with prior knowledge</article-title>
          .
          <source>ICBO</source>
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jumper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          , et al.
          <article-title>Highly accurate protein structure prediction with AlphaFold</article-title>
          .
          <source>Nature</source>
          <volume>596</volume>
          (
          <year>2021</year>
          )
          <fpage>583</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Natale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. N.</given-names>
            <surname>Arighi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Blake</surname>
          </string-name>
          , et al.
          <article-title>Protein Ontology: a controlled structured network of protein entities</article-title>
          .
          <source>Nucleic Acids Res</source>
          .
          <volume>42</volume>
          (
          <year>2014</year>
          )
          <fpage>D415</fpage>
          -
          <lpage>D421</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. V.</given-names>
            <surname>Afonine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Baskaran</surname>
          </string-name>
          , et al.
          <article-title>Announcing mandatory submission of PDBx/mmCIF format files for crystallographic depositions to the Protein Data Bank (PDB)</article-title>
          .
          <source>Acta Crystallographica Sect D: Struct Biol</source>
          .
          <volume>75</volume>
          (
          <year>2019</year>
          )
          <fpage>451</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Landgrebe</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <source>Why Machines Will Never Rule the World: Artificial Intelligence Without Fear</source>
          .
          <publisher-name>Routledge</publisher-name>
          ,
          <year>2022</year>
          .  
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>