<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UniProt, ChEBI and IDSM Sachem: exploring biologically relevant ligands</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sebastien Gehant</string-name>
          <email>sebastien.gehant@sib.swiss</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabeth Coudert</string-name>
          <email>elisabeth.coudert@sib.swiss</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anne Morgat</string-name>
          <email>anne.morgat@sib.swiss</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jerven Bolleman</string-name>
          <email>jerven.bolleman@sib.swiss</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicole Redaschi</string-name>
          <email>nicole.redaschi@sib.swiss</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alan Bridge</string-name>
          <email>alan.bridge@sib.swiss</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>UniProt Consortium</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Bioinformatics Institute (EBI), Wellcome Trust Genome Campus</institution>
          ,
          <addr-line>Hinxton Cambridge CB10 1SD</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Protein Information Resource (PIR), Georgetown University Medical Center</institution>
          ,
          <addr-line>3300 Whitehaven Street, NW, Suite 1200, Washington, DC 20007</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Protein Information Resource (PIR), University of Delaware</institution>
          ,
          <addr-line>15 Innovation Way, Suite 205, Newark, DE 19711</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Swiss-Prot group, SIB Swiss Institute of Bioinformatics, CMU</institution>
          ,
          <addr-line>1 Michel Servet, 1211 Geneva 4</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>UniProt has recently standardized the annotation of biologically relevant small molecule ligands, which are essential for protein structure and function, using the chemical ontology ChEBI. This allows UniProt users to retrieve small molecule ligands that match a given chemical structure, or that are members of a given chemical class (as defined by the ChEBI ontology), via the UniProt website and APIs (www.uniprot.org). In this work we use SPARQL to further extend the chemical structure search capabilities of UniProt by federating UniProt with the IDSM Sachem SPARQL endpoint, which supports chemical similarity and substructure searches. Protein structures, whether experimentally determined or predicted using AI methods, often lack biologically relevant ligands. This work provides a simple means to identify them for the purposes of protein structure annotation and modeling.</p>
      </abstract>
      <kwd-group>
        <kwd>UniProt</kwd>
        <kwd>ligands</kwd>
        <kwd>Sachem</kwd>
        <kwd>similarity search</kwd>
        <kwd>federation</kwd>
        <kwd>SPARQL</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Experimentally determined protein structures found in PDB[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] often contain ligands that were
artificially modified to enable crystallization. Accurate protein structure determination and
modeling therefore require that these artificial ligands be replaced by their biologically relevant
(cognate) equivalents.
      </p>
      <p>
        UniProt [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] announced that in release 2022_03 it significantly improved the annotation of
cognate ligands [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], replacing thousands of existing textual descriptions of ligands with their
equivalents from the ChEBI ontology [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. These annotations are now available as part of the
core UniProt dataset on its public SPARQL endpoint at https://sparql.uniprot.org/. Sachem is
a set of tools that support chemical substructure and similarity searches on large corpora
of known chemical compounds imported from PubChem, ChEMBL and ChEBI [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Sachem
exposes these capabilities as a W3C standard SPARQL 1.1 endpoint[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>We show with an example how the annotations in UniProtKB and Sachem data can be
combined via federated SPARQL queries in order to find potential cognate ligands for AMP-PCP
(PDB ligand code ACP), an ATP analog commonly used in crystallography. The ChEBI identifiers
of all chemicals similar to AMP-PCP above a given similarity score (e.g. 0.8) are obtained by
querying Sachem with the SMILES representation of the artificial ligand (SMILES, or Simplified
Molecular-Input Line-Entry System, is a line notation for describing the structure of chemical
species). The list of ChEBI identifiers, bound to the variable ?ligand, is then reduced through
the additional constraint that the variable must also correspond to a cognate ligand annotated
in UniProtKB.</p>
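      <p>The two-step logic described above can be illustrated with a minimal Python sketch. It assumes that the similarity hits and the set of annotated cognate ligands have already been retrieved. The identifiers and scores are hypothetical placeholders, not real results, and the sketch simplifies the match to exact identifier equality and treats the cutoff as inclusive, which is an assumption.</p>
      <preformat><![CDATA[
```python
# Hypothetical toy data: identifiers and scores are placeholders, not real results.
def filter_cognate(hits, annotated, cutoff=0.8):
    """Keep similarity hits at or above the cutoff that are also annotated ligands.

    hits      -- dict mapping a ligand identifier to its similarity score
    annotated -- set of ligand identifiers annotated as cognate ligands
    """
    kept = [(ligand, score) for ligand, score in hits.items()
            if score >= cutoff and ligand in annotated]
    # Highest similarity first, mirroring ORDER BY DESC in a query.
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


sachem_hits = {"CHEBI:X1": 0.95, "CHEBI:X2": 0.85, "CHEBI:X3": 0.60}
uniprot_ligands = {"CHEBI:X2", "CHEBI:X1", "CHEBI:X4"}
result = filter_cognate(sachem_hits, uniprot_ligands)
print(result)
```
]]></preformat>
      <p>In the federated setting both steps are evaluated within a single SPARQL query, so no intermediate list needs to be handled by the client.</p>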
    </sec>
    <sec id="sec-2">
      <title>2. Query</title>
      <preformat>PREFIX up: &lt;http://purl.uniprot.org/core/&gt;
PREFIX sachem: &lt;http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#&gt;
PREFIX rdfs: &lt;http://www.w3.org/2000/01/rdf-schema#&gt;
PREFIX xsd: &lt;http://www.w3.org/2001/XMLSchema#&gt;
SELECT DISTINCT
  ?ligandSimilarityScore
  ?ligand
WHERE {
  SERVICE &lt;https://idsm.elixir-czech.cz/sparql/endpoint/chebi&gt; {
    [ sachem:compound ?ligand ;
      sachem:score ?ligandSimilarityScore ]
      sachem:similaritySearch [
        sachem:query "c1nc(c2c(n1)n(cn2)[C@H]3[C@@H]([C@@H]([C@H](O3)CO[P@@](=O)(O)O[P@](=O)(CP(=O)(O)O)O)O)O)N" ;
        # Isomeric SMILES of AMP-PCP
        sachem:cutoff "8e-1"^^xsd:double ;
        sachem:similarityRadius 1 ]
  }
  ?uniprot up:annotation ?annotation .
  ?annotation a up:Binding_Site_Annotation ;
    up:ligand/rdfs:subClassOf ?ligand .
}
ORDER BY DESC(?ligandSimilarityScore)</preformat>
      <p>Find cognate ligands similar to the artificial ligand AMP-PCP (PDB ACP) using Sachem and
UniProtKB.</p>
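      <p>A query of this kind can also be submitted programmatically. The following minimal Python sketch, which uses only the standard library, parametrizes the SMILES string and the similarity cutoff and sends the query using the SPARQL protocol. The endpoint path https://sparql.uniprot.org/sparql, the helper names build_query and run_query, and the xsd prefix declaration are assumptions made for illustration and are not part of the work described here.</p>
      <preformat><![CDATA[
```python
import json
import urllib.parse
import urllib.request
from string import Template

# Assumed endpoint path; the public UniProt SPARQL service is at sparql.uniprot.org.
UNIPROT_ENDPOINT = "https://sparql.uniprot.org/sparql"

# Isomeric SMILES of AMP-PCP (PDB ligand code ACP).
AMP_PCP_SMILES = ("c1nc(c2c(n1)n(cn2)[C@H]3[C@@H]([C@@H]([C@H](O3)"
                  "CO[P@@](=O)(O)O[P@](=O)(CP(=O)(O)O)O)O)O)N")

QUERY_TEMPLATE = Template("""\
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT ?ligandSimilarityScore ?ligand
WHERE {
  SERVICE <https://idsm.elixir-czech.cz/sparql/endpoint/chebi> {
    [ sachem:compound ?ligand ;
      sachem:score ?ligandSimilarityScore ]
      sachem:similaritySearch [
        sachem:query "$smiles" ;
        sachem:cutoff "$cutoff"^^xsd:double ;
        sachem:similarityRadius 1 ]
  }
  ?uniprot up:annotation ?annotation .
  ?annotation a up:Binding_Site_Annotation ;
    up:ligand/rdfs:subClassOf ?ligand .
}
ORDER BY DESC(?ligandSimilarityScore)
""")


def build_query(smiles, cutoff=0.8):
    """Return the federated query text for one SMILES string and cutoff."""
    if not 0.0 <= cutoff <= 1.0:
        raise ValueError("cutoff must be between 0 and 1")
    # Escape backslashes and double quotes so the SMILES is a valid SPARQL literal.
    escaped = smiles.replace("\\", "\\\\").replace('"', '\\"')
    return QUERY_TEMPLATE.substitute(smiles=escaped, cutoff=repr(float(cutoff)))


def run_query(query, endpoint=UNIPROT_ENDPOINT, timeout=120):
    """POST the query and return a list of (ligand IRI, similarity score) pairs."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint, data=data,
        headers={"Accept": "application/sparql-results+json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        bindings = json.load(response)["results"]["bindings"]
    return [(b["ligand"]["value"], float(b["ligandSimilarityScore"]["value"]))
            for b in bindings]


# Example call (requires network access):
#   for ligand, score in run_query(build_query(AMP_PCP_SMILES, 0.8)):
#       print(ligand, score)
```
]]></preformat>
      <p>Keeping query construction separate from submission allows the generated query text to be inspected, or pasted into the endpoint's web interface, before any request is sent.</p>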
    </sec>
    <sec id="sec-3">
      <title>3. Funding &amp; acknowledgements</title>
      <p>This work was supported by the National Eye Institute (NEI), National Human Genome Research
Institute (NHGRI), National Heart, Lung, and Blood Institute (NHLBI), National Institute on
Aging (NIA), National Institute of Allergy and Infectious Diseases (NIAID), National Institute of
Diabetes and Digestive and Kidney Diseases (NIDDK), National Institute of General Medical
Sciences (NIGMS), National Institute of Mental Health (NIMH), and National Cancer Institute
(NCI) of the National Institutes of Health (NIH) under grant U24HG007822. Additional support
for the EMBL-EBI’s involvement in UniProt comes from European Molecular Biology
Laboratory (EMBL) core funds, the Alzheimer’s Research UK (ARUK) grant ARUK-NAS2017A-1,
the Biotechnology and Biological Sciences Research Council (BBSRC) [BB/T010541/1] and
Open Targets. UniProt activities at the SIB are additionally supported by the Swiss Federal
Government through the State Secretariat for Education, Research and Innovation (SERI). PIR’s
UniProt activities are also supported by the NIH grants R01GM080646, G08LM010720, and
P20GM103446, and the National Science Foundation (NSF) grant DBI-1062520.</p>
      <p>We used IDSM Sachem, kindly provided by the Institute of Organic Chemistry and
Biochemistry of the Czech Academy of Sciences.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Burley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Berman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Duarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Flatt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Hudson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Peisach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Piehl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sekharan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Vallat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Voigt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Westbrook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zardecki</surname>
          </string-name>
          ,
          <article-title>Protein data bank: A comprehensive review of 3D structure holdings and worldwide utilization by researchers, educators, and students</article-title>
          ,
          <source>Biomolecules</source>
          <volume>12</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>UniProt</given-names>
            <surname>Consortium</surname>
          </string-name>
          ,
          <article-title>UniProt: the universal protein knowledgebase in 2023</article-title>
          ,
          <source>Nucleic Acids Res</source>
          . (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Coudert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>de Castro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pozzato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Baratin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Neto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J. A.</given-names>
            <surname>Sigrist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Redaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bridge</surname>
          </string-name>
          , The UniProt Consortium,
          <article-title>Annotation of biologically relevant ligands in UniProtKB using ChEBI</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hastings</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Owen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dekker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ennis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Muthukrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Turner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Swainston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Steinbeck</surname>
          </string-name>
          ,
          <article-title>ChEBI in 2016: Improved services and an expanding collection of metabolites</article-title>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kratochvíl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vondrášek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Galgonek</surname>
          </string-name>
          ,
          <article-title>Sachem: a chemical cartridge for high-performance substructure search</article-title>
          ,
          <source>J. Cheminform</source>
          .
          <volume>10</volume>
          (
          <year>2018</year>
          )
          <fpage>27</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kratochvíl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vondrášek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Galgonek</surname>
          </string-name>
          ,
          <article-title>Interoperable chemical structure search service</article-title>
          ,
          <source>J. Cheminform</source>
          .
          <volume>11</volume>
          (
          <year>2019</year>
          )
          <fpage>45</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>