<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Slim-o-matic: a semi-automated way to generate Gene Ontology slims</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mélanie Courtot</string-name>
          <email>mcourtot@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex Mitchell</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maxim Scheremetjew</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Janet Piñero</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura I. Furlong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert D. Finn</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helen Parkinson</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>European Bioinformatics Institute (EMBL-EBI), European Molecular Biology Laboratory</institution>
          ,
          <addr-line>Wellcome Genome Campus, Hinxton</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Programme on Biomedical Informatics (GRIB), Hospital del Mar Medical Research Institute (IMIM), Department of Experimental and Health Sciences, Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Gene Ontology (GO) currently contains over 40,000 terms describing the locations, activities and processes of gene products. Several million gene products have been annotated using the GO, and these annotations are routinely used for multiple applications. However, because the annotations differ in granularity, it is useful to summarize GO annotations using GO slims. GO slims contain a subset of the GO terms, providing a higher-level, broader overview of the ontology while abstracting the finer details. Compiling GO slims is a time-consuming process that relies on manual human expertise; the process of creating the slims is often poorly documented, and maintaining and updating them can be difficult. In this paper, we present a semi-automated way to generate GO slims based on the annotation data available. We applied the tool to two different use cases, one for data overview in the newly released EBI Metagenomics pipeline, and one for gene-disease enrichment analysis using the DisGeNET platform. The slim-o-matic tool supports choosing the best terms for the slim, ensuring that they are representative of the dataset and provide the best coverage using the minimal number of terms.</p>
      </abstract>
      <kwd-group>
        <kwd>Gene Ontology</kwd>
        <kwd>slim</kwd>
        <kwd>visualization</kwd>
        <kwd>enrichment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        GO slims are subsets of the Gene Ontology (GO) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which group annotations made to
lower-level GO terms under higher-level terms. When used for
visualization, GO slims employ only the main categories within the annotations, thereby
providing an overview of the dataset. When used for enrichment analysis, the
statistical power of the signal per slim term is greater than if signals to lower-level
classes were individually counted, which can provide greater insights [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. While
several slims are available on the GO website [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], their development has been ad
hoc and based on empirical methods, relying on both an expert GO editor to
select the best GO terms and a domain expert to provide guidance and describe
the dataset. To address this, and make the slim generation more transparent
and reproducible, the new slim-o-matic methodology was developed.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        A four-step pipeline has been implemented:
        <list list-type="order">
          <list-item>
            <p>The GO term identifiers (IDs) and their frequencies in the dataset are mapped to the current GO ontology file. The current label of each term is retrieved from the GO file, and a new annotation property, ‘label with counts’, is populated by concatenating the label and the associated frequency for each term. A new Web Ontology Language (OWL) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] file is generated.</p>
          </list-item>
          <list-item>
            <p>The newly generated OWL file is opened in the Protégé OWL editor [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and manually inspected. The data owner reviews the hierarchy and chooses the higher-level terms that maximize coverage of the annotations within a fixed number of terms, while retaining specificity for the particular dataset. A list of slim term IDs is generated.</p>
          </list-item>
          <list-item>
            <p>Based on the slim generated in step 2, a script checks which of the original annotations would be included in or excluded from the result, to verify that no large count falls outside the chosen slim.</p>
          </list-item>
          <list-item>
            <p>After iterations of steps 2 and 3, and once the data owner is satisfied with the terms included in the newly generated slim, a mapping script is run. It is based on map2slim [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], and generates a list of all terms in the dataset, each mapped to a higher-level ontology term from the slim.</p>
          </list-item>
        </list>
      </p>
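      <p>As an illustration of step 1, the following minimal Python sketch builds a ‘label with counts’ string for each term. The input dictionaries, the function name <monospace>label_with_counts</monospace> and the "label (count)" format are assumptions made for this example, and the frequencies are invented; the actual script in the repository may differ.</p>
      <preformat>
```python
# Hypothetical inputs: term labels as they might be read from the GO file,
# and per-term annotation frequencies from the dataset (invented numbers).
labels = {
    "GO:0005575": "cellular_component",
    "GO:0005737": "cytoplasm",
    "GO:0005739": "mitochondrion",
}
counts = {"GO:0005737": 120, "GO:0005739": 45}


def label_with_counts(labels, counts):
    """Concatenate each term's current label with its frequency in the dataset."""
    out = {}
    for term_id, label in labels.items():
        freq = counts.get(term_id, 0)  # terms absent from the dataset get 0
        out[term_id] = f"{label} ({freq})"
    return out


annotated = label_with_counts(labels, counts)
for term_id in sorted(annotated):
    print(term_id, annotated[term_id])
```
      </preformat>
      <p>In the pipeline described above, strings of this kind would be stored as an annotation property in the generated OWL file, so that the counts are visible next to each class when the hierarchy is browsed.</p>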
      <p>All the code and files are available in our GitHub repository, https://github.com/ebispot/slim-o-matic.</p>
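      <p>Steps 3 and 4 of the pipeline can be sketched in a similar way. The example below uses a toy, simplified hierarchy and invented counts; the function names are hypothetical, and the mapping shown (every slim term found among a term's ancestors) is a simplification rather than the map2slim algorithm itself.</p>
      <preformat>
```python
# Toy is_a hierarchy (child -> parents); a real run would read this from the GO file.
parents = {
    "GO:0005739": ["GO:0005737"],  # mitochondrion -> cytoplasm (simplified)
    "GO:0005737": ["GO:0005575"],  # cytoplasm -> cellular_component
    "GO:0005886": ["GO:0016020"],  # plasma membrane -> membrane
    "GO:0016020": ["GO:0005575"],  # membrane -> cellular_component
    "GO:0005576": ["GO:0005575"],  # extracellular region -> cellular_component
    "GO:0005575": [],
}
counts = {"GO:0005739": 45, "GO:0005886": 30, "GO:0005576": 25}  # invented
slim = {"GO:0005737", "GO:0016020"}  # candidate slim chosen in step 2


def ancestors_or_self(term, parents):
    """Return the term together with all of its ancestors."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def map_to_slim(counts, parents, slim):
    """Step 4 (simplified): map each annotated term to the slim terms above it."""
    return {
        term: ancestors_or_self(term, parents).intersection(slim)
        for term in counts
    }


def coverage_report(counts, mapping):
    """Step 3: fraction of annotations covered, and the counts left outside the slim."""
    total = sum(counts.values())
    covered = sum(n for term, n in counts.items() if mapping[term])
    missed = {term: n for term, n in counts.items() if not mapping[term]}
    return covered / total, missed


mapping = map_to_slim(counts, parents, slim)
coverage, missed = coverage_report(counts, mapping)
print(f"coverage: {coverage:.0%}")
print("outside the slim:", missed)
```
      </preformat>
      <p>A large count in the list of terms outside the slim is the signal to return to step 2 and add a suitable higher-level term.</p>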
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We applied the slim-o-matic tool to two different use cases, one for dataset
overview in the newly updated EBI Metagenomics pipeline [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and one for
gene-disease enrichment analysis using the DisGeNET [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] platform.
      </p>
      <sec id="sec-3-1">
        <title>EBI metagenomics pipeline</title>
        <p>
          EBI Metagenomics is a resource for the analysis, archiving and browsing
of metagenomic and metatranscriptomic datasets, with the aim of providing
understanding of the microbial community composition and functional profile
of deposited samples. The number of sequences within these datasets can be
vast, running into the hundreds of millions, with similar numbers of
annotations. Users therefore need to be able to visualize GO terms (assigned by
InterProScan [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]) in an easy and compact way. A metagenomics GO slim was
first created in 2012, built using the 30 million annotations available at the time.
Since then, the EBI Metagenomics resource has expanded dramatically and
currently contains tens of billions of annotations for taxonomically diverse sequences
sampled from a wide range of different environments. Given the increase in size
and diversity of annotations, and to support the release of an updated v3.0
analysis pipeline, the metagenomics GO slim was rebuilt using the slim-o-matic
approach. Compared to the pre-existing slim, the new GO slim contains a few
more terms (171 vs 160), and provides vastly improved coverage (98% vs 80%
overall). This increased coverage stems from the fact that, using slim-o-matic, the
GO terms chosen better reflect the current content of EBI Metagenomics
(for example, more eukaryotic-derived sequences) and updates in the GO (such
as better representation of viral terms in 2015 [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]). Fig. 1 shows an excerpt of
the coverage comparison between the old and new GO slims.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>DisGeNET</title>
        <p>
          DisGeNET [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] is a discovery platform that integrates information about genes
and variants associated with human diseases. To facilitate the analysis and
interpretation of the data, DisGeNET supplies a variety of annotations describing genes,
variants, and diseases. Currently, the genes in DisGeNET are characterized by
their Panther protein class and their top-level Reactome pathway. Nevertheless,
46% of genes in DisGeNET have no Panther protein class, and almost 60% have
no Reactome pathway. Adding GO information increases the coverage of
annotations for protein-coding genes in DisGeNET to over 90%. However, the diverse
granularity of the GO terms and the relatively high number of annotations per
gene are a hurdle to straightforward data interpretation. This is why, as a proof of
concept, the slim-o-matic tool was applied to the GO cellular component (GO
CC) subset of GO terms. As a result, more than 1,400 GO CC terms
were reduced to 60 slim GO terms, and the median number of annotations per
gene decreased from 8 to 3. Additionally, an enrichment analysis per disease was
performed to test whether the genes associated with each disease showed a
preferential distribution of cellular locations. DisGeNET diseases (curated subset)
were tested for over-representation of GO CC categories in both the complete and
the slim set of GO terms. To ease the analysis of results, we grouped diseases by
broader categories that correspond to the MeSH classification of diseases [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
The results for the complete GO set after multiple test correction contained over
500 diseases in 360 GO CC categories, while the slim GO set contained 334
diseases in 47 categories. The results of the GO slim enrichment analysis show that
some types of neoplasms and complex cardiovascular diseases are associated with
proteins showing enrichment across all cellular compartments, while Mendelian
disease proteins tend to be more confined to one specific compartment. For
instance, Leigh disease and Coenzyme Q10 deficiency show an enrichment in the
mitochondria (both are mitochondrial diseases). Additionally, most nervous
system diseases and mental disorders are enriched in proteins located in the plasma
membrane (receptors, channels, and transporters).
        </p>
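        <p>For readers unfamiliar with over-representation testing, the sketch below shows one common way to carry it out: a one-sided hypergeometric test per category followed by Benjamini-Hochberg correction. Both choices are assumptions made for illustration, since the text above does not state which test or correction was used, and all numbers and function names in the example are invented.</p>
        <preformat>
```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn from N, of which K carry the category."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, idx in enumerate(order):
        rank = m - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy numbers, invented for illustration only.
population_size = 1000  # genes carrying at least one slim annotation
disease_genes = 20      # genes associated with one disease
# slim category -> (genes in the population with it, disease genes with it)
categories = {
    "mitochondrion": (50, 8),
    "plasma membrane": (200, 5),
    "nucleus": (300, 6),
}

names = list(categories)
pvalues = [
    hypergeom_upper_tail(categories[c][1], categories[c][0],
                         disease_genes, population_size)
    for c in names
]
adjusted = benjamini_hochberg(pvalues)
for name, p, q in zip(names, pvalues, adjusted):
    print(f"{name}: p={p:.3g}, adjusted p={q:.3g}")
```
        </preformat>
        <p>Testing against a slim rather than the complete GO reduces the number of categories tested per disease, which makes the multiple test correction less severe.</p>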
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
        Future developments include investigating ways of creating disease-oriented slims,
where a term denoting a process that might be involved in a disease
pathophysiology (such as angiogenesis) is chosen, and co-occurring annotations are fetched
from the GOA database with their counts. These can then be used as input to step
1 of the slim-o-matic tool, allowing semi-automated generation of slims focused
on specific clinical investigations. While the slim-o-matic method has been
developed based on the GO, nothing in the implementation is actually GO-specific.
This means it could be applied to other resources, such as the Experimental
Factor Ontology (EFO) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], which currently contains just over 19,000 classes (and
growing) and is therefore reaching the limits of manual usability, and applied to the
NHGRI GWAS Catalog [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Finally, an interesting idea would be to fully
automate the slim generation, thereby making it completely reproducible. While
expert intervention may improve coverage and minimize the number of terms, this
comes at a cost in both resources and time, and we aim to implement
fully automated slim extraction from the Ontology Lookup Service (OLS) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]
hosted resources to provide a one-click slim experience to users.
      </p>
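      <p>The disease-oriented idea described above, in which annotations co-occurring with a chosen seed term are counted, could be sketched as follows. The gene names and annotation sets are hypothetical stand-ins for data that would be fetched from the GOA database, and <monospace>cooccurring_counts</monospace> is a name introduced for this example only.</p>
      <preformat>
```python
from collections import Counter

# Hypothetical gene -> GO term annotations; real data would come from GOA.
annotations = {
    "geneA": {"GO:0001525", "GO:0005886", "GO:0005576"},
    "geneB": {"GO:0001525", "GO:0005886"},
    "geneC": {"GO:0005739"},
}


def cooccurring_counts(annotations, seed):
    """Count the terms annotated to genes that also carry the seed term."""
    counts = Counter()
    for terms in annotations.values():
        if seed in terms:
            counts.update(terms - {seed})  # the seed itself is not counted
    return counts


seed_term = "GO:0001525"  # angiogenesis
result = cooccurring_counts(annotations, seed_term)
for term_id, n in result.most_common():
    print(term_id, n)
```
      </preformat>
      <p>The resulting term IDs and counts have the same shape as the input expected by step 1 of the pipeline.</p>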
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>Slim-o-matic allows for easy, fast and semi-automated generation of slims based
on the underlying data. Consequently, slims have improved coverage over the
existing annotations, and can be regenerated on a regular basis as either the
dataset or the ontology evolves. As more and more ontologies reach a large size,
the ability to process their hierarchy semi-automatically and summarize their
content for visualization or enrichment analysis becomes critical.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M</given-names>
            <surname>Ashburner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C A</given-names>
            <surname>Ball</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J A</given-names>
            <surname>Blake</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D</given-names>
            <surname>Botstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H</given-names>
            <surname>Butler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J M</given-names>
            <surname>Cherry</surname>
          </string-name>
          ,
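          <string-name>
            <given-names>A P</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K</given-names>
            <surname>Dolinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S S</given-names>
            <surname>Dwight</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J T</given-names>
            <surname>Eppig</surname>
          </string-name>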
          ,
          <string-name>
            <given-names>M A</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D P</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L</given-names>
            <surname>Issel-Tarver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A</given-names>
            <surname>Kasarskis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J C</given-names>
            <surname>Matese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J E</given-names>
            <surname>Richardson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M</given-names>
            <surname>Ringwald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G M</given-names>
            <surname>Rubin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G</given-names>
            <surname>Sherlock</surname>
          </string-name>
          .
          <article-title>Gene ontology: tool for the unification of biology. The Gene Ontology Consortium</article-title>
          .
          <source>Nature genetics</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>25</fpage>
          -
          <lpage>9</lpage>
          , May
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Seung Yon</given-names>
            <surname>Rhee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Valerie</given-names>
            <surname>Wood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kara</given-names>
            <surname>Dolinski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sorin</given-names>
            <surname>Draghici</surname>
          </string-name>
          .
          <article-title>Use and misuse of the gene ontology annotations</article-title>
          .
          <source>Nat Rev Genet</source>
          ,
          <volume>9</volume>
          (
          <issue>7</issue>
          ):
          <fpage>509</fpage>
          -
          <lpage>515</lpage>
          , Jul
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <article-title>GO subsets on the GO website</article-title>
          . http://geneontology.org/page/download-ontology#Subsets. Accessed: 2016-09-23.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. W3C OWL Working Group.
          <article-title>OWL 2 Web Ontology Language: Document Overview</article-title>
          .
          <source>W3C Recommendation</source>
          , 27 October
          <year>2009</year>
          . Available at http://www.w3.org/TR/owl2-overview/.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Mark A</given-names>
            <surname>Musen</surname>
          </string-name>
          .
          <article-title>The Protégé project: a look back and a look forward</article-title>
          .
          <source>AI matters</source>
          ,
          <volume>1</volume>
          (
          <issue>4</issue>
          ):
          <fpage>4</fpage>
          -
          <lpage>12</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. map2slim wiki. https://github.com/owlcollab/owltools/wiki/Map2Slim. Accessed: 2016-09-22.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Alex Mitchell, Francois Bucchini, Guy Cochrane, Hubert Denise, Petra ten Hoopen, Matthew Fraser, Sebastien Pesseat, Simon Potter, Maxim Scheremetjew,
          <string-name>
            <given-names>Peter</given-names>
            <surname>Sterk</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Robert D.</given-names>
            <surname>Finn</surname>
          </string-name>
          .
          <article-title>EBI metagenomics in 2016 - an expanding and evolving resource for the analysis and archiving of metagenomic data</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>44</volume>
          (
          <issue>D1</issue>
          ):
          <fpage>D595</fpage>
          -
          <lpage>D603</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Janet</given-names>
            <surname>Piñero</surname>
          </string-name>
          , Núria Queralt-Rosinach, Àlex Bravo, Jordi Deu-Pons, Anna Bauer-Mehren, Martin Baron, Ferran Sanz, and
          <string-name>
            <given-names>Laura I</given-names>
            <surname>Furlong</surname>
          </string-name>
          .
          <article-title>DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes</article-title>
          .
          <source>Database</source>
          ,
          <year>2015</year>
          :bav028,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Philip</given-names>
            <surname>Jones</surname>
          </string-name>
          , David Binns,
          <string-name>
            <given-names>Hsin-Yu</given-names>
            <surname>Chang</surname>
          </string-name>
          , Matthew Fraser,
          <string-name>
            <given-names>Weizhong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Craig</given-names>
            <surname>McAnulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hamish</given-names>
            <surname>McWilliam</surname>
          </string-name>
          , John Maslen, Alex Mitchell, Gift Nuka, Sebastien Pesseat, Antony F. Quinn, Amaia Sangrador-Vegas, Maxim Scheremetjew, Siew-Yit Yong, Rodrigo Lopez, and
          <string-name>
            <given-names>Sarah</given-names>
            <surname>Hunter</surname>
          </string-name>
          .
          <article-title>InterProScan 5: genome-scale protein function classification</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>30</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1236</fpage>
          -
          <lpage>1240</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>R. E.</given-names>
            <surname>Foulger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Osumi-Sutherland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>McIntosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hulo</surname>
          </string-name>
          , P. Masson, S. Poux,
          <string-name>
            <given-names>P.</given-names>
            <surname>Le Mercier</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lomax</surname>
          </string-name>
          .
          <article-title>Representing virus-host interactions and other multiorganism processes in the Gene Ontology</article-title>
          .
          <source>BMC Microbiology</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):146, Dec
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <article-title>Medical Subject Headings (MeSH)</article-title>
          . https://www.nlm.nih.gov/mesh. Accessed: 2016-11-18.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>James</given-names>
            <surname>Malone</surname>
          </string-name>
          , Ele Holloway, Tomasz Adamusiak, Misha Kapushesky, Jie Zheng, Nikolay Kolesnikov, Anna Zhukova, Alvis Brazma, and
          <string-name>
            <given-names>Helen</given-names>
            <surname>Parkinson</surname>
          </string-name>
          .
          <article-title>Modeling sample variables with an experimental factor ontology</article-title>
          .
          <source>Bioinformatics</source>
          ,
          <volume>26</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1112</fpage>
          -
          <lpage>1118</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Danielle</given-names>
            <surname>Welter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jacqueline</given-names>
            <surname>MacArthur</surname>
          </string-name>
          , Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff, et al.
          <article-title>The NHGRI GWAS Catalog, a curated resource of SNP-trait associations</article-title>
          .
          <source>Nucleic Acids Research</source>
          ,
          <volume>42</volume>
          (
          <issue>D1</issue>
          ):
          <fpage>D1001</fpage>
          -
          <lpage>D1006</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Simon</given-names>
            <surname>Jupp</surname>
          </string-name>
          , Tony Burdett, James Malone, Catherine Leroy, Matt Pearce,
          <string-name>
            <given-names>Julie</given-names>
            <surname>McMurry</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Helen</given-names>
            <surname>Parkinson</surname>
          </string-name>
          .
          <article-title>A New Ontology Lookup Service at EMBL-EBI</article-title>
          .
          <source>In Proceedings of SWAT4LS International Conference</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>