<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Handling the Challenges of Microbiome Data through Supervised Autoencoders for the Non-invasive Disease Diagnosis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Veronica Buttaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Ceci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianvito Pio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Science Lab, CINI Consortium</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dept. of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <addr-line>Bari</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dept. of Knowledge Technologies, Jozef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The analysis of the human microbiome is important for maintaining human health and for the possible early diagnosis of diseases. Dysbiosis of the microbiome, that is, the disruption of the equilibrium between “good” and “bad” bacteria, can trigger several diseases and disorders [1, 2]. For example, there is evidence that, through the microbiome-gut-brain axis, alterations in the microbiome correlate with neurodevelopmental conditions such as Autism Spectrum Disorder (ASD) [3, 4]. Another relevant disease that has been shown to correlate with the microbiome is Colorectal Cancer (CRC): several studies identified changes in the composition of the gut microbiome associated with CRC progression [4, 5]. In this context, microbiome analysis would represent a non-invasive solution that complements other approaches, such as the fecal immunochemical test (FIT) [6]. Machine learning techniques can accelerate the construction of novel predictive models for the early and non-invasive diagnosis of diseases and disorders from microbiome data. However, while these algorithms could facilitate the identification of novel biomarkers, working with microbiome data poses numerous challenges [7]. The main ones are high dimensionality, data sparsity, high variability and heterogeneity, and data compositionality. Following this line of research, we propose a novel machine learning approach that analyzes human microbiome data to build a predictive model for the non-invasive diagnosis of ASD and CRC and that is able to handle these challenges.</p>
      </abstract>
      <kwd-group>
        <kwd>Supervised Autoencoders</kwd>
        <kwd>Microbiome</kwd>
        <kwd>Autism Spectrum Disorder</kwd>
        <kwd>Colorectal Cancer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. The proposed method</title>
      <p>
        Microbiome data are collections of counts for a wide range of Operational Taxonomic Units (OTUs),
namely, counts observed in fecal samples at a given taxonomic level (family, genus, species, etc.). They
are usually expressed as relative abundances, which introduces data compositionality: the abundances in each
sample sum to a constant and are therefore not independent of each other. To handle
the high variability and the issues raised by data compositionality, we rely on the pseudo Centered Log
Ratio (CLR) normalization [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
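      <p>The normalization step can be illustrated with a minimal sketch. It assumes that the pseudo CLR adds a small pseudo-count to every count before taking logarithms, so that unobserved taxa (zero counts) remain finite; the function name, the pseudo-count value of 1 and the toy counts are illustrative assumptions, not the settings used in our experiments.</p>
      <preformat>
```python
import math


def pseudo_clr(counts, pseudocount=1.0):
    """CLR-transform one sample after adding a pseudo-count to avoid log(0).

    The pseudo-count value is an assumption made for this sketch.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    # Each value becomes the log-ratio between a taxon and the geometric mean.
    return [v - mean_log for v in logs]


# Toy sample: made-up counts for four taxa, one of them unobserved.
sample = [10, 0, 5, 85]
clr_values = pseudo_clr(sample)
print([round(v, 3) for v in clr_values])
```
      </preformat>
      <p>By construction, the transformed values of each sample sum to zero, which removes the constant-sum constraint of relative abundances while preserving the ordering of the taxa.</p>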
      <p>
        We handle the high dimensionality and sparsity of microbiome data through a
specific kind of neural network based on Autoencoders (AEs). AEs usually exhibit a funnel-shaped
structure that learns a compressed representation from which the data provided to the input layer can be
accurately reconstructed in the output layer. However, standard AEs ignore the actual label (i.e., the
diagnosis, in our case) of training instances (i.e., individuals, in our case) while learning the compressed
space. The novelty of our approach with respect to other works [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ] lies in exploiting the
actual diagnosis of individuals during the training of a supervised autoencoder (SAE), which is performed
by simultaneously optimizing a reconstruction loss (RL) and a classification loss (CL). We measure the
RL through the Mean Squared Error (MSE) between the input and the reconstructed output,
and the CL through the Binary Cross Entropy. The combined loss is the
linear combination α · RL + (1 − α) · CL, where α ∈ [0, 1] (resp., 1 − α) is the weight given to
the reconstruction loss (resp., classification loss).
      </p>
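      <p>As a worked illustration of the combined loss, the following sketch computes the MSE, the Binary Cross Entropy and their weighted sum on made-up values. The function names, the name alpha for the reconstruction-loss weight and the clipping constant eps are assumptions of this sketch; in practice the two losses would be computed by a deep learning framework on mini-batches during training.</p>
      <preformat>
```python
import math


def mse(x, x_rec):
    """Mean squared error between inputs and reconstructions (lists of rows)."""
    diffs = [(a - b) ** 2 for row, rec in zip(x, x_rec) for a, b in zip(row, rec)]
    return sum(diffs) / len(diffs)


def bce(y_true, y_prob, eps=1e-7):
    """Binary cross entropy; probabilities are clipped to keep the logs finite."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def combined_loss(x, x_rec, y_true, y_prob, alpha):
    """alpha weights the reconstruction loss, (1 - alpha) the classification loss."""
    return alpha * mse(x, x_rec) + (1.0 - alpha) * bce(y_true, y_prob)


# Made-up inputs, reconstructions, diagnoses and predicted probabilities.
x = [[1.0, 2.0], [3.0, 4.0]]
x_rec = [[1.0, 2.0], [3.0, 2.0]]
y_true = [1, 0]
y_prob = [0.5, 0.5]
loss = combined_loss(x, x_rec, y_true, y_prob, alpha=0.5)
print(round(loss, 4))
```
      </preformat>
      <p>With a weight of 1 the sketch reduces to the loss of a standard (unsupervised) AE, while with a weight of 0 only the classification loss drives the training.</p>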
      <p>The compressed representation produced by the bottleneck layer of the trained autoencoder is then used as input to learn a classification model
based on Random Forests (RF). Fig. 1 depicts the proposed method.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Results and discussion</title>
      <p>We focus our experiments on the diagnosis task on two public datasets, one about CRC and one about ASD.
All the experiments were conducted using stratified 5-fold cross-validation, collecting precision,
recall and F1-score. For comparison, we considered the results obtained: i) without reducing the data
dimensionality; ii) by reducing the input space through PCA; iii) with a standard (unsupervised) AE.
We also experimented with different values of the reconstruction-loss weight α to assess its influence on the results of our SAE.</p>
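      <p>The evaluation protocol can be sketched as follows, under simplifying assumptions: the fold assignment is a deterministic round-robin within each class (a real experiment would typically shuffle the instances first), and the macro-averaged F1-score is the unweighted mean of the per-class F1-scores. The function names and the toy labels are hypothetical and serve only as an illustration.</p>
      <preformat>
```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Assign each sample index to one of k folds, spreading every class evenly."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # round-robin within each class
    return folds


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up, imbalanced diagnoses: ten controls (0) and five cases (1).
labels = [0] * 10 + [1] * 5
folds = stratified_folds(labels, k=5)
print([sorted(f) for f in folds])
print(round(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]), 4))
```
      </preformat>
      <p>Stratification keeps the proportion of cases and controls approximately constant across folds, and macro averaging gives both classes the same weight regardless of their size.</p>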
      <p>For both datasets, the proposed architecture proved able to take the label of training
instances (i.e., the known diagnosis) into account while identifying the optimal compression of the data.
Indeed, the proposed SAE led to better results than classifiers learned from the original
features, as well as those learned after applying PCA or the standard AE. Specifically, we
observed an improvement in macro F1-score of 5.6% on the ASD dataset (with reconstruction-loss weight α = 0.5), and of
32% on the CRC dataset (with α = 0.7), over the results obtained from the original features. While other
values of α did not provide the same improvement, the obtained results were almost always higher
than those achieved by training the classifier on the original features.</p>
      <p>In the future, we will integrate an explainability component to identify which bacteria
contributed most to the diagnosis. We will also extend our method to a semi-supervised
setting in order to exploit unlabeled instances.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This work was partially supported by the European Union - NextGenerationEU through the Italian
Ministry of University and Research, Projects: “FAIR - Future AI Research (PE00000013)”, Spoke 6
- Symbiotic AI; PRIN 2022 “BA-PHERD: Big Data Analytics Pipeline for the Identification of
Heterogeneous Extracellular non-coding RNAs as Disease Biomarkers”, grant n. 2022XABBMA, CUP:</p>
    </sec>
    <sec id="sec-5">
      <title>A. Online Resources</title>
      <p>The source code and data are publicly available, as follows:
• GitHub (source code): https://github.com/VeronicaButtaro98/SAE-microbiome
• CRC dataset: https://hackmd.io/@laurichi13/rJt3ewZut
• ASD dataset: https://www.kaggle.com/datasets/antaresnyc/human-gut-microbiome-with-asd</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Askarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Umbayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-R.</given-names>
            <surname>Masoud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kaiyrlykyzy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Safarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tsoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Olzhayev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kushugulova</surname>
          </string-name>
          ,
          <article-title>The links between the gut microbiome, aging, modern lifestyle and Alzheimer's disease</article-title>
          ,
          <source>Frontiers in Cellular and Infection Microbiology</source>
          <volume>10</volume>
          (
          <year>2020</year>
          )
          <fpage>104</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Role and mechanism of gut microbiota in human disease</article-title>
          ,
          <source>Frontiers in Cellular and Infection Microbiology</source>
          <volume>11</volume>
          (
          <year>2021</year>
          )
          <fpage>625913</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          , et al.,
          <article-title>Altered gut microbial profile is associated with abnormal metabolism activity of Autism Spectrum Disorder</article-title>
          ,
          <source>Gut Microbes</source>
          <volume>11</volume>
          (
          <year>2020</year>
          )
          <fpage>1246</fpage>
          -
          <lpage>1267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Simeon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Radovanović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lončar-Turukalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ceci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pio</surname>
          </string-name>
          ,
          <article-title>Multi-class boosting for the analysis of multiple incomplete views on microbiome data</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>25</volume>
          (
          <year>2024</year>
          )
          <fpage>188</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Novielli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Romano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Magarelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Di Bitonto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Diacono</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chiatante</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lopalco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sabella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Venerito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Filannino</surname>
          </string-name>
          , et al.,
          <article-title>Explainable artificial intelligence for microbiome data analysis in colorectal cancer biomarker identification</article-title>
          ,
          <source>Frontiers in Microbiology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1348974</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Baxter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ruffin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Schloss</surname>
          </string-name>
          ,
          <article-title>Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions</article-title>
          ,
          <source>Genome Medicine</source>
          <volume>8</volume>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>I.</given-names>
            <surname>Moreno-Indias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lahti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nedyalkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Elbere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Roshchupkin</surname>
          </string-name>
          , et al.,
          <article-title>Statistical and machine learning techniques in human microbiome studies: contemporary challenges and solutions</article-title>
          ,
          <source>Frontiers in Microbiology</source>
          <volume>12</volume>
          (
          <year>2021</year>
          )
          <fpage>635781</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Swift</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cresswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stilianoudakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>A review of normalization and differential abundance methods for microbiome counts data</article-title>
          ,
          <source>WIREs Computational Statistics</source>
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <fpage>e1586</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>DeepGeni: Deep generalized interpretable autoencoder elucidates gut microbiota for better cancer immunotherapy</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>4599</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Reiman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <article-title>Using autoencoders for predicting latent microbiome community shifts responding to dietary changes</article-title>
          ,
          in:
          <source>2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1884</fpage>
          -
          <lpage>1891</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>