<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Toward Explainable Biomedical Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Mastropietro</string-name>
          <email>mastropietro@bit.uni-bonn.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Life Science Informatics and Data Science</institution>
          ,
          <addr-line>B-IT</addr-line>
          ,
          <institution>LIMES Program Unit Chemical Biology and Medicinal Chemistry</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lamarr Institute for Machine Learning and Artificial Intelligence</institution>
          ,
          <addr-line>Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Rheinische Friedrich-Wilhelms-Universität</institution>
          ,
          <addr-line>Friedrich-Hirzebruch-Allee 5/6, 53115 Bonn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
<p>Deep learning is a powerful tool for biomedical applications. However, it has a shortcoming that cannot be overlooked: the lack of interpretability. The aim of this doctoral research is therefore to propose a comprehensive biomedical deep learning pipeline enriched with explainable artificial intelligence components. By opening the black box and rationalizing predictions, this pipeline, which spans from the discovery of disease-associated genes to the development of novel drugs, can enable a more effective and transparent use of neural network-based models in real-world bioinformatics and chemoinformatics scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>deep learning</kwd>
        <kwd>explainable artificial intelligence</kwd>
        <kwd>bioinformatics</kwd>
        <kwd>chemoinformatics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Deep learning has been extensively used in bioinformatics and chemoinformatics, delivering promising
results in tasks such as disease gene identification and molecular activity prediction. However, its
widespread adoption in biomedicine is hindered by the inherent black-box character of neural networks,
whose complex, nonlinear mechanisms often make their decisions hard to rationalize. In fields where
understanding the underlying biological rationale is critical, such as determining the genetic basis of
diseases or the efficacy of therapeutic compounds, this lack of transparency undermines trust and limits
practical utility.</p>
      <p>To overcome this challenge, explainable artificial intelligence (XAI) is needed to reveal which input
features influence predictions and how those features interact. This research contributes to this goal
by developing and applying novel XAI techniques specifically designed for deep learning models in
biomedical contexts. These methods are integrated into a comprehensive biomedical deep learning
pipeline, enabling explainable outputs at each stage, from gene prioritization to drug repurposing and
design.</p>
      <p>In addition to deep models, the research deals with classical machine learning and network-based
algorithms, which are relevant in the life sciences and therefore hold a place within this work. The
resulting framework demonstrates that deep learning can be both powerful and explainable, offering
scientists tools that are not only accurate but also trustworthy.</p>
    </sec>
    <sec id="sec-2">
      <title>2. The Explainable Biomedical Deep Learning Pipeline</title>
      <p>This doctoral research introduces an explainable biomedical deep learning pipeline, illustrated in
Figure 1, which integrates multiple components across the two main domains of bioinformatics and
chemoinformatics, leveraging large-scale biological and chemical data. In the bioinformatics area,
the pipeline begins with the training of a gene discovery model (block 1), whose predictions must be
explained (block 2).</p>
      <p>CEUR Workshop Proceedings, ISSN 1613-0073. https://www.mastro.me/ (A. Mastropietro).</p>
      <p>To this end, the research explores two complementary aspects of explainability:
feature importance, which highlights the influence of individual inputs (e.g., specific gene mutations),
and feature interaction, which uncovers how features act jointly (e.g., gene–gene epistasis).</p>
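<p>As a minimal, model-agnostic illustration of the feature-importance side (not one of the attribution methods developed in this research), permutation importance scores each input by how much randomly shuffling it degrades predictive accuracy; the toy model and data below are hypothetical.</p>

```python
import numpy as np

def permutation_importance(model_fn, X, y, n_repeats=10, seed=0):
    """Score each feature by the average drop in accuracy when
    that feature's column is randomly permuted."""
    rng = np.random.default_rng(seed)
    base_acc = np.mean(model_fn(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # destroy the feature/label link
            drops.append(base_acc - np.mean(model_fn(Xp) == y))
        importances[j] = np.mean(drops)
    return importances

# Toy "model": predicts 1 when feature 0 exceeds 0.5; feature 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.random((200, 2))
y = (X[:, 0] > 0.5).astype(int)
model_fn = lambda X: (X[:, 0] > 0.5).astype(int)
scores = permutation_importance(model_fn, X, y)
```

Shuffling the informative feature costs the toy model most of its accuracy, while shuffling the ignored feature changes nothing.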
      <p>
        To obtain explainable gene–disease predictions, we developed XGDAG (eXplainable Gene–Disease
Associations via Graph Neural Networks) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a framework that combines graph neural networks (GNNs)
with explainability techniques such as GNNExplainer [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], GraphSVX [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and SubgraphX [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This
approach not only explains predictions but also aids in discovering novel gene associations through
explainable outputs.
      </p>
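<p>Each of these explainers, in its own way, searches for a small subgraph that preserves the model's prediction. The sketch below conveys that shared idea with a deliberately simple greedy stand-in (it is neither GNNExplainer's learned edge masks nor SubgraphX's Monte Carlo tree search, and the scoring function is a hypothetical surrogate for model confidence):</p>

```python
def greedy_explanation_subgraph(edges, score_fn, k):
    """Greedily pick up to k edges maximizing score_fn(selected_edges),
    where score_fn stands in for 'model confidence when only the
    selected edges are kept'."""
    selected = []
    remaining = list(edges)
    for _ in range(k):
        best_edge, best_score = None, score_fn(selected)
        for e in remaining:
            s = score_fn(selected + [e])
            if s > best_score:
                best_edge, best_score = e, s
        if best_edge is None:
            break  # no remaining edge improves the score
        selected.append(best_edge)
        remaining.remove(best_edge)
    return selected

# Toy scoring: the "model" relies only on edges touching node 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
score = lambda sel: sum(1.0 for (u, v) in sel if 0 in (u, v))
explanation = greedy_explanation_subgraph(edges, score, k=2)
```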
      <p>
However, meaningful explanations require robust training, which is particularly challenging in
bioinformatics due to the prevalence of positive–unlabeled (PU) data. To address this, we propose
NIAPU (Network-Informed Adaptive PU Learning) [
        <xref ref-type="bibr" rid="ref5">5</xref>
], a novel method that uses a Markov diffusion process on biological
networks to assign pseudo-labels with varying degrees of positiveness. These pseudo-labels allow GNNs
to learn effectively, enabling XGDAG to produce accurate and interpretable results.
      </p>
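<p>NIAPU's actual scheme derives network-aware features and adaptively refines the labels; as a simplified sketch of the underlying diffusion idea, a personalized random walk restarting at known disease genes already yields graded "positiveness" scores for unlabeled nodes (all parameter values below are illustrative):</p>

```python
import numpy as np

def diffusion_pseudolabels(adj, seeds, alpha=0.85, n_iter=50):
    """Propagate seed-gene mass over a network with a personalized
    random walk; higher steady-state scores mark unlabeled nodes as
    'more positive' (a simplified stand-in for NIAPU's scheme)."""
    A = np.asarray(adj, dtype=float)
    deg = A.sum(axis=1, keepdims=True)
    P = A / np.where(deg == 0, 1, deg)   # row-stochastic transition matrix
    s0 = np.zeros(A.shape[0])
    s0[list(seeds)] = 1.0 / len(seeds)   # restart on the known disease genes
    s = s0.copy()
    for _ in range(n_iter):
        s = alpha * (P.T @ s) + (1 - alpha) * s0
    return s

# Path network 0-1-2-3-4 seeded at gene 0: among the unlabeled genes,
# scores fall off with distance from the seed.
adj = np.array([[0, 1, 0, 0, 0],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]])
scores = diffusion_pseudolabels(adj, seeds=[0])
```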
      <p>For feature interaction, we developed EpiDetect, a method designed to detect epistatic interactions
from genome-wide data. By analyzing neural network weights, EpiDetect estimates the extent to which
combinations of genetic variants influence a phenotype, revealing complex trait mechanisms beyond
single-gene effects.</p>
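<p>EpiDetect's exact scoring is beyond this abstract, but the general flavor of weight-based interaction detection can be shown on a one-hidden-layer network: a hidden unit can only encode an interaction between two variants if it assigns both of them substantial weight (the network and its weights below are hypothetical):</p>

```python
import numpy as np

def pairwise_interaction_strengths(W1, w2):
    """Score feature pairs from a one-hidden-layer network's weights:
    for each hidden unit, take min(|W1[h, i]|, |W1[h, j]|), weighted by
    the unit's output influence |w2[h]|, and sum over units (a generic
    weight-based heuristic, not EpiDetect's exact scoring)."""
    A = np.abs(W1)           # shape: (hidden_units, features)
    influence = np.abs(w2)   # shape: (hidden_units,)
    n = A.shape[1]
    scores = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            scores[i, j] = np.sum(influence * np.minimum(A[:, i], A[:, j]))
    return scores

# Hidden unit 0 reads variants 0 and 1 jointly; unit 1 reads variant 2 alone,
# so only the (0, 1) pair should score as interacting.
W1 = np.array([[2.0, 2.0, 0.0],
               [0.0, 0.0, 3.0]])
w2 = np.array([1.0, 1.0])
S = pairwise_interaction_strengths(W1, w2)
```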
      <p>
        Once disease-associated genes are identified, they can inform therapeutic strategies (block 3) and
be used as targets for drug treatments. We demonstrate this with a case study on primary biliary
cholangitis (PBC) [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ], where NIAPU-identified genes were used to guide drug repurposing. In this
bioinformatics-driven approach, gene discovery guides the identification of new treatments.
      </p>
      <p>
        The pipeline then transitions into chemoinformatics (block 4), where the graph-like nature of
molecules is exploited. This makes GNNs a natural fit for predicting compound activity against
target proteins. To explain these models, we developed EdgeSHAPer [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ], the first edge-centric
Shapley value-based method for GNNs. Using a tailored Monte Carlo sampling approach, EdgeSHAPer
eficiently identifies molecular substructures most responsible for activity prediction outcomes.
      </p>
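<p>EdgeSHAPer's tailored sampler is not reproduced here; the generic permutation-sampling Shapley estimator below, applied to edges with a toy value function standing in for the GNN's output probability, conveys the principle:</p>

```python
import random

def mc_shapley_edges(edges, value_fn, n_perms=500, seed=0):
    """Estimate each edge's Shapley value by sampling random orderings
    and averaging the marginal contributions value_fn(S + [e]) - value_fn(S).
    A generic permutation estimator, not EdgeSHAPer's tailored sampler."""
    rng = random.Random(seed)
    phi = {e: 0.0 for e in edges}
    for _ in range(n_perms):
        order = edges[:]
        rng.shuffle(order)
        coalition, prev = [], value_fn([])
        for e in order:
            coalition.append(e)
            cur = value_fn(coalition)
            phi[e] += cur - prev
            prev = cur
    return {e: v / n_perms for e, v in phi.items()}

# Toy "model output": jumps to 1.0 as soon as bond (0, 1) is present,
# so that bond should carry the entire attribution.
edges = [(0, 1), (1, 2), (2, 3)]
value = lambda coal: 1.0 if (0, 1) in coal else 0.0
phi = mc_shapley_edges(edges, value)
```

By construction the estimated values sum to the difference between the full graph's output and the empty graph's output, a property Shapley-based explainers inherit exactly.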
      <p>
        We further extended EdgeSHAPer to regression tasks (specifically, compound potency prediction in
protein–ligand interactions) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] to investigate whether GNNs truly capture meaningful biochemical
interaction patterns. Our findings were mixed, revealing that GNNs tend to memorize ligand
structures and can learn interaction-relevant information only when supported by high-quality graph
representations: an unanticipated finding.
      </p>
      <p>
In addition, we address limitations in classical machine learning models. Approximate Shapley values
are often inaccurate for support vector machines (SVMs) used for molecular activity prediction. To
overcome this, we developed SVERAD (Shapley-Value Expressed Radial Basis Function) [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ], a method that
computes exact Shapley values from binary molecular fingerprints in quadratic time, providing reliable
feature attributions for SVMs.
      </p>
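<p>SVERAD's full derivation covers the complete SVM decision function; the toy below isolates the symmetry it exploits for a single RBF kernel value on binary fingerprints: positions where the two fingerprints agree have zero marginal contribution, so the mismatching positions split the kernel deviation exactly, with no exponential enumeration needed. A brute-force Shapley computation confirms this on a small example:</p>

```python
import math
from itertools import combinations

def exact_rbf_shapley(x, y, gamma=0.5):
    """Exact Shapley values for v(S) = exp(-gamma * mismatches_in_S)
    between two binary fingerprints: matching positions get 0, and the
    mismatching positions share v(N) - v(empty) equally (a simplified,
    single-kernel-entry view of the symmetry SVERAD exploits)."""
    diff = [i for i in range(len(x)) if x[i] != y[i]]
    total = math.exp(-gamma * len(diff)) - 1.0
    share = total / len(diff) if diff else 0.0
    return [share if i in diff else 0.0 for i in range(len(x))]

def brute_force_shapley(x, y, gamma=0.5):
    """O(2^n) reference Shapley computation for the same value function."""
    n = len(x)
    v = lambda S: math.exp(-gamma * sum(x[i] != y[i] for i in S))
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
            for S in combinations(others, k):
                phi[i] += w * (v(S + (i,)) - v(S))
    return phi

x, y = [1, 0, 1, 1], [1, 1, 0, 1]   # mismatches at positions 1 and 2
fast = exact_rbf_shapley(x, y)
slow = brute_force_shapley(x, y)
```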
      <p>The final component of the pipeline (block 5) opens to future research directions, focusing on
generative drug design. Using the features identified in earlier stages, generative models can be
guided to create new molecules with desired properties. These candidate drugs can then be validated
and integrated into biomedical databases, closing the loop in a pipeline that is not only data-driven and
predictive but also explainable and biologically grounded.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Results and Conclusions</title>
      <p>This doctoral research presents a comprehensive explainable biomedical deep learning pipeline,
beginning with the challenge of detecting disease-associated genes. This is first addressed with NIAPU,
a network-informed adaptive PU learning method that enables efective training in PU settings by
propagating pseudo-labels through biological networks. NIAPU not only facilitates gene prioritization
but also supports downstream explainability.</p>
      <p>In the second stage of the pipeline, explainability comes into play through the XGDAG framework.
By explaining GNN-based predictions, XGDAG generates explanation subgraphs highlighting candidate
disease genes. Proper GNN training is made possible thanks to NIAPU’s pseudo-labeling. The results
were validated through enrichment analysis, confirming that XAI methods can be used not just for
post hoc explanation but as active tools for gene discovery. XGDAG represents the first approach to
integrate PU learning and GNN explainability for this task.</p>
      <p>Beyond single-gene associations, the research addresses the complexity of epistatic interactions
through EpiDetect, a novel method that uses neural network weights to detect gene–gene interactions
influencing diseases and traits. This deepens the biological insights offered by the pipeline and highlights
the multifactorial nature of disease mechanisms.</p>
      <p>The discovered genes can become targets for drug repurposing (block 3). In a case study on PBC,
NIAPU was used to expand the set of target genes, leading to meaningful candidate drugs via an
approach that is advantageous in terms of both development time and safety profile.</p>
      <p>Bridging into chemoinformatics, the pipeline employs GNNs to predict compound activity, supported
by EdgeSHAPer, the first edge-centric Shapley value explanation method for GNNs. EdgeSHAPer
identifies relevant molecular substructures and outperforms existing tools in both explanation accuracy
and chemical relevance. It was further extended to predict compound potency in protein–ligand
interactions. Surprisingly, we found that while GNNs often struggle to learn interaction patterns
from overly simplistic graph representations, some models prioritize meaningful interaction edges
when trained on high-quality data. These results emphasize the critical role of high-quality graph
representations in enabling effective learning and consequent explainability.</p>
      <p>To address explanations in classical models, we introduced SVERAD, a method for exact Shapley
value computation in SVMs using binary molecular fingerprints. SVERAD provides reliable feature
attributions in quadratic time rather than exponential, further enhancing trust in models used for
compound activity prediction.</p>
      <p>As future research directions, the output of these models, i.e., important molecular features and
substructures, can be fed into the final stage of the pipeline: generative drug design (block 5). By
guiding generative models with features deemed important by the previous steps of the pipeline, we can
enable the creation of molecules that are both novel and effective, filling the last gap in the proposed
framework.</p>
      <p>In summary, this doctoral research presents a complete and explainable deep learning pipeline for
biomedicine, from gene discovery to drug design. Each component plays a pivotal role in enabling
a transparent and trustworthy use of deep learning models. We presented different XAI solutions
working at every step of the pipeline, thereby enhancing the trustworthiness of neural networks in
bioinformatics and chemoinformatics, and going toward explainable biomedical deep learning.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>I would like to thank Aris Anagnostopoulos, Paolo Tieri, and Jürgen Bajorath for their instrumental
help and support during my doctoral studies. The Ph.D. thesis this extended abstract summarizes was
defended on January 31st, 2024, at Sapienza University of Rome.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author used ChatGPT-4 in order to: Grammar and spelling
check. After using this tool/service, the author reviewed and edited the content as needed and takes
full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Carlo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anagnostopoulos</surname>
          </string-name>
          ,
          <article-title>XGDAG: explainable gene-disease associations via graph neural networks</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>39</volume>
          (
          <year>2023</year>
          )
          <elocation-id>btad482</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ying</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bourgeois</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zitnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          ,
          <article-title>GNNExplainer: generating explanations for graph neural networks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Duval</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Malliaros</surname>
          </string-name>
          ,
          <article-title>GraphSVX: Shapley value explanations for graph neural networks</article-title>
          ,
          <source>in: Machine Learning and Knowledge Discovery in Databases. Research Track: European Conference, ECML PKDD</source>
          <year>2021</year>
          , Proceedings,
          <source>Part II 21</source>
          , Springer,
          <year>2021</year>
          , p.
          <fpage>302</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <article-title>On explainability of graph neural networks via subgraph explorations</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2021</year>
          , p.
          <fpage>12241</fpage>
          -
          <lpage>12252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Stolfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasculli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vergni</surname>
          </string-name>
          ,
          <article-title>NIAPU: network-informed adaptive positive-unlabeled learning for disease gene identification</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>39</volume>
          (
          <year>2023</year>
          )
          <elocation-id>btac848</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shahini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasculli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stolfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vergni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cozzolongo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pesce</surname>
          </string-name>
          ,
          <article-title>Network proximity-based drug repurposing strategy for primary biliary cholangitis</article-title>
          ,
          <source>Digestive and Liver Disease</source>
          <volume>54</volume>
          (
          <year>2022</year>
          )
          <elocation-id>S106</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Shahini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasculli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stolfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tieri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vergni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cozzolongo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pesce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Giannelli</surname>
          </string-name>
          ,
          <article-title>Network proximity-based drug repurposing strategy for early and late stages of primary biliary cholangitis</article-title>
          ,
          <source>Biomedicines</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>1694</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasculli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rodríguez-Pérez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bajorath</surname>
          </string-name>
          ,
          <article-title>EdgeSHAPer: bond-centric Shapley value-based explanation method for graph neural networks</article-title>
          ,
          <source>iScience</source>
          <volume>25</volume>
          (
          <year>2022</year>
          )
          <fpage>105043</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasculli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bajorath</surname>
          </string-name>
          ,
          <article-title>Protocol to explain graph neural network predictions using an edge-centric Shapley value-based approach</article-title>
          ,
          <source>STAR Protocols</source>
          <volume>3</volume>
          (
          <year>2022</year>
          )
          <fpage>101887</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pasculli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bajorath</surname>
          </string-name>
          ,
          <article-title>Learning characteristics of graph neural networks predicting protein-ligand affinities</article-title>
          ,
          <source>Nature Machine Intelligence</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>1427</fpage>
          -
          <lpage>1436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Feldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bajorath</surname>
          </string-name>
          ,
          <article-title>Calculation of exact Shapley values for explaining support vector machine models using the radial basis function kernel</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>19561</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mastropietro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bajorath</surname>
          </string-name>
          ,
          <article-title>Protocol to explain support vector machine predictions via exact Shapley value computation</article-title>
          ,
          <source>STAR Protocols</source>
          <volume>5</volume>
          (
          <year>2024</year>
          )
          <fpage>103010</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>