<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A framework for disease-specific information extraction from biomedical literature and open databases, aiming at drug re-purposing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgios Paliouras</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stavroula Svolou</string-name>
          <email>ssvolou@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fotis Aisopos</string-name>
          <email>fotis.aisopos@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasios Nentidis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Krithara</string-name>
          <email>akrithara@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Informatics and Telecommunications, National Centre of Scientific Research “Demokritos”, Patr. Grigoriou &amp; Neapoleos Str</institution>
          ,
          <addr-line>Ag. Paraskevi, Athens, 15341</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <fpage>71</fpage>
      <lpage>83</lpage>
      <abstract>
        <p>The information needed for complex biomedical tasks, such as drug re-purposing, is often scattered in different resources, such as biomedical ontologies, databases, and the scientific literature. Scientific literature, in particular, is often the most up-to-date resource, encapsulating the most recent information and latest findings. This work provides a framework for retrieving disease-specific literature and analysing its text with natural language tools for the extraction of concepts and semantic relations. The literature and the information extracted from it are then integrated with information from ontologies and databases, constructing an up-to-date knowledge graph. This graph is then further analysed, providing path-based feature representations for downstream tasks, such as link prediction. This framework is applied to nine neurological, neurometabolic, and neuromuscular disorders, aiming to identify re-purposed drug candidates as potential treatments. To this end, machine learning models are developed, achieving promising results on three complementary link-prediction tasks related to drug re-purposing. The preliminary results reveal that both information extracted from the literature, such as concepts and relations, and document-level information, such as concept co-occurrence and document topics, are useful for these tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>Literature mining</kwd>
        <kwd>Biomedical ontologies</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Drug re-purposing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The scientific literature is an indispensable source of information for biomedical research, as indicated
by the billions of annual searches on PubMed<sup>1</sup>, the bibliographic database of the US National Library
of Medicine (NLM). Moreover, PubMed, which currently consists of almost 31 million documents, is
consistently growing, with more than one million new documents added annually during the last few
years<sup>2</sup>. In this context, identifying information related to a specific disease is a challenging task per se.
In particular, biomedical researchers, who often specialize in a specific disease or a group of diseases,
need up-to-date access to all the information available in the relevant literature. Furthermore, they
need to combine such information with knowledge located in different resources, such as biomedical
ontologies and databases, to estimate the plausibility or the clinical potential of alternative scientific
hypotheses and prioritize their experimental investigation.</p>
      <p>In the field of drug development, for instance, drug re-purposing is a promising strategy for
accelerating the identification of treatments by utilizing existing drugs for new therapeutic purposes. In this
direction, existing drugs, already approved for some diseases, are considered as potential candidates
for treating a new disease of interest. However, to decide systematically which previously approved drug is
most promising, one needs to compare many drugs, identifying and considering
any information available in different resources for each of them, as well as for the disease of interest.
Therefore, the automation of the process is the only viable option for large-scale re-purposing studies,
where many potential candidate drugs are available.</p>
      <p>In this work, we present a software framework developed in the context of the SIMPATHIC project<sup>3</sup>
for mining and analysing information from scholarly articles, using predictive in-silico models, with
the aim of identifying, prioritizing, and selecting re-purposing drug candidates, as well as druggable targets for
such candidates. Initially, all the literature articles relevant to specific diseases of interest are retrieved
from PubMed and PubMed Central (PMC) through semantic search based on Medical Subject Headings
(MeSH)<sup>4</sup> topic annotations. Then, Entity Recognition (ER) and Natural Language Processing (NLP) are
used to extract knowledge from the raw text into a structured form. Specifically, biomedical entities
and relations of certain types are identified, presenting the information discussed within the text of
those articles in the form of knowledge graph triples. This process also presupposes a fine-grained
semantic indexing functionality, employing open and commonly accepted biomedical ontologies and
vocabularies, such as the Unified Medical Language System (UMLS)<sup>5</sup>. This preliminary semantified
literature knowledge graph is then further enriched with data coming from open databases such as
DrugBank<sup>6</sup> and ontologies such as the OBO Foundry Human Disease Ontology<sup>7</sup>, providing useful
associations such as hypernymic relations and known Drug-Drug Interactions (DDIs) that are manually
reported and documented by clinical experts. The integration of all the aforementioned datasets
within the knowledge graph provides the ground for generating up-to-date and comprehensive feature
representations for interactions among biomedical entities, such as drugs and genes, based on the paths
that connect them. This paves the way for further AI-based analysis (e.g. link prediction) with machine
learning models, resulting in candidate drug recommendations and drug prioritization.</p>
      <p>To highlight the adequacy of these automatically generated feature representations in the field
of drug re-purposing, we applied the framework to a group of rare neuromuscular diseases. In
particular, we considered three alternative drug re-purposing scenarios, each formulated as a
link-prediction task, namely predicting Drug-Disease, Drug-Gene, and Drug-Phenotype interactions.
Our preliminary results suggest that the information extracted from literature, its integration into a
knowledge graph, and the generated path-based feature representations can indeed be useful for
link-prediction tasks related to drug re-purposing. In particular, the inclusion of document-level information,
such as concept co-occurrence in documents and document-topic relations, in the feature representations
appears to have a positive impact on the predictive performance of the machine learning models.</p>
      <p>The contributions of this work are summarized below:</p>
      <list list-type="bullet">
        <list-item><p>A framework for generating comprehensive feature representations for alternative drug
re-purposing hypotheses based on the retrieval and mining of biomedical literature and the creation
of an up-to-date disease-specific knowledge graph.</p></list-item>
        <list-item><p>An investigation of the use of the generated representations in machine learning models for drug efficacy
prediction, under three alternative scenarios.</p></list-item>
        <list-item><p>The application of the method to a group of rare diseases, resulting in a list of potential drug
re-purposing candidates for their treatment.</p></list-item>
      </list>
      <p>The rest of this paper is structured as follows: First, in section 2 we provide a brief introduction to
relevant prior work. Then, in section 3 we describe the structure of the proposed framework. In section 4,
we elaborate on the generation and analysis of the datasets, the development of the predictive models,
and the respective preliminary results on prioritizing and selecting re-purposing drug candidates.
Finally, in section 5 we conclude this work and discuss future directions.</p>
      <fn-group>
        <fn id="fn3"><label>3</label><p>https://simpathic.eu/</p></fn>
        <fn id="fn4"><label>4</label><p>https://www.nlm.nih.gov/mesh/meshhome.html</p></fn>
        <fn id="fn5"><label>5</label><p>https://www.nlm.nih.gov/research/umls/index.html</p></fn>
        <fn id="fn6"><label>6</label><p>https://go.drugbank.com/</p></fn>
        <fn id="fn7"><label>7</label><p>http://obofoundry.org/ontology/doid.html</p></fn>
      </fn-group>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. Literature Mining and Knowledge Graphs</title>
        <p>
          Graphs have an intuitive and versatile structure that renders them adequate for integrating and
representing information from different resources, including information extracted from the literature. In the
biomedical domain, this is highlighted by the adoption of knowledge graphs by several frameworks and
methods for integrating and analyzing health data for different diseases [
          <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
          ]. In this work, we build
upon the iASiS Open Data Graph, an open-source framework for the automated retrieval and
integration of disease-specific knowledge into an up-to-date knowledge graph, for any disease of interest [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
The paths connecting different entities in such automatically-generated biomedical knowledge graphs
for Alzheimer’s disease and Lung Cancer have shown promising results as a resource for generating
feature representations for downstream tasks such as drug-drug and drug-gene interaction prediction
[5, 6]. In this work, we extend this prior work by introducing a new framework that considers three
complementary scenarios related to drug re-purposing, namely the prediction of drug interactions
with diseases, genes, and phenotypes. In addition, we experimentally investigate the effectiveness of
this framework under the above scenarios on a larger scale, considering nine rare diseases.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Drug Re-purposing using Literature Information</title>
        <p>Various works try to utilize the available biomedical literature to achieve drug re-purposing. GrEDeL
[7] proposed a biomedical knowledge graph embedding-based recurrent neural network, trained with
known drug therapies, in order to discover candidate drugs for diseases of interest. Similarly, the works
in [8], [9], [10] aim at drug re-purposing based on literature knowledge graph completion techniques,
such as graph embedding methods.</p>
        <p>The latter [10] focuses on disease-specific (cancer) drug re-purposing, as do the works in [11], [12],
[13], which tried to identify potential therapies for SARS-CoV-2 infection during the COVID-19 pandemic.
Sosa et al. [14] focus on rare diseases, using a literature-based knowledge graph embedding method to
identify drug re-purposing candidates.</p>
        <p>The added value of the framework presented in this work lies in the addition of co-references and
topic relations to the literature knowledge graph, which enhances the information extraction process
and its predictive potential, as will be discussed in the experimental results.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>
        The first parts of the proposed framework focus on the retrieval and mining of relevant literature
(Sec. 3.1) and the semantic integration of literature-based information with information from relevant
structured resources (Sec. 3.2), building upon the methods presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Then, the generated
Knowledge Graph is used to produce feature representations for three complementary link prediction
tasks related to drug re-purposing (Sec. 3.3).
      </p>
      <sec id="sec-3-1">
        <title>3.1. Literature Retrieval and Mining</title>
        <p>For the retrieval of relevant literature, we rely on a semantic search over PubMed/Medline. In particular,
we use the annotations of PubMed/Medline articles with MeSH thesaurus terms, provided by NLM,
to identify all the articles relevant to each disease of interest. MeSH, which stands for the Medical
Subject Headings Thesaurus, is a controlled vocabulary developed and maintained by NLM. MeSH
consists of more than thirty thousand topics (descriptors) for annotating the main subject of articles
from the scientific literature, hierarchically organized primarily into broader and narrower topics [15].
For each retrieved article, we also collect its topic annotations, that is, all the MeSH terms that represent
its main subjects. In addition, the abstract text of each article is retrieved from PubMed/Medline,
and the full text from PubMed Central, when available.</p>
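        <p>As a minimal sketch of how such a MeSH-based search could be expressed, the snippet below builds a query URL for the public NCBI E-utilities ESearch service. The helper name build_mesh_query_url, the result limit, and the example heading “Leigh Disease” are assumptions made for illustration; the harvesting module of the framework may issue its queries differently.</p>
        <preformat><![CDATA[
```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_mesh_query_url(mesh_heading, database="pubmed", max_results=100):
    """Build an ESearch URL for records indexed with the given MeSH heading.

    Illustrative helper (not part of the described framework).
    """
    params = {
        "db": database,
        # The [MeSH Terms] field tag restricts the search to MeSH annotations.
        "term": f'"{mesh_heading}"[MeSH Terms]',
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example heading chosen for illustration only.
url = build_mesh_query_url("Leigh Disease")
print(url)
```
]]></preformat>
        <p>The JSON response of such a request lists the identifiers of the matching records, which can then be used to download the corresponding abstracts.</p>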
        <p>The text of each article, abstract and full text, is analyzed with concept and relation extraction
tools, namely MetaMap [16] for the concepts and SemRep [17] for relations between these concepts.
MetaMap and SemRep are two established literature mining tools developed by NLM that rely on a
multi-stage NLP analysis. This analysis involves named entity recognition and disambiguation, and
the use of syntactic and semantic rules. An important merit of MetaMap and SemRep is the adoption
of the UMLS as their semantic reference schema. This renders them comprehensive, supporting a
wide range of more than three million concepts from the UMLS Metathesaurus<sup>8</sup> and more than thirty
types of relations between them from the UMLS Semantic Network (SN)<sup>9</sup>. The precision and recall of
MetaMap have been estimated to range from 84% to 93%, and from 61% to 89% respectively, for specific
types of entities [18, 19]. The precision of extracting different types of relations between concepts has
been estimated to range between 75% and 96%, and the recall between 55% and 70% [20]. In addition, the
vocabulary of these tools can be directly extended with additional concepts from particular vocabularies
of interest. In particular, in this work, we extended the vocabulary considered by these tools with the
NCI Metathesaurus (NCIm), which provides additional concepts from biomedical terminologies not
available in the UMLS Metathesaurus<sup>10</sup>.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Semantic integration into a Knowledge Graph</title>
        <p>The semantic schema of the proposed framework for integrating information from different resources
is based on the UMLS and NCIm. In particular, the information retrieved and mined is integrated into a
knowledge graph with two basic types of nodes: a) nodes representing articles of the literature, and b)
nodes representing concepts from the UMLS or the NCIm metathesauri. Each article node is linked
with each concept node corresponding to a concept recognized in its text with an incoming directed
edge. These edges representing concept-article relations are labeled as “MENTIONED_IN” edges. A
concept node is also linked to any other concept node for which a relation has been extracted from the
text of some article. These concept-to-concept edges are labeled with the respective UMLS SN relation
type. For instance, suppose that the relation “Aspirin” (CUI: C0004057) TREATS “Myocardial Infarction”
(CUI: C0027051) was extracted from the text of some article. In this case, the node for “Aspirin” will be
linked with the node for “Myocardial Infarction” with an edge labeled as “TREATS”.</p>
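        <p>The schema described above can be illustrated with a small in-memory sketch, in which each edge is stored as a (source, label, target) triple. The helper names and the placeholder article identifier are hypothetical and serve only to illustrate the edge directions; the framework itself stores the graph in a graph database.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

# Each edge is a (source, label, target) triple. Concept nodes are identified
# by their UMLS CUI; the article identifier below is a placeholder.
edges = set()


def add_mention(concept_cui, article_id):
    """Concept -> article edge: the concept was recognized in the article text."""
    edges.add((concept_cui, "MENTIONED_IN", article_id))


def add_relation(subject_cui, relation_type, object_cui):
    """Concept -> concept edge labeled with the extracted relation type."""
    edges.add((subject_cui, relation_type, object_cui))


article = "PMID:0000001"  # placeholder, not a real article
add_mention("C0004057", article)  # Aspirin
add_mention("C0027051", article)  # Myocardial Infarction
add_relation("C0004057", "TREATS", "C0027051")

# Index outgoing edges per node to make neighbourhood look-ups easy.
outgoing = defaultdict(list)
for source, label, target in sorted(edges):
    outgoing[source].append((label, target))

print(outgoing["C0004057"])
```
]]></preformat>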
        <p>In order to integrate the topic annotations of the articles in the same knowledge graph, we link each
article node with each concept node corresponding to a MeSH topic of the article with an outgoing edge
labeled as “HAS_MESH”. The MeSH thesaurus is one of the vocabularies included in the UMLS, hence
the mapping from MeSH topics to respective UMLS concepts is available in the UMLS metathesaurus.
Beyond topic annotations for the relevant literature, hierarchical relations between MeSH topics were
also extracted from the MeSH thesaurus. These binary relations were integrated into the same graph as
edges between corresponding concept nodes, labeled after the respective UMLS SN relation type, that is
“IS_A”.</p>
        <p>
          Hierarchical relations between concepts were extracted from other ontologies as well, namely the
Gene Ontology [21] and the Disease Ontology [
          <xref ref-type="bibr" rid="ref5">22</xref>
          ], enriching the integrated knowledge graph with
more edges labeled as “IS_A”. Finally, relations representing the chemical interaction between drugs
were extracted from DrugBank [
          <xref ref-type="bibr" rid="ref6">23</xref>
          ]. These relations were also integrated into the same knowledge
graph, as edges labeled after the respective UMLS SN relation type, that is “INTERACTS_WITH”. As
with MeSH, Gene Ontology, Disease Ontology, and DrugBank are resources included in the vocabularies
of the UMLS metathesaurus, hence the mapping from the original-resource identifiers to UMLS concepts
is also available in the UMLS metathesaurus.
        </p>
        <fn-group>
          <fn id="fn8"><label>8</label><p>In this work, we used the UMLS 2023 release. https://web.archive.org/web/20230710090306/https://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/release/statistics.html</p></fn>
          <fn id="fn9"><label>9</label><p>https://www.nlm.nih.gov/research/umls/knowledge_sources/semantic_network/index.html</p></fn>
          <fn id="fn10"><label>10</label><p>https://ncimetathesaurus.nci.nih.gov/ncimbrowser/</p></fn>
        </fn-group>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Link Prediction and Drug-repurposing model</title>
        <p>Using the generated knowledge graph described above and an external data set of known drug relations
as groundtruth, we now aim to apply link prediction for three use case scenarios: (a) drug-disease,
(b) drug-gene, and (c) drug-phenotype relations. Those use cases have been identified as the most
interesting ones towards drug re-purposing by the SIMPATHIC medical experts. Link prediction is
addressed as a binary classification problem (relation/no-relation) between respective nodes, using a
“white-box” semantic path analysis method presented in [6].</p>
        <p>Specifically, a set of features is extracted from all existing paths between pairs of nodes of interest
in each scenario (e.g. drugs and diseases). Those features are constructed by aggregating the
occurrences of subject/object node semantic types and relation types in each hop <italic>h</italic> of a path, denoted as
nod<sub>h</sub>_semanticType and rel<sub>h</sub>_relationType. A schematic presentation of this process for an example
drug-gene pair can be seen in Figure 1.</p>
        <p>Counting separately the different semantic types (127) and the different relation types (35) for each hop
of a path, we end up with 162 features in total. Note that, in the current settings, we set the
maximum path length to 3, as longer paths do not seem to improve the link prediction accuracy. Thus,
the feature size finally equals 468.</p>
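        <p>The hop-wise aggregation can be sketched as follows, under the simplifying assumption that each path is given as a list of (relation type, semantic type) pairs, one per hop. The function name path_features, the toy paths, and the node label “article” are illustrative only; the exact aggregation used in [6] may differ.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def path_features(paths):
    """Count relation types and node semantic types per hop over all paths.

    Each path is a list of hops; each hop is a (relation_type, semantic_type)
    pair, where semantic_type describes the node reached by that hop.
    """
    counts = Counter()
    for path in paths:
        for hop, (relation_type, semantic_type) in enumerate(path, start=1):
            counts[f"rel{hop}_{relation_type}"] += 1
            counts[f"nod{hop}_{semantic_type}"] += 1
    return counts


# Two toy paths between a drug and a gene (illustrative values only).
# "phsu" and "gngm" are UMLS semantic type abbreviations; "article" is a
# hypothetical label for literature nodes.
paths = [
    [("MENTIONED_IN", "article"), ("MENTIONED_IN", "gngm")],
    [("INTERACTS_WITH", "phsu"), ("MENTIONED_IN", "article"), ("MENTIONED_IN", "gngm")],
]
features = path_features(paths)
print(dict(features))
```
]]></preformat>
        <p>Keys that never occur for a given pair would simply be zero in the final fixed-length feature vector.</p>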
        <p>After generating features for each scenario, a random forest classifier is trained using the labels of
the external groundtruth to learn to discriminate between correct and incorrect relations. As a final
step, in order to address the drug re-purposing task, the trained classifier in each scenario is asked to
score negative pairs (i.e. cases for which no external evidence of relation exists), using all drugs existing
in our knowledge graph. This results in a ranked list of drug candidates, each with a certain
confidence, for every disease, gene, or phenotype of interest, depending on the scenario we want to
investigate. The highest ranked candidates can be used as recommendations for drug re-purposing,
based on the confidence score accompanying each prediction.</p>
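        <p>The final ranking step can be sketched as below, assuming that a confidence score (e.g. the positive-class probability of the trained classifier) is already available for every drug-target pair. The function name, the pair identifiers, and the scores are hypothetical placeholders.</p>
        <preformat><![CDATA[
```python
def rank_candidates(scores, known_positives, top_k=3):
    """Rank pairs without a known relation by descending confidence."""
    unknown = {pair: s for pair, s in scores.items() if pair not in known_positives}
    ranked = sorted(unknown.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Placeholder confidence scores for four drugs and one disease.
scores = {
    ("drug_A", "disease_X"): 0.91,
    ("drug_B", "disease_X"): 0.40,
    ("drug_C", "disease_X"): 0.75,
    ("drug_D", "disease_X"): 0.88,
}
# Pairs with external evidence are excluded from the candidate list.
known_positives = {("drug_A", "disease_X")}

top = rank_candidates(scores, known_positives, top_k=2)
print(top)
```
]]></preformat>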
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Methods Implementation</title>
        <p>
          The proposed framework was implemented as an open-source library of two distinct modules that
can be used independently. The Knowledge Graph generation module<sup>11</sup>, for the retrieval and
integration of up-to-date information from disease-specific literature and selected structured resources, has
been developed in Java, building upon the iASiS Open Data Graph framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. Regarding the
literature mining tools, we have used SemRep (release 1.7) and MetaMap 2020 configured to use the
UMLS 2023 Metathesaurus vocabulary extended with the vocabulary of the NCIm. The
resulting knowledge graph is saved as a Neo4j<sup>12</sup> graph database. The link prediction method<sup>13</sup> has been
implemented with Python 3.12, using the scikit-learn<sup>14</sup> library.
        </p>
        <p><bold>4.2. Data</bold></p>
        <sec id="sec-4-1-1">
          <title>4.2.1. Scholarly Data Retrieval</title>
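          <p>As mentioned, in the context of our experiments, we focus on nine rare neurological, neurometabolic,
and neuromuscular syndromes, namely:</p>
          <list list-type="bullet">
            <list-item><p>Spinocerebellar Ataxia type 3 (SCA3)</p></list-item>
            <list-item><p>Leigh syndrome (Leigh)</p></list-item>
            <list-item><p>Congenital Neurotransmitter defects (CNT)</p></list-item>
            <list-item><p>Pyridoxine Dependent Epilepsy (PDE)</p></list-item>
            <list-item><p>Glutaric Aciduria (GA1)</p></list-item>
            <list-item><p>PMM2-Congenital Disorder of Glycosylation (PMM2)</p></list-item>
            <list-item><p>Zellweger Spectrum Disorders (ZSD)</p></list-item>
            <list-item><p>Myotonic Dystrophy type 1 (DM1)</p></list-item>
            <list-item><p>Congenital Myasthenic Syndrome (CMS)</p></list-item>
          </list>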
          <p>As advised by medical experts, the Guanosine Triphosphate Cyclohydrolase (GTPCH) Deficiency
and the Succinic Semialdehyde dehydrogenase (SSADH) Deficiency for CNT, as well as the Peroxisome
Biogenesis Disorder 1A (PBD1A) and the Peroxisome Biogenesis Disorder 6A (PBD6A) for ZSD, were
considered as the most interesting sub-syndromes to investigate. Thus, we ended up with 11 specific
rare syndromes at hand.</p>
          <p>As a first step towards obtaining our data, we need to decide on the MeSH Headings with which
PubMed and PubMed Central are queried. Some syndromes, such as GA1, do not correspond to a
MeSH Heading, in which case we decided to use a more general term (e.g. “Brain Diseases, Metabolic”).
Table 1 presents an overview of the specific syndromes, along with the corresponding UMLS, OMIM<sup>15</sup>,
ORPHA<sup>16</sup> and MeSH terms.</p>
          <p>Using the framework presented in Section 3 and the MeSH Headings of Table 1, a total of 34,712
scientific articles from PubMed and PMC have been harvested and analyzed, resulting in a knowledge
graph of approximately 215 thousand nodes and 5.5 million relations. A small sample of our
knowledge graph is available via GitHub<sup>17</sup>.</p>
          <fn-group>
            <fn id="fn11"><label>11</label><p>https://github.com/SSvolou/SIMPATHIC_SCOLIA_2025/tree/main/Harvesting/Literature_Harvester</p></fn>
            <fn id="fn12"><label>12</label><p>https://neo4j.com/</p></fn>
            <fn id="fn13"><label>13</label><p>https://github.com/SSvolou/SIMPATHIC_SCOLIA_2025/tree/main/Link_Prediction</p></fn>
            <fn id="fn14"><label>14</label><p>https://scikit-learn.org/</p></fn>
            <fn id="fn15"><label>15</label><p>https://omim.org/</p></fn>
            <fn id="fn16"><label>16</label><p>https://www.orpha.net/en/disease</p></fn>
            <fn id="fn17"><label>17</label><p>https://github.com/SSvolou/SIMPATHIC_SCOLIA_2025/blob/main/Knowledge%20Graph%20Sample.csv</p></fn>
          </fn-group>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.2.2. Drug indication datasets</title>
          <p>Following the creation of the knowledge graph, the aim is to experiment with three separate link
prediction scenarios, identifying (a) new drug-disease treatment relations, (b) new drug-gene interactions, and
(c) drug-phenotype relations. To this end, a groundtruth of approved relations of each type is required,
in order to train the machine learning classifier described in Section 3.</p>
          <p>
            For links of type (a), drug indications related to the 11 aforementioned syndromes were extracted
and unified from TTD [
            <xref ref-type="bibr" rid="ref7">24</xref>
            ], DrugCentral [
            <xref ref-type="bibr" rid="ref8">25</xref>
            ], Open Targets [
            <xref ref-type="bibr" rid="ref9">26</xref>
            ] and DrugBank [
            <xref ref-type="bibr" rid="ref10">27</xref>
            ] repositories. On
the other hand, a set of documented drug-gene interactions was retrieved from TTD, KEGG, DrugBank
and DGIdb [
            <xref ref-type="bibr" rid="ref11">28</xref>
            ] for scenario (b). To proceed with scenario (c), since no open database with structured
drug-phenotype relations could be found, we have decided to apply a simple inductive rule on the
datasets obtained for (a) and (b): If a drug treats a syndrome related to a phenotype or interacts with
a gene related to this phenotype, then we consider the drug-phenotype relation as positive. For each
scenario, we consider all possible pairs that do not have a known positive relation as negatives, resulting
in the highly imbalanced groundtruth datasets depicted in Table 2.
          </p>
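          <p>The inductive rule for scenario (c) can be sketched as a simple set construction. The sketch assumes that syndrome-phenotype and gene-phenotype associations are available as dictionaries; these inputs, the function name, and the toy identifiers are assumptions made for illustration.</p>
          <preformat><![CDATA[
```python
def derive_drug_phenotype_pairs(drug_disease, drug_gene, disease_phenotype, gene_phenotype):
    """Label a drug-phenotype pair as positive if the drug treats a syndrome
    related to the phenotype, or interacts with a gene related to it."""
    positives = set()
    for drug, disease in drug_disease:
        for phenotype in disease_phenotype.get(disease, ()):
            positives.add((drug, phenotype))
    for drug, gene in drug_gene:
        for phenotype in gene_phenotype.get(gene, ()):
            positives.add((drug, phenotype))
    return positives


# Toy inputs (placeholders, not taken from the groundtruth datasets).
drug_disease = {("drug_A", "syndrome_X")}
drug_gene = {("drug_B", "gene_G")}
disease_phenotype = {"syndrome_X": {"Dysarthria"}}
gene_phenotype = {"gene_G": {"Dysarthria", "Ataxia"}}

pairs = derive_drug_phenotype_pairs(drug_disease, drug_gene, disease_phenotype, gene_phenotype)
print(sorted(pairs))
```
]]></preformat>
          <p>Every drug-phenotype pair that is not produced by this rule would then be treated as a negative sample.</p>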
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.3. Link Prediction Model Evaluation</title>
        <p>After extracting the features<fn id="fn18"><label>18</label><p>https://github.com/SSvolou/SIMPATHIC_SCOLIA_2025/tree/main/Extracted_Features</p></fn> from our knowledge graph using the collected ground truth samples, we
apply a Random Forest Classifier from scikit-learn<fn id="fn19"><label>19</label><p>https://scikit-learn.org/</p></fn>. To minimize the risk of over-fitting to specific
patterns in the graph, we set the number of decision trees to 100. By using this ensemble learning
approach, where each tree is trained on a random subset of the data, we are able to identify the most
important features for each use case we study with greater confidence.</p>
        <p>To evaluate our model’s performance, we implemented a nested Cross Validation (CV) strategy
designed to further address the challenges of imbalanced datasets. The outer loop performs
a 10-fold CV, while the inner loop runs a 5-fold CV to determine the best under-sampling approach
for each fold. In particular, in each iteration of the outer loop, we split the training set into 5 folds,
where one of these folds is used as the validation set, and the remaining four are used for the model’s
training.</p>
        <p>In the inner loop, we predefined a list of promising sampling ratios (i.e. ratio = [0.15, 0.16, 0.17,
0.18, 0.19, 0.2, 0.225]) and tested two under-sampling strategies, i.e. RandomUnderSampler20 and
NearMiss21. RandomUnderSampler randomly selects samples from the Negative samples class to
address the imbalance, while NearMiss selects samples from the Negative samples class based on their
proximity to the Positive samples class. We tested all possible combinations of sampling ratios and
under-sampling strategies. For each combination and fold, we trained the model and calculated the
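F1-Score, the harmonic mean of Precision and Recall, using the following formulas:
          <disp-formula><tex-math><![CDATA[\mathrm{Precision} = \frac{TP}{TP + FP}]]></tex-math></disp-formula>
          <disp-formula><tex-math><![CDATA[\mathrm{Recall} = \frac{TP}{TP + FN}]]></tex-math></disp-formula>
          <disp-formula><tex-math><![CDATA[\mathrm{F1\text{-}Score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}]]></tex-math></disp-formula>
where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive and false
negative samples, respectively.</p>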
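        <p>As a small worked example, Precision, Recall, and the F1-Score can be computed from the TP, FP, and FN counts as follows. The counts used here are toy values chosen for illustration and are not results of our experiments.</p>
        <preformat><![CDATA[
```python
def precision_recall_f1(tp, fp, fn):
    """Compute Precision, Recall and F1-Score from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy counts: 8 true positives, 2 false positives, 8 false negatives.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, round(f1, 4))
```
]]></preformat>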
        <p>Based on the highest average F1-score across the 5 folds in the inner loop, we determine the best
under-sampling strategy and ratio, which we then use to train our model on the outer training dataset.
By applying this nested CV approach, we aim to handle the imbalanced nature of the tasks we investigate,
leading to meaningful predictions.</p>
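        <p>The inner-loop selection can be sketched with the standard library only. The sketch interprets a sampling ratio as the number of positives divided by the number of retained negatives, and it replaces model training with a placeholder function that returns one F1-score per inner fold. All function names and the placeholder scores are hypothetical and unrelated to the reported results.</p>
        <preformat><![CDATA[
```python
import random
from statistics import mean

RATIOS = [0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.225]
STRATEGIES = ["RandomUnderSampler", "NearMiss"]


def random_under_sample(positives, negatives, ratio, seed=0):
    """Keep all positives and a random subset of negatives so that
    len(positives) / len(kept negatives) is approximately `ratio`."""
    n_keep = min(len(negatives), round(len(positives) / ratio))
    rng = random.Random(seed)
    return positives, rng.sample(negatives, n_keep)


def select_best_setting(settings, inner_fold_f1):
    """Return the (strategy, ratio) with the highest mean F1 over inner folds."""
    return max(settings, key=lambda setting: mean(inner_fold_f1(setting)))


def toy_inner_f1(setting):
    """Placeholder for training and validating on the 5 inner folds."""
    strategy, ratio = setting
    base = 0.5 if strategy == "RandomUnderSampler" else 0.4
    return [base + ratio] * 5


settings = [(s, r) for s in STRATEGIES for r in RATIOS]
best = select_best_setting(settings, toy_inner_f1)
pos, kept_neg = random_under_sample(list(range(10)), list(range(100, 200)), ratio=0.2)
print(best, len(pos), len(kept_neg))
```
]]></preformat>
        <p>In a full implementation, the selected setting would be applied to the outer training set before fitting the final model for that outer fold.</p>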
      </sec>
      <sec id="sec-4-3">
        <title>4.4. Preliminary Results and Lessons Learned</title>
        <p>The results in Table 3 show that the performance of the classifier is much worse for
scenario (c). This suggests that the groundtruth generated for drug-phenotype relations, as discussed in Section
4.2.2, is not of high quality. Moreover, examining the disease nodes in the knowledge graph, we identify
some syndromes that have few or even no relationships at all. This lack of context in such parts of the
graph stems from the lack of a substantial number of research publications in PubMed that focus
on such syndromes, which ultimately causes the classification model performance to drop.</p>
        <p>Regarding these performance results, it should be noted that some of the tested links may
already have been extracted by SemRep. In those cases, the relation extraction tool has already identified
the examined relationships in scientific articles, thus making the role of the link prediction model
trivial. In all three link prediction scenarios, the percentage of such “obvious” predictions was roughly
3% or less of the overall links (7 out of 217 Drug-Disease, 1 out of 106 Drug-Gene, and 34 out
of 1414 Drug-Phenotype positive samples), so we consider the effect of such cases on the overall task
performance evaluation to be minimal.</p>
        <p>Figure 2 illustrates the ten most important features for the classifier of each link prediction scenario.
As can be observed, co-mention and topic relationships in various positions of paths are of utmost
importance for a classification decision. This finding highlights the significance of the extraction and
inclusion of such relation types and instances into our knowledge graph, in order to enhance the
performance of the classifier.</p>
        <p>A closer examination of the feature sets across the three scenarios reveals that several features
consistently play a crucial role in classification. In particular, “MENTIONED_IN” appears in all three
feature sets, highlighting the importance of co-mention relationships in identifying relevant connections.
Additionally, “INTERACTS_WITH”, “HAS_MESH”, “humn” (Human), and “PROCESS_OF” are present
in at least two scenarios, suggesting that they hold broad predictive value across multiple link prediction
tasks. These common features may indicate fundamental relationships between entities that are
universally relevant, making them key components for the classification model.</p>
        <p>On the other hand, certain features appear only in specific scenarios, indicating their case-specific
significance. For instance, “phsu” (Pharmacological Substance) and “USES” are unique to the top-10
features of scenario (a), while “LOCATION_OF” is only present in scenario (b), and “podg” (Patient or
Disabled Group) is only found in scenario (c). These case-specific features may capture unique aspects
of the relationships within their respective datasets, but their impact does not generalize across all tasks.</p>
        <p>A major advantage of the employed methodology is its transparency. Unlike other methods, such as
Large Language Models (LLMs), which often function as “black-boxes” with limited insight into their
decision-making process, our feature-based approach provides meaningful justifications for classification
outcomes. By explicitly identifying the most influential features in each scenario, we facilitate a deeper
understanding of the underlying relationships within the data. Such transparency is particularly
valuable for making our biomedical predictions trustworthy.</p>
        <p>Using the models tested for each scenario, the next step towards drug re-purposing is to score
all possible drug candidates for each syndrome of interest (Table 1) and to identify the
top-ranking cases. The resulting CSV files, available on GitHub<sup>22</sup>, list all initial candidate drugs per
syndrome, related gene, or phenotype.</p>
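        <p>The scoring and ranking step can be sketched as follows. This is a minimal Python illustration
rather than the code used to produce the published lists: the drug and target names, the scores, the
cut-off top_k, and the CSV column names are all placeholders assumed for the example, and the scores
stand in for the output of a trained link-prediction model.</p>
        <preformat>
```python
import csv
import io


def rank_candidates(scores, top_k=3):
    """Return the top_k highest-scoring drugs for each target.

    `scores` maps (drug, target) pairs to a model score in [0, 1].
    Ties are broken alphabetically by drug name.
    """
    per_target = {}
    for (drug, target), score in scores.items():
        per_target.setdefault(target, []).append((drug, score))
    return {
        target: sorted(pairs, key=lambda p: (-p[1], p[0]))[:top_k]
        for target, pairs in per_target.items()
    }


def to_csv(ranked):
    """Serialise the ranking as CSV text with one row per (target, drug)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["target", "rank", "drug", "score"])
    for target in sorted(ranked):
        for rank, (drug, score) in enumerate(ranked[target], start=1):
            writer.writerow([target, rank, drug, f"{score:.2f}"])
    return buffer.getvalue()


# Placeholder names and scores, invented for illustration only.
example_scores = {
    ("drug_A", "syndrome_X"): 0.91,
    ("drug_B", "syndrome_X"): 0.40,
    ("drug_C", "syndrome_X"): 0.75,
    ("drug_A", "syndrome_Y"): 0.10,
    ("drug_B", "syndrome_Y"): 0.66,
}

ranked = rank_candidates(example_scores, top_k=2)
print(to_csv(ranked))
```
</preformat>
        <p>A target that has no scored pairs simply does not appear in the output, which mirrors the case of
syndromes for which no candidate drugs are listed.</p>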
        <p>As can be observed in the candidate list of scenario (a), for some syndromes (e.g. PMM2-Congenital
disorder glycosylation), no candidate drugs have been identified. This is due to the lack of
literature focusing on such conditions, which results in insufficient related context in the knowledge
graph. As a mitigation measure, we can identify potential candidates for genes (e.g. PMM2 gene) or
phenotypes (e.g. Dysarthria) related to those syndromes, through the other two candidate lists.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and future work</title>
      <p>This paper introduced a framework that generates comprehensive and up-to-date feature representations
for alternative drug re-purposing hypotheses, considering three complementary link prediction tasks,
namely the prediction of drug interactions with diseases, genes, or phenotypes. The framework
retrieves and mines disease-specific literature and automatically generates a knowledge
graph, in which information from this literature is integrated with information from
structured resources. We experimentally investigated the use of this framework to provide drug
re-purposing suggestions in a real-world scenario concerning nine rare neurological, neurometabolic,
and neuromuscular syndromes. To this end, beyond applying the proposed framework
to generate the respective knowledge graph, we also developed three ground-truth
drug indication datasets for model development and evaluation, based on data obtained from publicly
available repositories.</p>
      <p>A key challenge encountered during this process was the highly imbalanced nature of these datasets,
as rare diseases, by definition, have fewer or even no known drug indications at all. Furthermore, no
granularity of the concepts proved to be another important challenge that needs careful consideration,
predominantly for document retrieval. In some cases, such as GA1, we may need to consider a broader
concept due to the lack of sufficient syndrome-specific context and exact identifier alignment across
vocabularies. In other cases, however, important sub-diseases of a main disease concept may be of
interest as well.</p>
      <p><sup>22</sup> https://github.com/SSvolou/SIMPATHIC_SCOLIA_2025/tree/main/Drug_Candidates</p>
      <p>Our future research plans concern improvements to each part of the proposed
framework. For literature retrieval and mining, we built upon traditional biomedical semantic search
and NLP approaches, which are often consistent and explainable. However, exploiting modern methods
that rely on LLMs is a direction that could improve the quality of the generated knowledge graph and
the respective feature representations. We also plan to explore approaches for detecting and removing
noisy or uninformative parts of the knowledge graph, improving its quality and potentially reducing it
to a more manageable size. In addition, we plan to enhance the framework with additional
information, such as the author, journal, or publication year of articles, semantic descriptions of concepts,
and pre-trained embeddings for concept terms. Finally, we are also investigating alternative link prediction
methods based on Graph Neural Networks (GNNs).</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was funded by the SIMPATHIC project, in the context of
European Union’s Horizon 2020 research and innovation programme
under grant agreement No 101080249.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Aisopos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Niazmand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Purohit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Iglesias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Vogiatzis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Menasalvas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Rodriguez</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vigueras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Gomez-Bravo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Torrente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. Hernández</given-names>
            <surname>López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Provencio</given-names>
            <surname>Pulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dalianis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          , G. Paliouras, M.-E. Vidal,
          <article-title>Knowledge graphs for enhancing transparency in health data ecosystems</article-title>
          ,
          <source>Semantic Web</source>
          <volume>14</volume>
          (
          <year>2023</year>
          )
          <fpage>943</fpage>
          -
          <lpage>976</lpage>
          . doi:10.3233/sw-223294.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sakor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jozashoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Niazmand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rivas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Aisopos</surname>
          </string-name>
          , E. Iglesias,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Rohde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Padiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          , G. Paliouras, M.-E. Vidal,
          <article-title>Knowledge4COVID-19: A semantic-based approach for constructing a COVID-19 related knowledge graph from various sources and analyzing treatments' toxicities</article-title>
          ,
          <source>Journal of Web Semantics</source>
          <volume>75</volume>
          (
          <year>2023</year>
          )
          100760. URL: https://linkinghub.elsevier.com/retrieve/pii/S1570826822000440. doi:10.1016/j.websem.2022.100760.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Aisopos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Rentoumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-E.</given-names>
            <surname>Vidal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Menasalvas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez-Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Samaras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Garrard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Torrente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Provencio</given-names>
            <surname>Pulla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dimakopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mauricio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rambla De Argila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. Gaetano</given-names>
            <surname>Tartaglia</surname>
          </string-name>
          , G. Paliouras,
          <article-title>iASiS: Towards Heterogeneous Big Data Analysis for Personalized Medicine</article-title>
          ,
          <source>in: 2019 IEEE 32nd International Symposium on Computer-Based Medical Systems (CBMS)</source>
          , volume 2019-June, IEEE,
          <year>2019</year>
          , pp.
          <fpage>106</fpage>
          -
          <lpage>111</lpage>
          . URL: https://ieeexplore.ieee.org/document/8787467/. doi:10.1109/CBMS.2019.00032.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bougiatiotis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          , G. Paliouras,
          <article-title>iASiS Open Data Graph: Automated Semantic Integration of Disease-Specific Knowledge</article-title>
          ,
          <source>in: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS)</source>
          , volume 2020-July, IEEE,
          <year>2020</year>
          , pp.
          <fpage>220</fpage>
          -
          <lpage>225</lpage>
          . URL: https://ieeexplore.ieee.org/document/9183291/, http://arxiv.org/abs/1912.08633. doi:10.1109/CBMS49503.2020.00049. arXiv:1912.08633
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L. M.</given-names>
            <surname>Schriml</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Arze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nadendla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-W. W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazaitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Felix</surname>
          </string-name>
          , G. Feng,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Kibbe</surname>
          </string-name>
          ,
          <article-title>Disease Ontology: a backbone for disease semantic integration</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>40</volume>
          (
          <year>2012</year>
          )
          <fpage>D940</fpage>
          -
          <lpage>6</lpage>
          . URL: http://www.ncbi.nlm.nih.gov/pubmed/22080554, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC3245088. doi:10.1093/nar/gkr972.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Wishart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Knox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Guo</surname>
          </string-name>
          , D. Cheng, S. Shrivastava,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tzur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gautam</surname>
          </string-name>
          , M. Hassanali,
          <article-title>DrugBank: a knowledgebase for drugs, drug actions and drug targets</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>36</volume>
          (
          <year>2008</year>
          )
          <fpage>D901</fpage>
          -
          <lpage>6</lpage>
          . URL: http://www.ncbi.nlm.nih.gov/pubmed/18048412, http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC2238889. doi:10.1093/nar/gkm958.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>TTD: Therapeutic target database describing target druggability information</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          <fpage>D1465</fpage>
          -
          <lpage>D1477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Avram</surname>
          </string-name>
          , T. B. Wilson,
          <string-name>
            <given-names>R.</given-names>
            <surname>Curpan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Halip</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Borota</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bora</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Bologa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Knockel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Yang</surname>
          </string-name>
          , et al.,
          <article-title>DrugCentral 2023 extends human clinical data and integrates veterinary drugs</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>51</volume>
          (
          <year>2023</year>
          )
          <fpage>D1276</fpage>
          -
          <lpage>D1287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>A.</given-names>
            <surname>Buniello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Suveges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cruz-Castillo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Llinares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Cornu</surname>
          </string-name>
          , I. Lopez,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tsukanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Roldán-Romero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fumis</surname>
          </string-name>
          , et al.,
          <article-title>Open targets platform: facilitating therapeutic hypotheses building in drug discovery</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>53</volume>
          (
          <year>2025</year>
          )
          <fpage>D1467</fpage>
          -
          <lpage>D1475</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Knox</surname>
          </string-name>
          , M. Wilson,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Klinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Franklin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Oler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. E.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Strawbridge</surname>
          </string-name>
          , et al.,
          <article-title>DrugBank 6.0: the DrugBank knowledgebase for 2024</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          <fpage>D1265</fpage>
          -
          <lpage>D1275</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cannon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Stevenson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kiwala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>McMichael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuzma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Morrissey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cotto</surname>
          </string-name>
          , et al.,
          <article-title>DGIdb 5.0: rebuilding the drug-gene interaction database for precision medicine and drug discovery platforms</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          <fpage>D1227</fpage>
          -
          <lpage>D1235</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>