<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>through Knowledge Graphs for Improving Diabetes Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rita T. Sousa</string-name>
          <email>rita.sousa@uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko.paulheim@uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Group, Universität Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>26</volume>
      <issue>2024</issue>
      <abstract>
        <p>Diabetes is a worldwide health issue affecting millions of people. Machine learning methods have shown promising results in improving diabetes prediction, particularly through the analysis of diverse data types such as gene expression data. While gene expression data can provide valuable insights, sample sizes in expression datasets are usually limited, and data from different datasets that measure different genes cannot be easily combined.</p>
      </abstract>
      <kwd-group>
        <kwd>Diabetes Prediction</kwd>
        <kwd>Expression data</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Ontology</kwd>
        <kwd>Knowledge Graph Embedding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Motivation</title>
      <p>
        Diabetes is a chronic health condition resulting from insufficient insulin production by the
pancreas or the body’s inability to use the insulin it generates effectively [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This disease
has emerged as a worldwide health issue, impacting millions of people globally. According
to the World Health Organization, in 2019, diabetes directly contributed to 1.5 million deaths,
with 48% occurring before the age of 70. Besides that, this chronic disease is associated with the
development of several comorbidities, such as blindness, kidney failure, heart attacks, strokes,
and lower limb amputation.
      </p>
      <p>
        Due to the multidisciplinary nature of diabetes, predicting and detecting this complex
disease continues to pose a significant challenge. In the last decades, some approaches have
demonstrated encouraging outcomes using machine learning methods to identify patterns and
potential risk factors linked to diabetes, allowing not only the early detection of diabetes but
also enabling tailored interventions [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]. These machine learning approaches encompass
several types of data, including electronic health records [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], imaging data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and demographic
data [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Omics data, namely gene expression datasets, have also received attention since
genomics, epigenomics, and transcriptomics can help understand the critical pathways and
regulatory mechanisms in diabetes [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        While gene expression datasets are readily accessible in public databases, and gene expression
analysis is a powerful tool for pinpointing genes associated with diseases, particularly in the
context of diabetes prediction, handling this type of data raises a significant issue. On the
one hand, gene expression datasets often contain only a relatively small number of samples.
On the other hand, supervised machine learning methods are
data-driven and rely on large amounts of labeled data for effective training and performance.
One alternative is to combine multiple expression datasets to enlarge the sample pool
for training machine learning models. However, this raises the challenge of how to
integrate the information from multiple expression datasets, as each dataset may measure
expression for a distinct set of genes. Additionally, variations in experimental platforms and designs
across different studies further complicate integration efforts. Knowledge graphs (KGs) present a
promising solution. KGs can represent knowledge about concepts and relationships
in a fully machine-readable format [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Moreover, several biomedical ontologies are publicly
available to enrich KGs [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], enabling the representation of domain-specific knowledge. In fact,
over the past few years, biomedical ontologies and KGs have emerged as a tool for biomedical
data integration and have been adopted in many machine learning applications, with KG
embedding approaches [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] becoming increasingly popular [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>This work tackles the challenge of integrating heterogeneous gene expression datasets in
biomedical applications, focusing on diabetes prediction. We propose a novel approach that
generates a KG to incorporate both gene expression data and domain-specific knowledge and
then employs KG embedding methods to generate vector representations of patients. These
patient representations serve as the input for a classifier that predicts the likelihood of a patient
having diabetes. We evaluated the impact of integrating multiple gene
expression datasets; the results show that incorporating other expression datasets and
domain-specific knowledge improves diabetes prediction, emphasizing the efficacy of our approach. This
work is developed in the context of the KI-DiabetesDetektion project, funded by the German
Federal Ministry of Education and Research, which aims to integrate biomedical data from various
sources and apply machine learning methods to improve the early-stage detection of diabetes.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        Several works have been using gene expression data to predict diabetes, employing diverse
methodologies and datasets. In Li et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], a support vector machine classifier is used for
the diagnosis of diabetes. While multiple datasets were extracted from the Gene Expression
Omnibus database, the machine learning model was trained on only one dataset, with three
additional datasets used for validation. Feature selection involved the identification of ten
common genes across all datasets. Mansoori et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and Kazerouni et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] focus on long
non-coding RNAs potentially associated with type 2 diabetes. Both studies used data
collected from 100 diabetic and 100 non-diabetic individuals to train the classifiers. Mansoori et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
employed logistic regression, whereas Kazerouni et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] compared four classifiers (<italic>k</italic>-nearest
neighbor, support vector machine, logistic regression, and artificial neural networks) to predict
type 2 diabetes using the expression values of specific long non-coding RNAs as input. Both
studies suggest that enlarging the dataset with more samples would likely improve
the performance of the classifiers. Furthermore, some other approaches explore expression data
for diabetes prediction without employing machine learning methods [
        <xref ref-type="bibr" rid="ref17 ref18 ref9">17, 18, 9</xref>
        ].
      </p>
      <p>In the biomedical domain, the exploration of KGs has become increasingly prominent, with
KG embedding methods emerging as particularly promising for capturing KG-based
information [19]. These methods map entities and relationships in a KG into a lower-dimensional vector
space while preserving graph structure and, in some cases, semantic information. Various types
of KG embedding methods have been proposed to date. Translational models, exemplified by
TransE [20], employ distance-based scoring functions to capture relationships between entities.
Semantic matching approaches, such as DistMult [21], use similarity-based
scoring functions to capture the latent semantics of entities and relations in their vector space
representations. Walk-based methods, such as RDF2Vec [22], employ random walks to generate
entity sequences as input to a neural language model that learns latent entity representations.
Different walk-based approaches differ in their random walk strategies and in how they consider
edge direction and type. For biomedical KGs, which are characterized by rich hierarchical
relations, walk-based approaches are particularly well suited because these
hierarchical relations can be more easily captured in walks.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology</title>
      <p>As discussed above, gene expression datasets typically have only a few instances, and different
datasets record expression for different genes. Thus, when training prediction models, one can
either (1) use only one dataset, thereby having only little training data, or (2) try to combine
multiple datasets. In the latter case, the datasets are typically “incompatible” in the sense that they
have different feature sets, i.e., a naive combination would lead to a larger dataset with many
NULL values.</p>
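      <p>The sparsity caused by a naive combination can be illustrated with a minimal sketch. The patient identifiers, gene names, and expression values below are hypothetical toy data, and the helper functions are illustrative only; they are not part of the published source code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
# Hypothetical toy data: two expression datasets that measure different genes.
dataset_a = {
    "patient_a1": {"GENE1": 5.2, "GENE2": 7.1},
    "patient_a2": {"GENE1": 4.8, "GENE2": 6.9},
}
dataset_b = {
    "patient_b1": {"GENE2": 3.3, "GENE3": 9.0},
    "patient_b2": {"GENE2": 3.9, "GENE3": 8.4},
}


def naive_merge(*datasets):
    """Stack all samples over the union of genes; unmeasured genes become None."""
    genes = sorted({g for d in datasets for row in d.values() for g in row})
    merged = {}
    for d in datasets:
        for sample, row in d.items():
            merged[sample] = [row.get(g) for g in genes]
    return genes, merged


def missing_fraction(merged):
    """Fraction of cells in the merged table that hold no measurement."""
    cells = [v for row in merged.values() for v in row]
    return sum(v is None for v in cells) / len(cells)


genes, merged = naive_merge(dataset_a, dataset_b)
frac = missing_fraction(merged)
print(genes, frac)
```
]]></preformat>
      <p>Even in this small example, one third of the merged table is empty, and the share grows as datasets with less overlapping gene sets are added.</p>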
      <p>To overcome these challenges, we propose a methodology to integrate multiple expression
datasets into a biomedical KG and then use it for diabetes prediction. Figure 1 shows an overview
of this methodology. The first step builds the KG, which integrates not only
expression data from different datasets but also domain knowledge on protein function and
protein interactions. Then, we generate a vector representation for each patient described in
the biomedical KG. The last step gives these vectors as input to a classifier. The source
code for our methodology is available on GitHub (https://github.com/ritatsousa/expressionKG).</p>
      <sec id="sec-4-1">
        <title>3.1. Expression Data</title>
        <p>Several studies have recently explored gene expression for diabetic and non-diabetic individuals,
and the findings from these studies can be accessed in publicly available databases. The Gene
Expression Omnibus (GEO) [23] is a public database maintained by the National Center for
Biotechnology Information that archives high-throughput gene expression and other genomics
datasets. Each GEO dataset represents a curated collection of biologically comparable GEO
samples whose measurements are assumed to be calculated equivalently. The file associated with
each dataset contains the raw gene expression data generated by microarrays. In addition to raw
data, processed files containing normalized or transformed expression values may be included.
In the latter scenario, the data is structured in a tabular format, with each row corresponding
to a unique sample, columns representing different genes, and the cells containing the
expression values of those genes for each respective sample.</p>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Building the Knowledge Graph</title>
        <p>The KG is built by integrating two types of data sources: expression data and domain-specific
knowledge. Figure 2 illustrates the integration of the two data sources into a KG.</p>
        <p>Since our approach relies on KG embeddings for generating patient representations
and most embedding approaches are not able to handle numeric literals [24], we adopt two
different strategies to include the expression data in the KG:</p>
        <list list-type="bullet">
          <list-item>
            <p>The first strategy represents patient gene expression values in the KG using blank
nodes and binning. Following the technique proposed in [24], we create bins
from the set of expression values for each gene within a given dataset. The percentage of
unique values defines the number of bins. To implement this, a blank node is generated
to represent the expression value of a specific gene for a given patient. The patient
is connected to the blank node, which, in turn, is linked both to the corresponding gene and
to a bin representing the expression value. Let us
consider a simplified example using RDF:</p>
            <preformat>(:patientID, rdf:type, :Patient)
(:geneID, rdf:type, :Gene)
(:patientID, :hasExpression, _:x)
(_:x, :isExpressionOfGene, :geneID)
(_:x, :hasValue, :binID)</preformat>
            <p>where _:x denotes a blank node.</p>
          </list-item>
          <list-item>
            <p>The second strategy links patients and genes directly based on
expression values. A link between a patient and a gene is created when the patient’s
expression value for that gene is higher than the average expression value of
that gene within the dataset.</p>
          </list-item>
        </list>
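        <p>The two strategies can be sketched as triple generators. This is a minimal illustration under stated assumptions, not the published implementation: equal-width bins with a fixed <monospace>n_bins</monospace> stand in for the binning operator of [24], and the predicate <monospace>:hasHighExpression</monospace>, the bin identifiers, and the toy expression values are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from statistics import mean


def bin_index(value, lo, hi, n_bins):
    """Equal-width bin index in [0, n_bins - 1] (assumed binning scheme)."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * n_bins), n_bins - 1)


def gene_values(expr):
    """Collect, per gene, the values measured across all patients."""
    values = {}
    for row in expr.values():
        for gene, value in row.items():
            values.setdefault(gene, []).append(value)
    return values


def binning_triples(expr, n_bins=3):
    """Strategy 1: patient -> blank node -> (gene, bin)."""
    values = gene_values(expr)
    triples = [(f":{p}", "rdf:type", ":Patient") for p in sorted(expr)]
    triples += [(f":{g}", "rdf:type", ":Gene") for g in sorted(values)]
    counter = 0
    for gene in sorted(values):
        lo, hi = min(values[gene]), max(values[gene])
        for patient in sorted(expr):
            if gene not in expr[patient]:
                continue
            blank = f"_:b{counter}"  # one blank node per (patient, gene) pair
            counter += 1
            bin_id = f":{gene}_bin{bin_index(expr[patient][gene], lo, hi, n_bins)}"
            triples += [
                (f":{patient}", ":hasExpression", blank),
                (blank, ":isExpressionOfGene", f":{gene}"),
                (blank, ":hasValue", bin_id),
            ]
    return triples


def linking_triples(expr, predicate=":hasHighExpression"):
    """Strategy 2: link a patient to a gene if its value exceeds the gene average."""
    values = gene_values(expr)
    triples = []
    for gene in sorted(values):
        average = mean(values[gene])
        for patient in sorted(expr):
            if expr[patient].get(gene, float("-inf")) > average:
                triples.append((f":{patient}", predicate, f":{gene}"))
    return triples


# Hypothetical toy dataset: patient -> gene -> expression value.
expr = {
    "p1": {"G1": 1.0, "G2": 10.0},
    "p2": {"G1": 2.0, "G2": 30.0},
    "p3": {"G1": 9.0, "G2": 20.0},
}
bin_graph = binning_triples(expr)
link_graph = linking_triples(expr)
```
]]></preformat>
        <p>In this sketch, the binning strategy places the gene and the bin on two separate edges leaving the blank node, whereas the linking strategy yields a single direct edge between patient and gene.</p>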
        <p>The domain-specific knowledge includes the Gene Ontology (GO) [25], GO annotation
data [26], and protein-protein interaction (PPI) data [27]. The GO defines a hierarchy of classes
that describe protein functions and can be represented as a graph where nodes are GO classes
and edges define relationships between them. The GO encompasses three distinct domains
for characterizing functions: the biological processes a protein is involved in, the molecular
functions a protein performs, and the cellular components where a protein is located. These
three domains of GO are represented as separate root ontology classes since they do not share
any common ancestor. The GO annotation data assign functions, represented as
GO classes, to proteins; these assignments are represented as links in the graph. Finally, the PPI data is extracted from
STRING [27], one of the largest available PPI databases, which integrates physical interactions
and functional associations between proteins collected from several sources.</p>
        <p>To bridge the gap between the two types of data sources, the expression data and the
domain-specific knowledge, a gene in the expression data graph is mapped to a protein in the
domain-specific KG. Online ID mapping tools, namely the UniProt ID Mapping tool<sup>1</sup>, are used to
convert identifiers between genes and proteins.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Learning Patient Representations</title>
        <p>We propose to generate patient representations by leveraging the information of multiple
gene expression datasets and domain knowledge. As a preliminary step, the KG is converted
into a directed and labeled RDF graph, following the W3C’s OWL to RDF Graph Mapping
guidelines<sup>2</sup>. Next, our methodology employs RDF2Vec, a KG embedding method, to generate
the low-dimensional vector representations. RDF2Vec [22] is a path-based embedding method
that generates random walks in a graph, taking into consideration both edge direction and
type, which makes it particularly suited to KGs. Word2vec, a neural language model, is then applied to the
random walks on the RDF graph to produce the embeddings.
<fn><label>1</label><p>https://www.uniprot.org/id-mapping</p></fn></p>
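        <p>The walk-generation step can be sketched as follows. This is a simplified illustration of direction-aware random walks that record both predicates and entities, not the RDF2Vec reference implementation; the triples, predicate names, and parameters (<monospace>n_walks</monospace>, <monospace>depth</monospace>, <monospace>seed</monospace>) are hypothetical. In a full pipeline, the resulting token sequences would be passed as sentences to a Word2vec implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def build_adjacency(triples):
    """Index outgoing edges per subject so walks follow edge direction."""
    adjacency = {}
    for subject, predicate, obj in triples:
        adjacency.setdefault(subject, []).append((predicate, obj))
    return adjacency


def random_walks(triples, start, n_walks=5, depth=2, seed=0):
    """Generate walks of the form [entity, predicate, entity, ...] from start."""
    rng = random.Random(seed)
    adjacency = build_adjacency(triples)
    walks = []
    for _ in range(n_walks):
        walk, node = [start], start
        for _ in range(depth):
            edges = adjacency.get(node)
            if not edges:  # dead end: no outgoing edge to follow
                break
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])  # keep the edge type in the sequence
        walks.append(walk)
    return walks


# Hypothetical toy graph: patient -> gene -> protein -> GO class / protein.
triples = [
    (":patient1", ":hasHighExpression", ":geneA"),
    (":patient1", ":hasHighExpression", ":geneB"),
    (":geneA", ":mapsTo", ":proteinA"),
    (":geneB", ":mapsTo", ":proteinB"),
    (":proteinA", ":hasFunction", ":GO_1"),
    (":proteinB", ":interactsWith", ":proteinA"),
]
walks = random_walks(triples, ":patient1", n_walks=4, depth=3, seed=1)
for walk in walks:
    print(" ".join(walk))
```
]]></preformat>
        <p>Because predicates are kept as tokens in each walk, the language model sees both the type and the direction of every traversed edge.</p>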
        <p>Two distinct approaches are employed to represent patients: the first generates
RDF2Vec embeddings directly for the patients using the KG, while the second generates RDF2Vec
embeddings for the genes present in the gene expression datasets and represents each patient as the
weighted average of gene embeddings, with weights determined by the respective gene expression values.</p>
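        <p>The second approach can be written as a short function. The sketch below assumes that the weights are the patient's non-negative expression values normalized to sum to one; the paper does not spell out the exact weighting, so this normalization, the function name, and the toy embeddings are assumptions.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def patient_vector(expression, gene_embeddings):
    """Weighted average of gene embeddings, weighted by expression value.

    Assumes non-negative expression values; genes without an embedding
    are ignored (an assumption, not a documented behaviour).
    """
    genes = [g for g in expression if g in gene_embeddings]
    total = sum(expression[g] for g in genes)
    if not genes or total == 0:
        raise ValueError("no usable gene expression values for this patient")
    dim = len(gene_embeddings[genes[0]])
    vector = [0.0] * dim
    for gene in genes:
        weight = expression[gene] / total
        for i, component in enumerate(gene_embeddings[gene]):
            vector[i] += weight * component
    return vector


# Hypothetical 2-dimensional gene embeddings and one patient's expression values.
gene_embeddings = {"G1": [1.0, 0.0], "G2": [0.0, 1.0]}
expression = {"G1": 1.0, "G2": 3.0, "G3": 5.0}  # G3 has no embedding
vector = patient_vector(expression, gene_embeddings)
print(vector)
```
]]></preformat>
        <p>Since only gene embeddings are needed, this representation does not depend on how the expression values themselves were encoded in the KG.</p>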
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Predicting Diabetes</title>
        <p>Diabetes prediction is formulated as a binary classification task, where the goal is to categorize
a set of patients based on whether they have diabetes or not. Therefore, in the final step, the
patient representations are fed into a decision tree [28] algorithm for training.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Evaluation</title>
      <sec id="sec-5-0">
        <title>4.1. Data</title>
        <p>Three diabetes-related GEO datasets (GSE15932, GSE30208, and GSE55098) are considered for
this work (Table 1). These datasets comprise samples associated with two distinct groups:
patients diagnosed with type 1 diabetes (T1D) and those serving as control subjects (non-T1D).
The data from the three datasets are integrated into a KG described in Table 2.</p>
      </sec>
      <sec id="sec-5-1">
        <title>4.2. Results and Discussion</title>
        <p>To assess the efficacy of the proposed methodology, we analysed diabetes prediction performance
on the GSE15932 dataset by enriching the training data with information from the GSE30208
and GSE55098 datasets. Since our approach involves integrating data from multiple expression
datasets into a KG, we compare it against two baselines that use the expression values
of the patient directly as input for the classifier. The first baseline exclusively employs data
from GSE15932 for training the classifier. The second baseline represents a more simplistic
approach to adding information from other datasets. It merges all measured genes
across datasets and sets the value to 0 when a patient does not have an expression value for a gene.
We employed a stratified cross-validation strategy to ensure robust evaluation, dividing the
GSE30208 dataset into five folds. The same five folds were used throughout all experiments.
The reported results represent the average performance over these five folds. Figure 3 illustrates
the employed cross-validation strategy.
<fn><label>2</label><p>https://www.w3.org/TR/owl2-mapping-to-rdf/</p></fn></p>
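        <p>A stratified split into five folds can be sketched as follows. This is a generic illustration of stratification, assuming binary integer labels and a fixed seed so that the same folds can be reused across experiments; it is not the authors' evaluation script, and the label vector is hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so each fold keeps the class ratio."""
    rng = random.Random(seed)  # fixed seed: identical folds in every experiment
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Hypothetical labels: 10 diabetic (1) and 10 control (0) samples.
labels = [1] * 10 + [0] * 10
folds = stratified_folds(labels, n_folds=5, seed=42)
for k, test_indices in enumerate(folds):
    train_indices = [i for f in folds if f is not test_indices for i in f]
    print(k, sorted(test_indices), len(train_indices))
```
]]></preformat>
        <p>Each fold serves once as the test set while the remaining folds, optionally enriched with samples from other datasets, form the training set.</p>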
        <p>Table 3 shows the accuracy, precision, recall, F-measure, weighted average F-measure, and
the area under the ROC curve for the baselines and the proposed methodology. The second
baseline results indicate that simplistically adding information from other datasets does not
enhance performance. In fact, it appears to introduce noise to the classifier. This outcome is
not unexpected: the information from the different datasets is merely concatenated rather than
integrated, so the additional samples do not help the classifier. However, by integrating the information from
other datasets in a KG, it becomes evident that training a model with diverse datasets improves
the performance of machine learning models in all metrics, with the exception of precision.
This supports our hypothesis that injecting other expression datasets can improve the
performance of machine learning models.</p>
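        <p>For reference, the threshold-based metrics reported above follow from the confusion matrix, as in the sketch below. The function name and the example predictions are hypothetical; the weighted average F-measure and the area under the ROC curve are not covered here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical predictions for six patients (1 = diabetic, 0 = control).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```
]]></preformat>
        <p>Precision and recall can move in opposite directions, which is why an approach may improve every other metric while precision does not improve.</p>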
        <p>However, there are performance variations between the different alternatives of our approach.
For the integration of expression data into the KG, we explore the use of blank nodes and
binning approaches versus a linking method based on expression values to link patients and
genes. In generating patient representations, we employed two strategies: direct learning of
embeddings for patients in the KG; or learning embeddings for genes and representing patients
as the weighted average of gene embeddings. This last strategy is independent of the strategy
employed to represent the expression data in the KG, so Table 3 presents only three alternatives.
Comparing the performance results of Table 3, the strategy involving the weighted average
of gene embeddings for patient representation emerges as particularly promising because it
consistently outperforms the other alternatives. Using links between patients and genes based
on the expression values is the second-best strategy, and it still improves performance across
several metrics compared to the baselines. Employing the binning approach achieves the worst
results, performing worse for many metrics than the baseline. These results may be attributed to
the inherent limitations of our path-based embedding method since genes and gene-expression
values exist on separate paths.</p>
        <p>Since we are interested in investigating the impact of domain-specific knowledge on
integrating data from different datasets, we evaluated the diabetes prediction performance using
a KG that only contains gene expression data. Figure 4 illustrates the performance variations
observed when employing a KG with domain knowledge alongside expression data, compared
to utilizing a KG with expression data alone. The performance decreases when the domain
knowledge is removed for both strategies of building the KG. This suggests that knowledge
about protein functions and interactions can play an important role in integrating data from
datasets measuring gene expression across different genes.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>Several diabetes prediction approaches rely on the analysis of expression data, which provide
a detailed molecular profile reflecting gene activity and regulation and therefore can uncover
relationships between specific genes and the development of diabetes. However, exploring
expression data in machine learning presents its own set of challenges. Existing expression
datasets related to diabetes have a very low number of samples, which can be a limitation for
data-driven methods such as machine learning algorithms. Therefore, the integration of multiple
expression datasets can address the issue of limited samples and, at the same time, offer a
comprehensive perspective on the complex factors influencing diabetes.</p>
      <p>[Figure panels: (a) Using binning approach; (b) Using patient-gene links]</p>
      <p>We have developed an approach that enables a comprehensive representation of gene
expression data from different datasets within a KG. Through semantic links and domain-specific
knowledge, KGs can create a unified knowledge space to connect datasets from distinct studies.
In this work, we have explored different strategies to include the expression data in the KG
and different strategies to represent the patients within the KG using KG embedding methods.
The results of our experiments showed that integrating gene expression data in a KG can
improve the performance of diabetes prediction.</p>
      <p>The proposed approach is versatile and can be extended to the prediction of other diseases. In
addition, since graph neural networks have gained substantial traction recently, as future work
we aim to investigate how these architectures, which are explicitly designed for graph structures, can
be used instead of the conventional process of generating embeddings and giving them as input
to classical machine learning methods such as decision trees.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work presented in this paper has been partly funded by the German Federal Ministry of
Education and Research under grant number 13GW0661C (KI-DiabetesDetektion).</p>
      <ref-list>
        <ref>
          <mixed-citation>[18] … profiling of type 2 diabetes mellitus by bioinformatics analysis, Computational and Mathematical Methods in Medicine 2020 (2020).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[19] D. Chang, I. Balažević, C. Allen, D. Chawla, C. Brandt, R. A. Taylor, Benchmark and best practices for biomedical knowledge graph embeddings, in: Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2020, NIH Public Access, 2020, p. 167.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[20] A. Bordes, N. Usunier, A. Garcia-Durán, J. Weston, O. Yakhnenko, Translating embeddings for modeling multi-relational data, in: Proceedings of NIPS 2013, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2787–2795.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[21] B. Yang, W.-t. Yih, X. He, J. Gao, L. Deng, Embedding entities and relations for learning and inference in knowledge bases, 2015.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[22] P. Ristoski, H. Paulheim, RDF2Vec: RDF graph embeddings for data mining, in: Proceedings of the 15th International Semantic Web Conference, Springer International Publishing, Cham, Switzerland, 2016, pp. 498–514.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[23] E. Clough, T. Barrett, S. E. Wilhite, P. Ledoux, C. Evangelista, I. F. Kim, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M. Sherman, et al., NCBI GEO: archive for gene expression and epigenomics data sets: 23-year update, Nucleic Acids Research 52 (2024) D138–D144.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[24] P. Preisner, H. Paulheim, Universal preprocessing operators for embedding knowledge graphs with literals (2022).</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[25] Gene Ontology Consortium, The Gene Ontology resource: enriching a GOld mine, Nucleic Acids Research 49 (2021) D325–D334.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[26] R. P. Huntley, T. Sawford, P. Mutowo-Meullenet, A. Shypitsyna, C. Bonilla, M. J. Martin, C. O’Donovan, The GOA database: gene ontology annotation updates for 2015, Nucleic Acids Research 43 (2015) D1057–D1063.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[27] D. Szklarczyk, A. L. Gable, K. C. Nastou, D. Lyon, R. Kirsch, S. Pyysalo, N. T. Doncheva, M. Legeay, et al., The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets, Nucleic Acids Research 49 (2021) D605–D612.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[28] J. R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[29] Series GSE30208, 2014. URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE30208.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[30] H. Kallionpää, L. L. Elo, E. Laajala, J. Mykkänen, I. Ricano-Ponce, M. Vaarma, T. D. Laajala, H. Hyöty, J. Ilonen, R. Veijola, et al., Innate immune activity is detected prior to seroconversion in children with HLA-conferred type 1 diabetes susceptibility, Diabetes 63 (2014) 2402–2414.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[31] Series GSE15932, 2012. URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE15932.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[32] Series GSE55098, 2014. URL: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE55098.</mixed-citation>
        </ref>
        <ref>
          <mixed-citation>[33] M. Yang, L. Ye, B. Wang, J. Gao, R. Liu, J. Hong, W. Wang, W. Gu, G. Ning, Decreased miR-146 expression in peripheral blood mononuclear cells is correlated with ongoing islet autoimmunity in type 1 diabetes patients, Journal of Diabetes 7 (2015) 158–165.</mixed-citation>
        </ref>
      </ref-list>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Care</surname>
          </string-name>
          ,
          <article-title>Care in diabetes-2022</article-title>
          ,
          <source>Diabetes Care</source>
          <volume>45</volume>
          (
          <year>2022</year>
          )
          <fpage>S17</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Jaiswal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Negi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <article-title>A review on current advances in machine learning based diabetes prediction</article-title>
          ,
          <source>Primary Care Diabetes</source>
          <volume>15</volume>
          (
          <year>2021</year>
          )
          <fpage>435</fpage>
          -
          <lpage>443</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sonar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>JayaMalini</surname>
          </string-name>
          ,
          <article-title>Diabetes prediction using different machine learning approaches</article-title>
          ,
          <source>in: 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC)</source>
          , IEEE,
          <year>2019</year>
          , pp.
          <fpage>367</fpage>
          -
          <lpage>371</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mujumdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vaidehi</surname>
          </string-name>
          ,
          <article-title>Diabetes prediction using machine learning algorithms</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>165</volume>
          (
          <year>2019</year>
          )
          <fpage>292</fpage>
          -
          <lpage>299</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. K.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <article-title>Diabetes prediction using ensembling of different machine learning classifiers</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>76516</fpage>
          -
          <lpage>76531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bertsimas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kallus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Weinstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. D.</given-names>
            <surname>Zhuo</surname>
          </string-name>
          ,
          <article-title>Personalized diabetes management using electronic medical records</article-title>
          ,
          <source>Diabetes Care</source>
          <volume>40</volume>
          (
          <year>2017</year>
          )
          <fpage>210</fpage>
          -
          <lpage>217</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. S.</given-names>
            <surname>Wells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Spann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Terry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Carr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Landman</surname>
          </string-name>
          ,
          <article-title>Prediction of type II diabetes onset with computed tomography and electronic medical records</article-title>
          ,
          <source>in: Multimodal Learning for Clinical Decision Support and Clinical Image-Based Procedures: 10th International Workshop, ML-CDS 2020, and 9th International Workshop, CLIP 2020, Held in Conjunction with MICCAI 2020</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Turaga</surname>
          </string-name>
          ,
          <article-title>Learning temporal state of diabetes patients via combining behavioral and demographic data</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2081</fpage>
          -
          <lpage>2089</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Qiu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Uncovering the gene regulatory network of type 2 diabetes through multi-omic data integration</article-title>
          ,
          <source>Journal of Translational Medicine</source>
          <volume>20</volume>
          (
          <year>2022</year>
          )
          <fpage>604</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Blomqvist</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cochez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>d'Amato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Melo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kirrane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E. L.</given-names>
            <surname>Gayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Neumaier</surname>
          </string-name>
          , et al.,
          <article-title>Knowledge graphs</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>54</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>37</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Noy</surname>
          </string-name>
          ,
          <article-title>Biomedical ontologies: a functional perspective</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>9</volume>
          (
          <year>2008</year>
          )
          <fpage>75</fpage>
          -
          <lpage>90</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Knowledge graph embedding: A survey of approaches and applications</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>29</volume>
          (
          <year>2017</year>
          )
          <fpage>2724</fpage>
          -
          <lpage>2743</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kulmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. Z.</given-names>
            <surname>Smaili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoehndorf</surname>
          </string-name>
          ,
          <article-title>Semantic similarity and machine learning with ontologies</article-title>
          ,
          <source>Briefings in Bioinformatics</source>
          <volume>22</volume>
          (
          <year>2021</year>
          )
          <elocation-id>bbaa199</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>Identification of type 2 diabetes based on a ten-gene biomarker prediction model constructed using a support vector machine algorithm</article-title>
          ,
          <source>BioMed Research International</source>
          <volume>2022</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mansoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ghaedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sadatamini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vahabpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahimipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <article-title>Downregulation of long non-coding RNAs LINC00523 and LINC00994 in type 2 diabetes in an Iranian cohort</article-title>
          ,
          <source>Molecular Biology Reports</source>
          <volume>45</volume>
          (
          <year>2018</year>
          )
          <fpage>1227</fpage>
          -
          <lpage>1233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bayani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Asadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parvizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mansoori</surname>
          </string-name>
          ,
          <article-title>Type2 diabetes mellitus prediction using data mining algorithms based on the long-noncoding RNAs expression: a comparison of four data mining approaches</article-title>
          ,
          <source>BMC Bioinformatics</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Saeidi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ghaedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sadatamini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vahabpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rahimipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Shanaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mansoori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kazerouni</surname>
          </string-name>
          ,
          <article-title>Long non-coding RNA LY86-AS1 and HCG27_201 expression in type 2 diabetes mellitus</article-title>
          ,
          <source>Molecular Biology Reports</source>
          <volume>45</volume>
          (
          <year>2018</year>
          )
          <fpage>2601</fpage>
          -
          <lpage>2608</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
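          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , et al.,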
          <source>Gene expression</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>