<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Posters and Demos, October</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Conceptual Approach to Using Relevant Patterns in Genomic Data Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mireia Costa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto García S</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Bernasconi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Ceri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Pasto</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Conceptual Modeling, Genomic Datasets, Genomics, Multi-level Querying, Analysis Patterns</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Electronics, Information and Bioengineering - Politecnico di Milano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>PROS Research Center, VRAIN Research Institute - Universitat Politècnica de València</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2</volume>
      <fpage>8</fpage>
      <lpage>31</lpage>
      <abstract>
        <p>Several models have been proposed to represent human genomic information. An interesting approach for supporting genomic applications for health consists of a two-layer representation. In this approach, high-level concepts describing distinct aspects of the human genome at an abstract level are mapped to data representing actual physical measurements. This two-layer method allows users to formulate high-level queries on the concepts and map them onto real datasets. Additionally, the approach is extensible, allowing new conceptual views corresponding to specific genomic features to be mapped to the lower data layer without impacting previous mappings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR
ceur-ws.org</p>
    </sec>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>The Human Genome, with its vast complexity, presents challenges in capturing, representing,
and utilizing its extensive information; consequently, the landscape of genomic data sources is
wide and diverse. Commonly used databases include The Cancer Genome Atlas, a landmark
cancer genomics program now embedded within Genomic Data Commons [1]; the 1000 Genomes
Project [2], a catalog of common human genetic variation; GTEx [3], a resource database to
study the relationship between genetic variation and gene expression in multiple reference
tissues; and GEO [4], the most general and widely used among genomic repositories.</p>
      <p>In the genomics domain, conceptual models have long been employed to effectively manage
and represent extensive data, as well as to accurately depict the structure and functions of
the genome. Starting in the late nineties, pioneers such as Okayama et al. [5] ventured into
representing DNA genomic sequences in databases. In the 2000s, Paton et al. [6] introduced
data models for transcription/translation processes, alongside genomic sequences and protein
structures. Subsequent works leveraged conceptual models to articulate biological entities and
interactions, leading to databases like GenMapper Warehouse [7] and BioMart [8].</p>
      <p>This background research later motivated conceptual modeling-based approaches
focusing on either characterizing the genome’s structure conceptually [9] or applying it in data-driven
contexts [10, 11]. Bridging these perspectives emerged as a pertinent issue, more recently
addressed in [12]. In their proposal, the authors describe a novel conceptual model that merges
concepts-based and data-based perspectives for genomic information modeling. Specifically,
they link a concepts-layer, delineating genome elements and their connections, to a data-layer,
detailing real-world datasets from genome sequencing. This dynamic linkage facilitates focused
visualization, understanding of commonalities, and complex query expression across genomic
data types, expanding the modular view-based approach to genomic data management.</p>
      <p>This work focuses on the perspective of data users who frequently need to access and query
genomic data resources. Genomic data practitioners typically perform similar types of queries
repeatedly. Currently, there are systems that allow for basic data extraction using simple
queries (conjunctive/disjunctive) over data. Examples of such systems include [1] for single
consortia databases and [13] and [14] for integrated databases. However, while basic queries are
supported, more sophisticated approaches tailored for more advanced data analysis purposes
are still lacking.</p>
      <p>We propose a pattern-driven approach to bridge the existing gap, facilitating more complex
and specific data analysis tasks. This approach largely leverages the conceptual linking provided
by the two-layer conceptual model described in [12], serving as a foundation for generating
these query patterns over paramount genomic data sources. The effectiveness of this approach
is demonstrated through the instantiation of query patterns that yield significant results in
contemporary clinical and genetic research. These patterns include the extraction of datasets
for genetic case-control studies [15, 16, 17], integrative multi-omics analyses [18, 19, 20, 21], and
family trio analyses [22, 23]. Our proposal aims to offer a flexible and expandable representation
of concepts, data, and their typical interconnections, providing simple query templates for
guiding concept exploration, inspiring the identification of novel correspondences among data,
and enhancing the findability of interoperable data instances.</p>
      <p>In the remainder of the manuscript, Section 2 provides notions on the two-layer conceptual
model; Section 3 describes our core contribution, i.e., the data analysis patterns and several
example instantiations; Section 4 discusses the implications and limitations of the approach;
and Section 5 concludes.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Two-Layer Genomic Representation</title>
      <p>Our work takes inspiration from the holistic view presented in [12], which bridges a model of the
genomic concepts and a model of the genomic datasets to facilitate genome data management
through robust conceptual modeling support. More specifically, we consider a two-layer
conceptual model: 1) the “concepts-layer” encapsulates human genome mechanism knowledge; and 2)
the “data-layer” portrays genomic data types and experiments through structured information
formats. The abstract idea can be appreciated in Figure 1: genomic information can be viewed
as a dual system approached in opposite directions: connecting data to pre-existing abstract
concepts (top-down) or building concepts based on available data (bottom-up).</p>
      <p>A top-down approach initially models biological entities and then verifies data sources; this
direction allows us to reveal issues with data structure definition and quality. Conversely, a
bottom-up strategy starts from available data and subsequently constructs models to systematize
and organize it, aiming to create user-friendly systems for domain experts.</p>
      <p>[Figure 1: two-layer representation. Concepts layer views: Gene expression view (Gene), DNA methylation view (CpGisland), microRNA expression view (miRNA), and DNA variation view (Variant), with the ExpressionLevel and Reading concepts. Data layer classes: Donor, Biosample, Sample, SampleRegion, Dataset, Schema, ExperimentType, and Project.]</p>
      <p>The data-layer (depicted in blue in Figure 1) centers on the Sample concept, representing
a typical genomic data file, which contains a set of SampleRegions, i.e., rows in the file,
each representing an interval of the genome on a specific chromosome strand, with start and
end coordinates. Multiple samples are collected within Datasets, which are homogeneous
in the Schema (i.e., their sample regions have the same columns and semantics) and in the
ExperimentType (a description of the experimental assay run to produce the data). The
experiment has been performed on biological material, which is described by the BioSample
class, which belongs in turn to a Donor (an actual living patient tissue or an immortalized cell
line or single cells that have undergone a sequencing process). Samples are grouped within
Projects (informing on the management metadata information).</p>
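      <p>As a rough illustration of the data-layer classes just described, the following minimal sketch encodes them as Python dataclasses. All attribute names (e.g., donor_id, values), the helper conforms_to_schema, and the toy identifiers and coordinates are assumptions made for this example; they are not taken from the model in [12].</p>
      <preformat>
```python
from dataclasses import dataclass, field


@dataclass
class Donor:
    donor_id: str


@dataclass
class BioSample:
    donor: Donor
    tissue: str
    is_healthy: bool


@dataclass
class Dataset:
    name: str
    experiment_type: str  # assay that produced the data
    schema: tuple         # column names shared by every region in the dataset


@dataclass
class SampleRegion:
    chromosome: str
    start: int
    end: int
    strand: str
    values: dict = field(default_factory=dict)  # one entry per schema column


@dataclass
class Sample:
    sample_id: str
    project: str
    dataset: Dataset
    biosample: BioSample
    regions: list = field(default_factory=list)


def conforms_to_schema(sample: Sample) -> bool:
    """True when every region carries exactly the columns of the dataset schema."""
    expected = set(sample.dataset.schema)
    return all(set(region.values) == expected for region in sample.regions)


# Toy instance: identifiers, coordinates, and values are invented for illustration.
donor = Donor("donor-1")
biosample = BioSample(donor, tissue="breast", is_healthy=False)
dataset = Dataset("toy_expression", "RNA-Seq", schema=("gene", "expression_level"))
sample = Sample("sample-1", "toy_project", dataset, biosample)
sample.regions.append(
    SampleRegion("chr17", 100, 200, "-", {"gene": "TP53", "expression_level": 12.5})
)
```
</preformat>
      <p>The schema check mirrors the homogeneity requirement on Datasets: every region of a sample is expected to expose the same columns.</p>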
      <p>The concepts-layer has different modules (or views, depicted as light blue rectangles in
Figure 1) describing aspects of the human genome, such as DNA variation, gene or microRNA
(miRNA) expression quantification, DNA methylation levels, or any other genomic data type.
To each experiment type in the data-layer, we associate a given genomic data view (see light
blue arrows). Each view includes classes representing concepts that are measurable through
genomic sequencing technologies (e.g., the expression levels of genes or the reading of a DNA
variation). In Figure 1, these concepts are drawn in red; given concepts could be common to
different views (e.g., the “expression level”).</p>
      <p>The concepts-layer and the data-layer are linked through relationships between concepts (such
as a variation in the DNA) and instances of data-layer classes (i.e., a specific data record). For
example, a SampleRegion from a DNA-Seq experiment can be represented by its corresponding
concept, a Variant spanning positions 43,044,295 to 43,170,245 on the negative strand of
chromosome 17.</p>
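      <p>One possible way to operationalize this link is a coordinate-based lookup: given the position of a concept, select the data records that overlap it. The sketch below assumes closed intervals and a dictionary-based record layout; the function name regions_for_concept and the second and third toy records are hypothetical.</p>
      <preformat>
```python
def regions_for_concept(regions, chromosome, strand, start, end):
    """Select regions on the same chromosome and strand that overlap [start, end]."""
    return [
        r for r in regions
        if r["chromosome"] == chromosome
        and r["strand"] == strand
        and min(r["end"], end) >= max(r["start"], start)
    ]


# The first record uses the coordinates quoted in the text; the others are invented.
regions = [
    {"chromosome": "chr17", "strand": "-", "start": 43044295, "end": 43170245},
    {"chromosome": "chr17", "strand": "+", "start": 43050000, "end": 43060000},
    {"chromosome": "chr1", "strand": "-", "start": 43044295, "end": 43170245},
]
matches = regions_for_concept(regions, "chr17", "-", 43044295, 43170245)
```
</preformat>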
      <p>New links between the concepts and data-layers can be established when specific data
types are selected (in the data-layer), thereby triggering the selection of specific views (of the
concepts-layer). Through a classical Ontology-Based Data Access approach [24], it is possible
to allow access to datasets of a specific genomic data type by specifying a query on the view of
interrelated concepts.</p>
      <sec id="sec-3-1">
        <title>2.1. The Concepts-Layer Model</title>
        <p>While the data-layer is static because new genomic data types are simply another instance
of the related entities, the concepts-layer is flexible and can grow according to the specific
needs of a use case. Figure 2 illustrates a portion of the concepts-layer model containing classes
associated with DNA variations, familial relationships, gene expression, miRNA expression,
and DNA methylation.</p>
        <p>[Figure 2: portion of the concepts-layer model; the surviving fragment shows a ChromosomeElement (name: string, description: string) linked to ElementPosition through located_in.]</p>
        <p>In this model, the Individual is the primary class, representing a person. Individuals
can be classified as a HealthyIndividual or an UnhealthyIndividual based on their
diagnosis of a specific Disease. It is possible to establish familial links among individuals
(FamiliarRelationship); individuals aggregate in GroupOfIndividuals, such as Family. These
aggregations are modeled with the purpose, for instance, of exploring the interaction between
DNA variations and diseases within families; this aspect is crucial for determining patterns of
inheritance and the pathogenicity of variants. Individuals are composed of many Locations,
such as Tissues.</p>
        <p>Different Measurements are performed on individuals. Two types of measurements are
captured in the model, as relevant to our portion. The Reading describes the appearance of
a DNA variation (or Variant) in an individual. Variants are distinguished by a name and
description, a type (substitution, insertion, or deletion), and a set of alleles (i.e., reference,
alternative, and ancestral). Each variant can have multiple positions (VariantPosition), each
determined according to a specific reference system, also known as Assembly. Variants are
crucial in understanding the genetic basis of many diseases.</p>
        <p>The second type of measurement is the ExpressionLevel, which is always related to a
genetic component. Unlike the previously mentioned type of measurement (i.e., readings), the
expression level is specific for a given Tissue, with significant differences among tissues. Three
different expression levels are considered in the excerpt:
• GeneExpression: a biological process that ensures correct Genes (which
are TranscriptableElements) are expressed at the right time and in appropriate amounts, enabling cells
to perform their functions correctly. Gene expression measurement can help identify
differentially expressed genes between normal and cancerous tissues.
• miRNAExpression: a biological process associated with biological components that
regulate gene expression. miRNA expression measurements capture the levels of miRNA,
a kind of non-coding RNA (ncRNA), corresponding to a MatureTranscript (as opposed
to genes, which are a kind of PrimaryTranscript). Measurement of miRNA levels allows
for a better understanding of cancer development and progression, providing insight into
the regulatory mechanisms underlying cancer.
• DNAMethylation: a biological process altering gene expression, happening in
correspondence with CpGIslands, particular featured regions in Chromosomes. Measuring
DNA methylation is crucial to understanding how environmental factors affect, for
instance, cancer development and progression.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Concepts-Data Typical Analysis Patterns</title>
      <p>
        As suggested in [12], the two-layer representation, which draws direct links between concepts
and data in the genomic domain, allows:
(a) real data inspection improving its conceptual representation (e.g., by identifying cases
where many different variant positions exist from chromosome elements or variants);
(b) use of abstract knowledge (i.e., concepts-layer) as an extractor of existing datasets (i.e.,
data-layer), for instance, by leveraging the explicit conceptual relation between positions
and elements (including genes and transcripts); and
(c) formulation of inter-data type queries over data (i.e., coming from different Measurement
types), by controlling the datasets’ concepts that regard different genomic data types.
These aspects could be translated into simple view-driven queries, where concepts are selected
in the upper layer and are translated into queries over the data. Here, we propose to make a
step forward with respect to (a)–(c): we use conceptual linking as a glue for generating classical
genomic analysis patterns that are typically used in research practice. In this section, we describe
the most relevant ones, selected according to the enduring experience of the authors in the
field, developed during several interdisciplinary collaborations with clinicians, biologists, and
geneticists. In this work, we focus on:
(1) observational studies in which two existing groups that differ in outcome (e.g., healthy
or non-healthy) are compared based on a supposed causal attribute (e.g., presence of a
DNA mutation);
(2) biological analyses in which the datasets are multiple “omes” (e.g., the genome and the
transcriptome) used to study life overlapping multiple layers; and
(3) data analyses that investigate aspects within the genetic hierarchical relationships of
families (e.g., causal variations for inherited diseases).
      </p>
      <p>
        Finally, we show how patterns can be combined into complex patterns, e.g., joining the approaches
described in (1) and (3); additional patterns can be built along similar lines. Next, we describe
the patterns one by one, presenting relevant examples from the scientific literature and showing a
UML instance diagram [25] depicting the concepts-data-layers linking.
      </p>
      <sec id="sec-4-1">
        <title>3.1. Case-Control Studies</title>
        <p>Case-control studies constitute a commonplace method in clinical research aimed at comparing
diverse genomic datasets from ill individuals (cases) and unaffected individuals (controls) to
delineate genetic elements that contribute to increased susceptibility to or severity of disease.
Publicly accessible data repositories such as The Cancer Genome Atlas (TCGA, [1]) for cases and
The GTEx Consortium [3] for controls are fundamental for increasing sample sizes and identifying
cases and controls in scenarios where they were previously unavailable, thereby enhancing the
efficiency and robustness of genomic studies. Two types of case-control analyses are typically
produced:
1. At the population-level, examining cases and controls using data that is specific to a
particular tissue, with the purpose of investigating how diseases or phenotypic traits
affect that tissue.
2. At the patient-level, analyzing both healthy and diseased tissue samples extracted from
the same patient to determine the impact of cancer processes on a given tissue.
Population-level case-control. The advantages of population-level case-control analyses
focusing on the same tissue type have been extensively described in the literature. In the
concrete domain of cancer genomics, such studies typically involve comparing healthy tissues
(derived from patients without cancer) of the same tissue type as those giving rise to cancer
(obtained from patients with a specific cancer subtype).</p>
        <p>For instance, Aran et al. [15] combined data from the TCGA and GTEx projects to analyze
gene expression disparities between healthy and tumor samples across eight tissues and their
corresponding tumor types. This approach facilitated the comprehension of tumor development and the discovery of
novel biomarkers, critical for effective prevention and the selection of therapeutic strategies.</p>
        <p>In Figure 3, we present a simplified version of the scenario outlined in [15], featuring two
distinct patients: one afflicted (extracted from TCGA, with ID = “TCGA-A2-A04N”) and one
healthy (extracted from GTEx, with ID = “074b0792-df3c-4b59-9f50-793bc14bcb81”) individual.</p>
        <p>Note that the afflicted patient has a non-healthy biosample (is_healthy = false) associated
with the Ductal and Lobular Neoplasm disease.</p>
        <p>The process of selecting cases and controls with specific conditions and from the same tissue
type within such datasets is nontrivial and necessitates sophisticated instance modeling. For
example, identifying patients who are “male, white, and 79 years old” within TCGA [1] is not
feasible. Consequently, pairing cases and controls entails not only ontological mediation (via
the concepts-layer) but also an understanding of the data sources. Our proposed approach
streamlines this process by enabling a consistent representation of biological concepts and
a technology-independent data representation. The expression levels are observed on
a specific gene (TP53), which is fixed at the conceptual level and therefore searched in the
SampleRegions of GTEx and TCGA Samples to extract appropriate values.</p>
        <p>Patient-level case-control. Numerous studies have highlighted the clinical advantages of
performing case-control analyses at the patient level. For individual patients, the analysis
involves comparing samples from tumor-adjacent tissue (considered as controls) and tumor tissue
(case). Collectively, these two types of samples are referred to as paired samples.</p>
        <p>In [16], Kim et al. analyzed paired samples from patients with Colon Adenocarcinoma, showing
that this type of analysis significantly impacts the prediction of cancer recurrence. Oh and
Lee [17], instead, examined the differences in gene expression between paired samples in Lung
Adenocarcinoma and Breast Invasive Carcinoma, among others. Using machine learning models,
they concluded that such analyses can aid in predicting the prognosis of certain cancers, thus
facilitating appropriate clinical treatments. Both studies employed the TCGA public resource
to obtain patient data. In Figure 4, we show the case of a patient possibly included in this
patient-level case-control analysis [17]. This patient, diagnosed with Lung Adenocarcinoma (id
= “TCGA-44-6146”), holds paired samples available in TCGA.</p>
        <p>TCGA is a repository that provides information on files resulting from specific genomic
analyses of healthy and tumor tissues from cancer patients. The analytical nature of TCGA
makes the search for paired samples challenging, requiring advanced data processing and
search functionality to identify analyses from the same patient (e.g., files associated with the
same Donor). Currently, this is not allowed even by the updated TCGA major entry point [1].
Note that, in the concepts-data framework, the data-layer enables easy identification of the
patient (or Donor) from whom each sample originates, while the concepts-layer facilitates the
identification of both samples as belonging to the same individual.</p>
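        <p>To make the patient-level pattern concrete, the sketch below groups sample records by Donor and tissue and keeps only the groups that hold both a healthy and a tumor sample. It is a minimal illustration under assumed field names (sample_id, donor, tissue, is_healthy); the sample identifiers and the second donor are invented, and no real TCGA interface is involved.</p>
        <preformat>
```python
from collections import defaultdict


def paired_samples(samples):
    """Map (donor, tissue) to (control_ids, case_ids), keeping only complete pairs."""
    groups = defaultdict(lambda: {"controls": [], "cases": []})
    for s in samples:
        role = "controls" if s["is_healthy"] else "cases"
        groups[(s["donor"], s["tissue"])][role].append(s["sample_id"])
    return {
        key: (g["controls"], g["cases"])
        for key, g in groups.items()
        if g["controls"] and g["cases"]
    }


# Sample identifiers and the second donor are hypothetical.
samples = [
    {"sample_id": "s-normal", "donor": "TCGA-44-6146", "tissue": "lung", "is_healthy": True},
    {"sample_id": "s-tumor", "donor": "TCGA-44-6146", "tissue": "lung", "is_healthy": False},
    {"sample_id": "s-other", "donor": "donor-x", "tissue": "lung", "is_healthy": False},
]
pairs = paired_samples(samples)
```
</preformat>
        <p>Only the donor with both kinds of samples survives the filter, which corresponds to the two-instance replication of the data-layer around a single individual.</p>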
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Integrative Multi-Omics Studies</title>
        <p>Multi-omics approaches are innovative frameworks that integrate multiple omics datasets to
enhance understanding of genetic diseases [26]. Particular attention is given to multi-omics
studies of cancer’s molecular and clinical features. Here, areas of research include
segmentation into subtypes, improvement of survival predictions and therapeutic outcomes,
and uncovering key pathophysiological processes across different molecular layers. Again, we
refer to the TCGA data source, although other cancer genomics sources could also be considered (see the
ICGC [27]). For multi-omics analysis, two types of inquiries typically hold significant interest:
1. At the population-level, analyzing a specific disease or phenotypic trait, by using genomic
samples that refer to different genomic data types (i.e., features in the genome).
2. At the patient-level, analyzing a specific disease or phenotypic trait, considering specific
patients whose samples have been analyzed according to multiple genomic data tests (i.e.,
for whom multiple data types are available).</p>
        <p>Population-level multi-omics. Associating variation signatures or gene
expression/methylation/miRNA profiles with diagnostic/prognostic values is of high importance in cancer research.
Pinoli et al. [18] examine the rich presence of variants, abnormal methylation levels, as well as
copy number alteration events, in the proximity of specific topological structures for 26 cancer
types. Mehrgou and Teimourian [19] utilize gene expression, methylation, and miRNA datasets
from both TCGA and GEO sources to derive insights on Colorectal cancer, with applications in
diagnosis, prognosis, and targeted therapy.</p>
        <p>Patient-level multi-omics. Focusing on specific tissues, we aim to find patients whose
biological samples have been analyzed using different genomic experiments (i.e., for whom
multiple data types are available). Grouping data by the same patient enables building richer
disease models. We call this pattern one-to-one linking and refer to the multiple samples derived
from the same patient as linked multi-omics samples (connecting mutations, expression, and
epigenomic signals such as methylation levels).</p>
        <p>Figure 5 illustrates a patient (ID = “TCGA-33-4589”) with lung adenocarcinoma for whom
data on variants, methylation levels, miRNA, and gene expression are available in TCGA.
The comprehensive data of this patient facilitates multiple analyses with significant clinical
applications. For instance, in [20], miRNA and gene expression data were analyzed in patients
with lung adenocarcinoma to classify patients based on survival. This has crucial implications
for cancer prognosis, enabling the identification of patients who may require more intensive
monitoring due to a poor prognosis. Similar studies have been conducted for survival prediction
in breast cancer, utilizing gene and miRNA expression, DNA methylation, and CNV data [21].</p>
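        <p>The search for linked multi-omics samples can be sketched as a coverage test: a donor qualifies when the experiment types of their samples include every data type of interest. The record layout, the function name linked_multi_omics, the experiment-type labels, and the second donor are assumptions made for this illustration.</p>
        <preformat>
```python
from collections import defaultdict


def linked_multi_omics(samples, required_types):
    """Return donors whose samples cover every required experiment type."""
    types_by_donor = defaultdict(set)
    for s in samples:
        types_by_donor[s["donor"]].add(s["experiment_type"])
    required = set(required_types)
    return sorted(d for d, types in types_by_donor.items() if required.issubset(types))


# Labels for the four data types are illustrative, not repository vocabulary.
required = ["variants", "methylation", "miRNA expression", "gene expression"]
samples = [{"donor": "TCGA-33-4589", "experiment_type": t} for t in required]
samples.append({"donor": "donor-y", "experiment_type": "gene expression"})
linked = linked_multi_omics(samples, required)
```
</preformat>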
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Family Trio Analyses</title>
        <p>Rare disorders are conditions with a low frequency in the population and often have a genetic
component. Despite the significance of genetics in these disorders, most patients remain
undiagnosed after standard genetic testing [8]. Family trio analysis involves comparing the
genetic information of the patient with that of their parents. Consequently, it is possible
to identify de novo variations, i.e., DNA variations unique to the patient and not inherited
from either parent. This kind of analysis has been shown to positively impact rare disease
contexts, by improving diagnosis and serving as a powerful tool in identifying disorder-causing
variations [23].</p>
        <p>Figure 6 illustrates information about a family trio reported in [22]. In this study, the authors
examine family trios in the context of Amyotrophic Lateral Sclerosis (ALS) to identify risk factor
variants associated with this devastating disease. They identified several de novo variations, such
as the v1 instance of the Variant class in Figure 6. This variant is considered de novo because it
was identified only in the affected individual (son instance of the UnhealthyIndividual class)
and not in either parent (father and mother instances in the concepts-layer). The identified
variants helped the authors improve the understanding of the genetic role in ALS.</p>
        <p>In this particular pattern, the data-layer represents fundamental information about the
experiment, such as the sequencing technology, which cannot be represented in the
concepts-layer. On the other hand, the concepts-layer allows us to infer that the variant shown in
Figure 6 is de novo, as it captures the familial relationships between individuals. The ontological
connection between both models provides a holistic representation of all the relevant information
needed for family trio analysis.</p>
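        <p>The de novo inference can be expressed as a set difference once the familial relationships are known: the variants read in the child minus those read in either parent. In this minimal sketch, variants are represented as (chromosome, position, reference, alternative) tuples, which is an assumption of the example; the coordinates and alleles are invented.</p>
        <preformat>
```python
def de_novo_variants(child, father, mother):
    """Variants observed in the child and in neither parent."""
    return child - (father | mother)


# Invented variants; v1 plays the role of the de novo variant of the example.
v1 = ("chr1", 1000, "A", "G")
v_inherited = ("chr2", 2000, "C", "T")

son = {v1, v_inherited}
father = {v_inherited}
mother = set()

result = de_novo_variants(son, father, mother)
```
</preformat>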
        <p>An important genomic data source, the 1000 Genomes Project, collected a huge dataset
intending to identify all the genetic variants with frequencies of at least 1% in several
worldwide populations. The last release of the project covered 26 populations and observed single
nucleotide variants (SNVs) and insertions/deletions (indels) from 602 different parent/child trios
produced within the project [28]. The pattern described in Figure 6 can be reproduced on 1000
Genomes data to perform a database-wide analysis on family-trio samples.</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Complex Patterns</title>
        <p>Above, we have demonstrated traditional data analysis patterns in genomics. However, more
complex analysis patterns have gained attention.</p>
        <p>One example of a complex pattern involves family-trio case-control analysis, which
corresponds to performing case-control analyses (see Section 3.1) on family trios (see Section
3.3). Specifically, it compares the genetic information of families with an affected individual
to families with no affected individuals. In [29], this strategy was employed to identify genes
and mutation types that are highly associated with Schizophrenia. Figure 7 illustrates that
scenario, by using only two families (for simplification purposes). Here, it can be observed that
the v2 Variant appears in the offspring of both families; this insight can be used to rule out the
association of this variation with schizophrenia (as s1 is an UnhealthyIndividual, whereas s2
is a HealthyIndividual). Conversely, the v1 Variant appears only in the family member s1,
who is affected by schizophrenia; however, it does not appear in any of her/his parents, which
offers strong evidence of the potential relationship between variant v1 and schizophrenia.</p>
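        <p>The reasoning in this example can be sketched by combining the two basic patterns: de novo variants are computed per affected trio, and variants also seen in unaffected offspring are discarded. The dictionary layout, the function name candidate_variants, and the parental variants p1 and p2 are assumptions of this illustration; only the roles of v1, v2, s1, and s2 follow the scenario described above.</p>
        <preformat>
```python
def candidate_variants(families):
    """De novo variants of affected offspring that no unaffected offspring carries."""
    affected_de_novo = set()
    seen_in_healthy = set()
    for fam in families:
        if fam["child_affected"]:
            affected_de_novo |= fam["child"] - (fam["father"] | fam["mother"])
        else:
            seen_in_healthy |= fam["child"]
    return affected_de_novo - seen_in_healthy


families = [
    # Family of s1 (affected): carries v1 and v2, neither found in the parents.
    {"child": {"v1", "v2"}, "father": {"p1"}, "mother": set(), "child_affected": True},
    # Family of s2 (unaffected): carries v2 only.
    {"child": {"v2"}, "father": set(), "mother": {"p2"}, "child_affected": False},
]
candidates = candidate_variants(families)
```
</preformat>
        <p>With these toy inputs, v2 is ruled out because the unaffected offspring also carries it, and v1 remains as the candidate.</p>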
        <p>Another example of a complex pattern is multi-omics case-control analyses. Here, experts
compare different data types from patients with a certain characteristic (cases) to those without
it (controls) to determine if there is a clinically relevant relationship between any omic feature
and the characteristic under study. For instance, in [30], the authors used multi-omic data to
predict the risk of developing asthma, and in [31], they employed this type of analysis to predict
the development of preeclampsia.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <p>The two-layer framework described in Section 2 allows us to incorporate new concepts and
relationships according to a data-agnostic approach. Indeed, as acquired knowledge in genomics
is constantly evolving, new concepts will be added and changes will recur in the
concepts-layer. The model will not remain the same as the one presented in Figure 2, which, in
turn, extends previous work [12]. Conversely, the data-layer is typically not impacted by genomic
concepts’ changes as long as all genomic data can be represented as Samples containing
SampleRegions. Even when experts’ understanding of genomic-related knowledge changes,
possibly impacting the interpretation of data analysis results, data keeps the same model; this
favors the maintainability of potential data mappings, processing pipelines, and bio-tools that
leverage this representation.</p>
      <p>Here, we show that a strong connection between data and concepts compensates for the
limitations of approaches that consider the layers separately, allowing a holistic representation
of the genomic domain. The identification of interesting patterns of analysis and the consequent
reasoning can only be explained by using an interactive two-layer representation. Our rationale
is to use the conceptual linking as a glue for generating classical patterns of case-controls,
multi-omics, or family-trios by having the concepts-layer model in the middle and instantiating
the data-layer model as many times as needed. At the patient-level, case-controls typically have
two-instance-replication (see Figure 4). Instead, multi-omics have many-instance-replication;
we showed four, in the example of Figure 5, replicating the data model for four genomic data
types, thereby creating one copy for each sample of the same patient. In this way, we let classical
genomic data analysis patterns emerge, where we “pivot” upon ontological knowledge
(concepts-layer) as the mediator across several instantiations of the data-layer, playing clearly identified
roles. We demonstrated the capability to perform queries with high complexity, which facilitates
the extraction of relevant data from highly heterogeneous, disorganized repositories and the
advancement of data exploitation in the domain.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>In this paper, we explain five basic patterns and then hint at how they can be composed to
form more complex patterns. This framework will allow easy extension to novel query patterns
that keep pace with the rapidly evolving state of genomic knowledge. In current practice,
domain experts typically navigate genomic data source interfaces and download data without
a clear formalization of the underlying concepts and their semantic relationships. In this
work, we describe a conceptual modeling-based framework that enables a unified querying
strategy. Building on this, we envision a next-generation genomic data query builder that,
starting from high-level concepts, allows users to execute abstract queries. This approach will
relieve practitioners from the complexities of data formats and heterogeneity, enabling them
to seamlessly formulate data extractions that align more closely with the classical problem
formulations they are familiar with. The patterns presented in this work demonstrate initial
prototypes of modular queries that can be implemented in such a system. Prospectively,
this data query builder will be the main component of a visual model-driven query system
for practitioners, where the conceptual model in the concepts layer is used to identify data
instances in the underlying data layer.</p>
      <p>Acknowledgements. This work is supported in part by the CIPROM/2021/023,
PDC2021-121243I00, PID2021-123824OB-I00, MICIN/AEI/10.13039/501100011033, and ACIF/2021/117 grants.</p>
      <p>
[12] A. Bernasconi, et al., PoliViews: A comprehensive and modular approach to the conceptual
modeling of genomic data, Data Knowl Eng 147 (2023) 102201.
[13] F. Albrecht, et al., DeepBlue epigenomic data server: programmatic data retrieval and
analysis of epigenome region sets, Nucleic Acids Res 44 (2016) W581–W586.
[14] A. Canakoglu, et al., GenoSurf: metadata driven semantic search system for integrated
genomic datasets, Database-Oxford 2019 (2019).
[15] D. Aran, et al., Comprehensive analysis of normal adjacent to tumor transcriptomes, Nat Commun 8 (2017).
[16] J. Kim, et al., Transcriptomes of the tumor-adjacent normal tissues are more informative
than tumors in predicting recurrence in colorectal cancer patients, J Transl Med 21 (2023)
209.
[17] E. Oh, H. Lee, Transcriptomic data in tumor-adjacent normal tissues harbor prognostic
information on multiple cancer types, Cancer Med 12 (2023) 11960–11970.
[18] P. Pinoli, et al., Pan-cancer analysis of somatic mutations and epigenetic alterations in
insulated neighbourhood boundaries, PLoS One 15 (2020) e0227180.
[19] A. Mehrgou, S. Teimourian, Update of gene expression/methylation and miRNA profiling
in colorectal cancer; application in diagnosis, prognosis, and targeted therapy, PLoS One 17
(2022) e0265527.
[20] K. Asada, et al., Uncovering prognosis-related genes and pathways by multi-omics analysis
in lung cancer, Biomolecules 10 (2020) 524.
[21] L. Tong, et al., Deep learning based feature-level integration of multi-omics data for breast
cancer patients survival analysis, BMC Med Inform Decis 20 (2020) 225.
[22] A. Chesi, et al., Exome sequencing to identify de novo mutations in sporadic ALS trios, Nat Neurosci 16 (2013).
[23] M. Mousa, et al., Whole-exome sequencing in family trios reveals de novo mutations
associated with type 1 diabetes mellitus, Biology 12 (2023) 413.
[24] D. Calvanese, et al., Ontology-based database access, in: SEBD, 2007, pp. 324–331.
[25] G. Booch, et al., The Unified Modeling Language User Guide, Addison-Wesley, Reading, MA, 1999.
[26] S. Graw, et al., Multi-omics data integration considerations and study design for biological
systems and disease, Mol Omics 17 (2020).
[27] J. Zhang, et al., The international cancer genome consortium data portal, Nat Biotechnol
37 (2019) 367–369.
[28] M. Byrska-Bishop, et al., High-coverage whole-genome sequencing of the expanded 1000
genomes project cohort including 602 trios, Cell 185 (2022) 3426–3440.
[29] B. Xu, et al., De novo gene mutations highlight patterns of genetic and neural complexity
in schizophrenia, Nat Genet 44 (2012) 1365–1369.
[30] X.-W. Wang, et al., Benchmarking omics-based prediction of asthma development in
children, Respir Res 24 (2023) 63.
[31] A. Rahnavard, et al., Molecular epidemiology of pregnancy using omics data: advances,
success stories, and challenges, J Transl Med 22 (2024).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Grossman</surname>
          </string-name>
          , et al.,
          <article-title>Toward a shared vision for cancer genomic data</article-title>
          ,
          <source>New Engl J Med</source>
          <volume>375</volume>
          (
          <year>2016</year>
          )
          <fpage>1109</fpage>
          -
          <lpage>1112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <collab>1000 Genomes Project Consortium</collab>
          ,
          <article-title>A global reference for human genetic variation</article-title>
          ,
          <source>Nature</source>
          <volume>526</volume>
          (
          <year>2015</year>
          )
          <fpage>68</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lonsdale</surname>
          </string-name>
          , et al.,
          <article-title>The genotype-tissue expression (GTEx) project</article-title>
          ,
          <source>Nat Genet</source>
          <volume>45</volume>
          (
          <year>2013</year>
          )
          <fpage>580</fpage>
          -
          <lpage>585</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Barrett</surname>
          </string-name>
          , et al.,
          <article-title>NCBI GEO: archive for functional genomics data sets-update</article-title>
          ,
          <source>Nucleic Acids Res</source>
          <volume>41</volume>
          (
          <year>2012</year>
          )
          <fpage>D991</fpage>
          -
          <lpage>D995</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Okayama</surname>
          </string-name>
          , et al.,
          <article-title>Formal design and implementation of an improved DDBJ DNA database with a new schema and object-oriented library</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>14</volume>
          (
          <year>1998</year>
          )
          <fpage>472</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. W.</given-names>
            <surname>Paton</surname>
          </string-name>
          , et al.,
          <article-title>Conceptual modelling of genomic information</article-title>
          ,
          <source>Bioinformatics</source>
          <volume>16</volume>
          (
          <year>2000</year>
          )
          <fpage>548</fpage>
          -
          <lpage>557</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Do</surname>
          </string-name>
          , E. Rahm,
          <article-title>Flexible integration of molecular-biological annotation data: The GenMapper approach</article-title>
          , in: International Conference on Extending Database Technology, Springer,
          <year>2004</year>
          , pp.
          <fpage>811</fpage>
          -
          <lpage>822</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Smedley</surname>
          </string-name>
          , et al.,
          <article-title>The BioMart community portal: an innovative alternative to large, centralized data repositories</article-title>
          ,
          <source>Nucleic Acids Res</source>
          <volume>43</volume>
          (
          <year>2015</year>
          )
          <fpage>W589</fpage>
          -
          <lpage>W598</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>García</surname>
          </string-name>
          , et al.,
          <article-title>Towards the understanding of the human genome: a holistic conceptual modeling approach</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>197111</fpage>
          -
          <lpage>197123</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernasconi</surname>
          </string-name>
          , et al.,
          <article-title>Conceptual modeling for genomics: building an integrated repository of open data</article-title>
          ,
          in:
          <source>Int. Conference on Conceptual Modeling</source>
          , Springer,
          <year>2017</year>
          , pp.
          <fpage>325</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gundersen</surname>
          </string-name>
          , et al.,
          <article-title>Recommendations for the fairification of genomic track metadata</article-title>
          ,
          <source>F1000Research</source>
          <volume>10</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>