=Paper=
{{Paper
|id=Vol-1546/paper_12
|storemode=property
|title=Linked Functional Annotation For Differentially Expressed Gene (DEG) Demonstrated using Illumina Body Map 2.0
|pdfUrl=https://ceur-ws.org/Vol-1546/paper_12.pdf
|volume=Vol-1546
|authors=Alokkumar Jha,Yasar Khan,Aftab Iqbal,Achille Zappa,Muntazir Mehdi,Ratnesh Sahay,Dietrich Rebholz-Schuhmann
|dblpUrl=https://dblp.org/rec/conf/swat4ls/JhaKIZMSR15
}}
==Linked Functional Annotation For Differentially Expressed Gene (DEG) Demonstrated using Illumina Body Map 2.0==
<pdf width="1500px">https://ceur-ws.org/Vol-1546/paper_12.pdf</pdf>
<pre>
Linked Functional Annotation For Differentially
  Expressed Gene (DEG) Demonstrated using
            Illumina Body Map 2.0

    Alokkumar Jha , Yasar Khan, Muntazir Mehdi, Aftab Iqbal, Achille Zappa,
               Ratnesh Sahay, and Dietrich Rebholz-Schuhmann

               Insight Centre for Data Analytics, NUI Galway, Ireland
                                       Galway
     {alokkumar.jha,yasar.khan,aftab.iqbal,achille.zappa,muntazir.mehdi,
                   ratnesh.sahay,rebholz}@insight-centre.org

        Abstract. Semantic Web technologies are core for the integration of
        disparate data resources. It can be used to exploit data from next gen-
        eration sequencing (NGS) for therapeutic decisions regarding cancer. In
        this manuscript, we describe how different data resources, which inform
        on the expression of specific genes in a tissue and its variants, can be
        brought together to indicate a risk for tissue-specific cancer for NGS
        data. This approach can be used to judge patient genomic data against
        public reference data resources.
        The TCGA and COSMIC repositories are being processed to connect and
        query information concerning the expression of genes, copy number vari-
        ants (CNV), and somatic mutations. We annotated sets of differential ex-
        pression data provided from the Illumina Body map 2.0 (HBM) concern-
        ing 16 different tissue types and identify genes with an RPKM (Reads Per
        Kilobase of transcript per Million mapped reads) value greater than 0.5
        as measure indicating an associated risk for cancer. Thus, the differential
        expressed genes from HBM can be associated with a tissue type and gene
        expressions in COSMIC and TCGA leading to a potential biomarker for
        that particular tissue specific cancer. In the case of ovarian cancer, we
        retrieved the genomic positions (loci) and the associated genes of poten-
        tial biomarker candidates, and suggest that this approach and platform
        can serve future studies well.
        Altogether, the presented linked annotation platform is the first approach
        to represent the COSMIC data in an RDF format and to link the data
        with the TCGA datasets. The proposed approach enriches mutations by
        filling in missing links from COSMIC and TCGA datasets which in turn
        helped to map mutations with associated phenotypes.
        Keywords: Differentially Expressed genes(DEG),Linked data, Clinical
        genomics,Copy Number Variation (CNV)


1     Introduction
Next Generation Sequencing (NGS) technologies open new diagnostic and ther-
apeutic ways for cancer research. However, the resulting high-throughput se-
quencing data has to be processed in complex data analytics pipelines including
annotation services. Unfortunately, there is not yet a well-integrated platform
available for both clinical and translational [12] research to fulfill these anno-
tation and analytical tasks. In addition, the large volumes of NGS data poses
another challenge, since the computational infrastructure for the biological in-
terpretation will have to cope with very large quantities of data originating from
clinical facilities. Last, but not least, the functional annotation of genomics data
for cancer has to take tissue specificity into consideration and thus has to avoid
ambiguity while aggregating clinical outcomes from disparate resources. In this
paper we focus on exploring gene expression patterns across different cancer and
tissue types. Our experiments are based on semantic integration of gene expres-
sion, CNV, complete mutation data from two disjoint resources, i.e. COSMIC1
and TCGA2 . By doing this we can assist in variant and mutation prioritization
using 16 different tissue types given by the Illumina Body Map 2.0 and evaluated
in a case study for Ovarian cancer.
    In order to link and retrieve patterns of a gene and tissue specific information
from various cancer mutation (TCGA) and database with global mutation list
and mutation type (COSMIC), we encountered the following three challenges:
(i) to transform heterogeneous data repositories and their storage formats into
standard RDF; (ii) to discover associations (aka. links) by finding specific pat-
terns (i.e. correlations) for a gene with regards to CNV, mutation and its gene
expression data sets; and (iii) to query in a scalable way the large volume and
frequently updating datasets covering 16 different tissue types and the gene ex-
pression data from different repositories.
    The experiments conducted in this paper is aligned to the transcriptome
study based on the Human Body Map 2.0 (HBM)3 from Illumina which covers
the following tissues: adrenal, adipose, brain, breast, colon, heart, kidney, liver,
lung, lymph, ovary, prostate, skeletal muscle, testes, thyroid, and white blood
cells. The HBM provides gene specific information across one or more tissue types
and intends to support the identification of potential biomarker for targeted
therapy. In this study our results not only depicts novel biological outcomes but
also provides a linked annotation framework that assimilates clinical outcomes
from related data repositories.
    The rest of the paper is structured as follows: Section 2 motivates our working
scenario exploring on the HBM use case and the annotation databases; Section 3
presents the methodology and architecture of the proposed functional annotation
framework; Section 4 gives an evaluation of the functional annotation framework;
Section 6 presents the related work in linking the TCGA repository and Section 7
draws the conclusion from our work.
2       Motivation
In order to understand the outbreak of disease, in particular cancer, it is one ap-
proach to compare normal and diseased tissue samples to interpret the changes in
the expression patterns of the genes with regards to the observed disease status.
    1
      http://cancer.sanger.ac.uk/cosmic
    2
      https://tcga-data.nci.nih.gov/tcga/
    3
      https://www.ebi.ac.uk/gxa/experiments/E-MTAB-513
In our case, HBM serves the purpose to identify similarities in gene expression
patterns using the studies across different tissue types, where HBM discloses
the similarities between human tissues on the molecular and genetic level. Due
to overlaps between cancer behaviours, progression and mutated genes, we have
annotated top 100 genes distilled by our filtering criteria with COSMIC to ex-
plore previously observed studies from TCGA database, e.g., somatic mutation,
genomic loci and other mutations linked to these genes retrieved from healthy
tissues.
Human Body Map (HBM) 2.0 from Illumina: HBM covers data from
transcriptome studies for 16 tissue types (see above). Samples for these 16 tis-
sue types have been processed, aligned and finally expression level have been
determined [1]. Sequencing has been performed to provide both paired-end and
single-end libraries (read-length of 50bp and 75bp). Therefore, the data process-
ing platform requires a list of differentially expressed genes as input, which is
the outcome of the RNA seq data analysis pipeline.
    The gene expression data extracted from HBM samples returns a very large
list of more than 52000 genes. For data processing reasons we chose to reduce
the list and therefore defined the cut off for each RPKM value according to the
method suggested by Sandberg et. al[8]. As a result, the data for each tissue
type includes both the coverages and the RPKM values as the corresponding
expression level. In addition, the RNA seq data set provides further relevant data
such as CNV, fusion genes, structural variation, differentially expressed genes,
novel mutations, splice junctions and trascriptome variations [4]. Identifying the
associations and relations between these datasets, i.e. the logical connections,
enables further insightful research into the cancer disease the biological and
clinical interpretation of given data.
Annotation databases (COSMIC & TCGA): The main focus of this work
is the identification of patterns for cancer mutations (given by TCGA) and glob-
ally known mutations and their types (given by COSMIC) for selected differen-
tially expressed genes across different tissue types. Figure 1 shows the correspon-
dences, i.e., the associations or connections that have been established between
the TCGA and COSMIC databases for this purpose. For this task our primary
concern have been the associations between the CNV, the known mutations and
the gene expression data.


                                                                         GENE_EXPRESSION     COMPLETE_MUTATION      CNV(COPY NUMBER)
       GENE_EXPRESSION      COMPLETE_MUTATION         CNV(COPY NUMBER)
                                                                         ID_SAMPLE         Gene name             ID_SAMPLE
       Composite Element   Hugo_Symbol                                                                           ID_tumour
                                                      Chromosome         SAMPLE_NAME       Sample name
       REF Protein                                    Start              GENE_NAME         Tumour origin         Primary site
       Expression          Entrez_Gene_Id                                                                        Histology subtype
                                                      End                REGULATION        TCGA_ID
                                                      Segment_Mean                         ID_tumour             MUT_TYPE
                           ChromosomeStart_Position                                                              Chromosome:G_Start..G_Stop
                                                                                           Sample source
                           UID                                                             ……..

                           Validation Status
                           ……


                       Fig. 1: Links between COSMIC and TCGA repositories
   As part of this work, specific basic curation for data refinements have been
performed: we had to identify instances to link two databases or a couple of
events within the databases (see fig. 1). For example, MUTATION and GENE
EXPRESSION data in COSMIC could be linked to GENE NAME but CNV
had SAMPLE IDs as expected. Later we used SAMPLE IDs after first iteration
with GENE NAME. Also, chr:start end position and GENE NAME were used
to link COSMIC and TCGA (see green arrows in Figure 1). The RDFized
version (see section 3.1) has kept this redundancy problem to have FDR rate as
low as possible.

3   Methodology & Architecture
The annotation architecture is summarized in Figure 2 showing all three major
components. First, the RDFization component that generates Linked Data from
the TCGA and COSMIC databases leading to several SPARQL endpoints for
public use. Second, the linking component that searches and discovers correspon-
dences between selected datasets (TCGA variants, COSMIC diff expression,
COSMIC Mutation, etc.). The links discovered by this component have an
effect on the efficiency in the source selection, on the query planning, and on
the overall query execution in a decentralized setting. Third, the scalable query
federation component: it a single-point-of-access through which distributed data
sources can be queried in concerto.
             Illumina Body Map 2.0
             Differentially expressed genes (DEG)
                                                                                        TCGA_variants


                  SPARQL Query                                                                                                                                COSMIC
                           SAFE Query Federation Engine
                                                                                                          Linked TCGA and COSMIC Repositories


                                                                                                                                                              GENE_NAME
                                                                                                                                                              Mutation _ID
                                                                                                                                                              Sample_ID
                  Source                                                                                                                                      Tumor_ID
                                                                                                                                                              Chr_start
                                                                                 COSMIC_diff_expression


                 Selection                                                                                                                                     Chr_end
                                                                                                                                                               TCGA_ID
                                                  Data
                                                Summaries
                                                                                                                                                              GENE_EXPRESSION
                                                                                                                                                RDFization


                                                                                                                                                             COMPLETE_MUTATION

                   Query
                                                                        COSMIC_Mutation


                                                                                                                                                              CNV(COPY NUMBER)

                  Planning
                                                                                                                                                              TCGA_ID
                                                                                                                                                              Chr_end
                                                                                                                                                              Chr_start
                                                  Access
                                                              TCGA_protein_expression


                                               Policy Model                                                                                                   Variant_type
                                                                                                                                                              Ref_Protien
                   Query                                                                                                                                      Expression
                 Execution
                                                                                                                                                                 TCGA


                       Results


         Fig. 2: Linking and functional annotation of gene expression data
    The scalable query federation is based on the SPARQL query federation en-
gine called SAFE [7], which has been developed for accessing distributed clinical
trial repositories. SAFE has been adapted to improve the efficient integration of
data from the different TCGA and COSMIC SPARQL endpoints. More specifi-
cally, SAFE makes use of a favourable distribution of data to reduce the number
of sources required for processing federated SPARQL queries (without compro-
mising recall). This approach is based on the principle that integrated data
sources allow querying of multiple data sources in a single search, independently
of their status being distributed or centralized, whereas traditional methods of
data integration rather map the data models to a single, unified, model. Such
methods tend to resolve syntactic differences between models, but do not ad-
dress possible inconsistencies in the concepts defined in those models. Semantic
integration resolves the syntactic heterogeneity present in multiple data models
as well as the semantic heterogeneity among similar concepts across those data
models.
3.1      RDFization

COSMIC raw data files are delivered as tab separated text (tsv) and are being
processed with the COSMIC RDFizer tool that generates the N3 triples for
the SPARQL endpoint and statistical information related to the data. Only
three types of data have been included, i.e., gene expression data, gene mutation
and CNV data. Table 1 shows the overall statistics of the RDFization: row 1
represents for the COSMIC gene expression data the number of records (column
2), it’s size (column 4), the corresponding triples generated (column 2) and again
it’s size. The other two rows represent the same type of data for the COSMIC
gene mutation and CNV data. A total of 154 million records has been RDFized,
producing approximately 1.2 billion triples. Row 5 represents the statistics for
the RDF version of TCGA-OV (TCGA Ovarian), which forms a subset of the
linked TCGA4 data. The RDF file for the COSMIC data can be made available
inline with the COSMIC data policy.

                                 Table 1: COSMIC data statistics
                   No.   Data   Records Triples Original Size RDFized Size
                    1 COSMIC GE 149 M 1185 M       7.5 GB        20 GB
                    2 COSMIC GM 3.7 M    84 M     916 MB        1.44 GB
                    3 COSMIC CNV 0.9 M   11 M      82 MB        161 MB
                    4    Total   154M 1.28 B      8.5 GB         22 GB
                    5  TCGA-OV    18M   100 M     2.15 GB         5 GB


3.2      COSMIC and TCGA linking

The main integration is based on owl:sameAs constructs as can be seen in the
listing 1.1 where two COSMIC sample ids have been identified as being identical
to two TCGA patient bar code ids. These links are at the core of facilitating
data integration and the data analysis tasks.

                     Listing 1.1: COSMIC and TCGA Linking Example
<Link−1>
  <S o u r c e>COSMIC</S o u r c e>
  <Target>TCGA−OV</Target>
  <l i n k >
   h t t p : / / c o s m i c . s e l s . i n s i g h t . o r g / schema / ID Sample /TCGA−13−0920
   owl : sameAs
   h t t p : / / t c g a . d e r i . i e /TCGA−13−0920
  </ l i n k >
  </Link−1>
<Link−2>
  <S o u r c e>COSMIC</S o u r c e>
  <Target>TCGA−OV</Target>

    4
        http://tcga.deri.ie/
  <l i n k >
   h t t p : / / c o s m i c . s e l s . i n s i g h t . o r g / schema / ID Sample /TCGA−24−1850
   owl : sameAs
   h t t p : / / t c g a . d e r i . i e /TCGA−24−1850
   </ l i n k >
</Link−2>


3.3     Scalable query federation
SAFE has been developed for accessing sensitive clinical data in data cubes at
different locations [7]. Two main changes have been introduced to SAFE for
efficiently querying the TCGA and COSMIC SPARQL endpoints. First, stan-
dardize RDF query representation: in the initial versions, SAFE issues queries
for statistical clinical information stored within distinct names graphs for RDF
data cubes [3]. Therefore, the internal query processing (i.e., source selection,
query planning, query execution) had to be adapted to query the regular RD-
Fized versions of the TCGA and COSMIC repositories. Second, access control
had to be disabled: SAFE imposes restrictions for data-access as a feature (de-
fined as Access Policy Model [7]) while federating queries over multiple clinical
site, i.e. imposing the data restrictions for different data repositories. Since, ex-
periments conducted in this paper mainly involve public repositories this feature
has been disabled.
     The listing 1.2 shows a sample SPARQL query, which federates across COS-
MIC and TCGA data asking for genomic loci of a mutated gene by chro-
mosome start points which then returns the disease metastasis information
along with the mutation type. Answering such a query requires the integra-
tion of COSMIC with TCGA and merging results from both TCGA and COS-
MIC, and thus has to make use of query federation. The results for the first
four triples in the given query (i.e. cosmic-s:ID Sample, cosmic-s:Gene Name,
cosmic-s:Chrom start) are fetched from COSMIC and the results for the next
three triples (i.e. tcga:tcga id, tcga:start) are fetched from TCGA. To pro-
duce the required information, both results are merged on the basis of the last
triple which integrates COSMIC with TCGA. Sample results for this query can
be seen in Figure 5.
                              Listing 1.2: Federated SPARQL Query
PREFIX c o s m i c−s : <h t t p : / / c o s m i c . s e l s . i n s i g h t . o r g / schema/>
PREFIX t c g a : <h t t p : / / t c g a . d e r i . i e />
PREFIX owl : <h t t p : / /www. w3 . o r g / 2 0 0 2 / 0 7 / owl#>
SELECT ∗ WHERE {
  ? c o s m i c r e s u l t a c o s m i c−s : r e s u l t ;
      c o s m i c−s : ID Sample ? i d s a m p l e ; c o s m i c−s : Gene Name ? g e n e ;
      c o s m i c−s : c h r o m s t a r t ? c h r o m s c o s m i c .
  ? t c g a r e s u l t a tcga : r e s u l t ;
      tcga : Id ? t c g a i d ; tcga : S t a r t ? t c g a c h r o m s t a r t .
  ? i d s a m p l e owl : sameAs ? t c g a i d .
}


4     Biological questions and annotation results from HBM
We analysed all genes have an RPKM > 0.5 [8] and that are differentially ex-
pressed in all tissue types. Figure 3 [2] is a schematic representation which
satisfies all mentioned conditions and delivers 99 genes per query. We have iden-
tified potential cancer types based on gene patterns for different tissues and
further helped to understand the behaviour of most amplified cancer types.


                                                                                                                                                                                            CCL21
                                                                                                                                                                                            JCHAIN
                                                                                                                                                                                            IGHA1
                                                                                                                                                                                            IGHM
                                                                                                                                                                                            IGKV1−5
                                                                                                                                                                                            IGHV1−2
                                                                                                                                                                                            IGKV3−20
                                                                                                                                                                                            IGLV3−25
                                                                                                                                                                                            IGHG1
                                                                                                                                                                                            IGKV4−1
                                                                                                                                                                                            IGHG2
                                                                                                                                                                                            IGHV3−23
                                                                                                                                                                                            IGKC
                                                                                                                                                                                            IGLV3−19
                                                                                                                                                                                            IGLC2
                                                                                                                                                                                            IGLC3
                                                                                                                                                                                            IGFBP4
                                                                                                                                                                                            RPS11
                                                                                                                                                                                            RPS27
                                                                                                                                                                                            MYL9
                                                                                                                                                                                            SEMG1
                                                                                                                                                                                            TXNIP
                                                                                                                                                                                            FABP4
                                                                                                                                                                                            EEF1A1
                                                                                                                                                                                            ACTB
                                                                                                                                                                                            RPS12
                                                                                                                                                                                            S100A8
                                                                                                                                                                                            S100A9
                                                                                                                                                                                            LYZ
                                                                                                                                                                                            CD74
                                                                                                                                                                                            HLA−DRA
                                                                                                                                                                                            TMSB10
                                                                                                                                                                                            SRGN
                                                                                                                                                                                            TMSB4X
                                                                                                                                                                                            B2M
                                                                                                                                                                                            FTL
                                                                                                                                                                                            HLA−E
                                                                                                                                                                                            SCGB1A1
                                                                                                                                                                                            HBB
                                                                                                                                                                                            SPP1
                                                                                                                                                                                            GPX3
                                                                                                                                                                                            MALAT1
                                                                                                                                                                                            MT−ND6
                                                                                                                                                                                            MTATP6P1
                                                                                                                                                                                            MT−ATP8
                                                                                                                                                                                            MT−ND4L
                                                                                                                                                                                            MT−CO1
                                                                                                                                                                                            MT−ND5
                                                                                                                                                                                            MB
                                                                                                                                                                                            FABP3
                                                                                                                                                                                            MYL2
                                                                                                                                                                                            MYH7
                                                                                                                                                                                            ACTC1
                                                                                                                                                                                            MYL3
                                                                                                                                                                                            PLN
                                                                                                                                                                                            MT−ATP6
                                                                                                                                                                                            MT−CYB
                                                                                                                                                                                            MT−ND3
                                                                                                                                                                                            MT−CO3
                                                                                                                                                                                            MT−ND4
                                                                                                                                                                                            MT−CO2
                                                                                                                                                                                            MT−ND2
                                                                                                                                                                                            MT−ND1
                                                                                                                                                                                            MT−RNR2
                                                                                                                                                                                            MT−RNR1
                                                                                                                                                                                            ACTA1
                                                                                                                                                                                            GAPDH
                                                                                                                                                                                            CKM
                                                                                                                                                                                            MYBPC1
                                                                                                                                                                                            KLHL41
                                                                                                                                                                                            MT−TP
                                                                                                                                                                                            PDK4
                                                                                                                                                                                            DES
                                                                                                                                                                                            TNP1
                                                                                                                                                                                            PRM2
                                                                                                                                                                                            PRM1
                                                                                                                                                                                            TG
                                                                                                                                                                                            SCD
                                                                                                                                                                                            C3
                                                                                                                                                                                            RBP4
                                                                                                                                                                                            SAA1
                                                                                                                                                                                            AGT
                                                                                                                                                                                            APOA1
                                                                                                                                                                                            SERPINA1
                                                                                                                                                                                            SAA2
                                                                                                                                                                                            FGG
                                                                                                                                                                                            FGB
                                                                                                                                                                                            HP
                                                                                                                                                                                            ORM2
                                                                                                                                                                                            ORM1
                                                                                                                                                                                            CRP
                                                                                                                                                                                            APOA2
                                                                                                                                                                                            FGL1
                                                                                                                                                                                            GC
                                                                                                                                                                                            APOH
                                                                                                                                                                                            APOC3
                                                                                                                                                                                            FGA
                                                                                                                                                                                            ALB
                                                                                                                                                                                            AMBP
                                                                                                                                                                                            APCS
                              thyroid


                                        ovary


                                                prostate


                                                           testis


                                                                    brain


                                                                            breast


                                                                                     colon


                                                                                             adipose


                                                                                                       kidney


                                                                                                                adrenal


                                                                                                                          leukocyte


                                                                                                                                      lung


                                                                                                                                             lymph node


                                                                                                                                                          skeletal muscle


                                                                                                                                                                            heart


         Fig. 3: List of Genes expressed in all tissues and highly expressed.                                                                                                       liver


5   Results
The overall goal of this study is to understand the relevance of mutations and
genes along with their associated expression levels measured in data sets for
normal tissues e.g. HBM 2.0, and then evaluate (e.g., query) them against the
mutations retrieved and linked from somatic and patient specific data e.g COS-
MIC, TCGA. Further focus of this work is put to the linked annotations where
a single query can retrieve all other possibly relevant annotations.
    Initially, we have sampled 99 genes that are highly expressed in all 16 tis-
sues as shown in Figure 3 to retrieve their CNV, mutation and gene expression
annotations from cBIO portal (for TCGA) and CNV annotator (for COSMIC)
to determine current state of the art and provide a baseline comparison for
the proposed linked annotation solution. The results for TCGA 4 clearly indi-
cate an elevated distribution of these genes in uterine and ovarian cancer with
a large number of mutations and CNVs. As an outcome, ovarian cancer has
been selected as a good candidate for further investigation due to its elevated
amplification rate and its multiple repetition in different experiments. Further
studies have been retrieved that were conducted to understand the somatic rel-
evance and the loci(genomic position) of these genes, and further detailed in-
formation for mutations could be retrieved as well. This study demonstrated
a focus to genes such as ACTC1, B2M, CRP, FABP3, FABP4, FGA, FGB,
GC, MYH7, RPRH2, SLC26A3, TG, TXNIP which form most relevant driver
genes transforming healthy human tissues into ovarian cancer. The same 99
genes have been queried against 99 genes from HBM 2.0 to get the results for
the somatic mutations in cancer. This repeat annotation will not only provide
detailed statistics reported in COSMIC but also a validation for out earlier ex-
periments. Table 2 clearly indicates that locations chr1,chr4,chr14 and genes
CRP,FGA,FABP3,MYH7 could be potential genes with high relevance for
the development of ovarian cancer.
                                 80%


                                 70%
        Alteration frequency


                                 60%


                                 50%


                                 40%


                                 30%


                                 20%


                                 10%


                                  0%
                    Cancer type

        Mutation data                           +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   -   +   +   +   +   -   -   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +

                               CNV data         +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +   +
                                                                    (TC )
                                                                 a (T A)
                                                                           A)

                                                                           A)

                                                                 o (T b)
                                                                           A)

                                                                 u (T b)
                                                                           A)

                                                                             )
                                                                            b)

                                                                    (TC )
                                                                           A)

                                                                 k (T b)
                                                                           A)

                                                                  t (T b)

                                                                    (TC )
                                                                             )

                                                                 e (T b)
                                                                             )

                                                                    (TC )
                                                                 s (T A)
                                                                           A)

                                                                           A)

                                                                    (TC )
                                                                             )
                                                                           A)

                                                                              )

                                                                    (TC )
                                                                 a (T A)

                                                                    (TC )

                                                                    (TC )

                                                                    (TC )
                                                                             )

                                                                    (TC )
                                                                             )

                                                                    (TC )

                                                                    (TC )
                                                                             )
                                                                            b)
                                                                           A)

                                                                 L (T b)
                                                                           A)

                                                                             )
                                                                           A


                                                                        GA


                                                                            b


                                                                           A
                                                                        GA


                                                            rea CGA

                                                                           A


                                                                         ub
                                                                        GA


                                                                          15

                                                                          ub


                                                                           A
                                                                        GA

                                                                        GA

                                                                        GA

                                                                         13
                                                                        GA

                                                                          ub
                                                                        GA

                                                                        GA


                                                                         08
                                                                        pu


                                                                        pu


                                                                        pu

                                                                        pu


                                                                        pu


                                                                          u


                                                                          u


                                                                        pu


                                                                         u
                                                                       CG

                                                                        G
                                                                       CG

                                                                       CG


                                                                       CG


                                                                       CG


                                                                        G


                                                                       CG


                                                                       CG


                                                                       CG

                                                                        G
                                                                       CG

                                                                       CG


                                                                       CG


                                                                        G
                                                                       CG


                                                                       CG


                                                                      CG
                                                                     Ap


                                                       Ute GA p


                                                                     Ap


                                                                      20
                                                                     Ap


                                                                      20


                                                                        p


                                                                     Ap


                                                                      20
                                                                    (TC
                                                                   GA


                                                                   GA


                                                                   GA

                                                     Sto CGA


                                                                   GA


                                                                   GA


                                                                   GA
                                        S (T


                                                                    (T


                                                                 s (T


                                                                    (T


                                                                  l (T


                                                     AM id (T
                                                                 GA


                                                                 GA


                                                                 GA
                                                                CG


                                                                CG


                                                                CG


                                                                CG
                                                               an


                                                               er
                                                             (TC


                                                                C


                                                               er
                                                             (TC


                                                               ch
                                                             (TC


                                                 Ute oma

                                                                C


                                                                al


                                                              BC


                                                                te


                                                                C


                                                                M

                                                                C
                                                              CC


                                                                C
                                                             (TC

                                                              PG

                                                                C
                                                             (TC
                                                             cta
                                                              as
                                                             ec


                                                             gu
                                                            om


                                                             en


                                                             sq


                                                             rin


                                                           (TC


                                                            om


                                                           (TC


                                                            AM
                                                           (TC
                                      eC


                                                          o (T


                                                             (T


                                                          k (T


                                                          e (T


                                                           t (T


                                                           l (T

                                                           RC


                                                           AC


                                                           RC


                                                           RC


                                                          L (T
                                                          rvic


                                                           yro
                                                           sta


                                                          GB
                                                            dd


                                                           Liv


                                                          ma
                                                           ari


                                                         Bre


                                                         DL


                                                         pR


                                                         PC
                                                        &n


                                                       lore
                                                         ha
                                                         ad


                                                        ng
                                                       lan


                                                         er


                                                        an

                                                        ch


                                                          u


                                                         rc


                                                        nc


                                                        Gli


                                                         C


                                                         id
                                                       Bla


                                                     Pro


                                                      cta

                                                        cc


                                                       ch


                                                       ch


                                                       Th
                                  rin
                                           Ov


                                                       as
                                                      Ce
                                                      ec
                                                      en


                                                      sq


                                                      rin


                                                       te


                                                       M


                                                       M
                                                    RC
                                                     op


                                                    yro
                                                     Sa
                                                     dd


                                                    Lu


                                                    ma


                                                    Pa
                                                    ng


                                                    ari


                                                   Co
                                                   Me


                                                  Bre
                                Ute


                                                   ad


                                                   sta


                                                  GB


                                                  GB
                                                  &n


                                                 lore
                                                  ad


                                                  ng


                                                  Es
                                                Bla


                                                 cc


                                                 Th
                                                 Lu


                                                Ov
                                               Sto


                                                He


                                              Pro
                                               Lu
                                               ng


                                             Co
                                              ad
                                            Lu


                                           He


                                                                                Mutation                Deletion            Amplification           Multiple alterations


                                                            Fig. 4: TCGA query output from cBIO Portal[5]
   We now query the same 99 genes from before using the linked annotation
mechanism. The result snippet for ovarian cancer is depicted in Figure 5 with
detailed functional annotations together with the TCGA ids, again for ovarian
cancer.
   cs:result cs:Sample_Name c:TCGA-13-0920-01;
    cs:Gene_Name c:MYH7;
                                                                                                                                        ts:result ts:bcr_patient_barcode t:TCGA-13-0920;
    cs:Regulation "over".
                                                                                                                                        ts:beta_value "0.0419..";
    cs:chr_no c:1;
                                                                                                                                        ts:chromosome t:14;
    cs:chrom_start_m c:23418303;
                                                                                                                                        ts:chromosome t:1;
    cs:chrom_stop_m c:23418303;
                                                                                                                                        ts:start t:1288070;
    cs:chr_no_m c:14;
                                                                                                                                        ts:stop t:1293914;
    cs:mut_type "GAIN";
                                                                                                                                        ts:scaled_estimate "773.555".
   cs:Primary_Site c:ovary;
   cs:Primary_Histology c:carcinoma;

                                                            Fig. 5: Linked annotations for MYH7 - COSMIC
    Figure 5 represents the COSMIC and TCGA annotations, respectively.
MYH7 corresponds to chr-1 which is evident from previous annotations and
replicated again in our study along with its TCGA ID:TCGA-13-0920-01. Its
mutation type is primarily the GAIN type of a mutation for chr1 and chr14 which
is a dominant mutation with all its regulation of over, under and normally ex-
pressed. Translational researchers may want to repeat and re-validate the study
for Pubmed ID:1398522 additionally with beta value (measure of methylation)
of 0.041999536 and scaled estimation (Tumour purity) of 773.555 also supports
this gene from the epigenetic point of view. Further multiple genomic locations
will help clinical practitioners to track CNV for the targeted study and which
ultimately leads the direction towards a better prognosis.
Table 2: loci information for highly expressed gene in ovarian cancer from HBM 2.0
    CHROMOSOME               Mutation Type Pubmed GENES
    chr1:246407146-246740944 CNV         17122850 CNST, LOC255654, SMYD3, TFB2M
    chr1:159683086-159684133 Loss        20364138 CRP
    chr1:31842502-31849609   Deletion    20482838 FABP3
    chr1:31842502-31849609   Deletion    20482838 FABP3
    chr4:155506556-155506859 Insertion   20981092 FGA
    chr14:23857082-23886607 Loss         20981092 MIR208A, MYH6, MYH7
    chr14:23857092-23886486 Loss         20981092 MIR208A, MYH6, MYH7


6    Related Work
Kandoth et al. [6] performed a cancer study with 12 cancer types to enable
logical classifications for the large amoung of data generated by TCGA and
ICGC. Saleem et. al. [11] have covered TCGA database with few cancer types
and for a limited number of patient data.
     Likewise a reduced version of the COSMIC database has been RDFized to
explore on the mechanism of TP53 [13] further for CNV explains the linked
infrastructure to annotated CNV. The federation platform [10] called “TopFed”
is being developed to measure the query execution time on TCGA data set, which
then has been further extended to cover the biological outcomes identified from
Medline abstracts [9]. Our work covers all CNVs, mutations and gene expression
data and has been extended with TCGA for the same type of data, thus forming
a proof of concept for an annotation platform that covers comprehensive linked
life science data. It is important to note that the work presented in this paper
is a preliminary approach for transforming COSMIC into the RDF format and
link it with the TCGA datasets.

7    Conclusion
In this paper we have presented a linked data infrastructure for functional an-
notation which enables querying different types of mutations and genomic al-
terations to contribute to molecular and clinical insights of cancer by defining
most relevant variants and their prioritization. This knowledge could be highly
advantageous for a targeted therapy and personalized medicine based on gene
expression data. The presented experiments are based on TCGA, COSMIC and
HBM 2.0 datasets and have been used to identify sets of genes with relevance for
ovarian cancer and with comprehensive set of mutations. Similar studies have
to be performed for other cancer types. We have covered CNV, gene expression
and mutation data from COSMIC and TCGA (only for ovarian cancer). We
have processed 1.2 billion COSMIC triples and 100 million TCGA triples which
in turn generated 27 GB of data. In future, this work will be expanded to cover
level 1, 2 and 3 along with other datasets from COSMIC to provide in-depth
biological insight for each queried gene.
8    Acknowledgment

This publication has emanated from research supported by the research grant
from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.
References
 1. Y. W. Asmann, B. M. Necela, K. R. Kalari, A. Hossain, T. R. Baker, J. M. Carr,
    C. Davis, J. E. Getz, G. Hostetter, X. Li, et al. Detection of redundant fusion
    transcripts as biomarkers or disease-specific therapeutic targets in breast cancer.
    Cancer research, 72(8):1921–1928, 2012.
 2. E. bioinformatics institute. Illumina body map 2.0 european bioinformatics insti-
    tute.
 3. J. J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, provenance and
    trust. In WWW, pages 613–622, 2007.
 4. J. J. Crowley, V. Zhabotynsky, W. Sun, S. Huang, I. K. Pakatci, Y. Kim, J. R.
    Wang, A. P. Morgan, J. D. Calaway, D. L. Aylor, et al. Analyses of allele-specific
    gene expression in highly divergent mouse crosses identifies pervasive allelic imbal-
    ance. Nature genetics, 2015.
 5. J. Gao, B. A. Aksoy, U. Dogrusoz, G. Dresdner, B. Gross, S. O. Sumer, Y. Sun,
    A. Jacobsen, R. Sinha, E. Larsson, et al. Integrative analysis of complex cancer
    genomics and clinical profiles using the cbioportal. Science signaling, 6(269):pl1–
    pl1, 2013.
 6. C. Kandoth, M. D. McLellan, F. Vandin, K. Ye, B. Niu, C. Lu, M. Xie, Q. Zhang,
    J. F. McMichael, M. A. Wyczalkowski, et al. Mutational landscape and significance
    across 12 major cancer types. Nature, 502(7471):333–339, 2013.
 7. Y. Khan, M. Saleem, A. Iqbal, M. Mehdi, A. Hogan, A. N. Ngomo, S. Decker, and
    R. Sahay. SAFE: policy aware SPARQL query federation over RDF data cubes.
    In Proceedings of the 7th International Workshop on Semantic Web Applications
    and Tools for Life Sciences, Berlin, Germany, December 9-11, 2014., 2014.
 8. D. Ramskold, E. T. Wang, C. B. Burge, and R. Sandberg. An abundance of
    ubiquitously expressed genes revealed by tissue transcriptome sequence data. PLoS
    Comput Biol, 5(12):e1000598, 2009.
 9. M. Saleem, M. R. Kamdar, A. Iqbal, S. Sampath, H. F. Deus, and A.-C. N. Ngomo.
    Big linked cancer data: Integrating linked tcga and pubmed. Web Semantics:
    Science, Services and Agents on the World Wide Web, 27:34–41, 2014.
10. M. Saleem, M. R. Kamdar, A. Iqbal, S. Sampath, H. F. Deus, and A.-C. Ngonga.
    Fostering serendipity through big linked data. Semantic Web Challenge at ISWC,
    2013.
11. M. Saleem, S. S. Padmanabhuni, A.-C. N. Ngomo, J. S. Almeida, S. Decker, and
    H. F. Deus. Linked cancer genome atlas database. In Proceedings of the 9th
    International Conference on Semantic Systems, pages 129–134. ACM, 2013.
12. J. Xuan, Y. Yu, T. Qing, L. Guo, and L. Shi. Next-generation sequencing in the
    clinic: promises and challenges. Cancer letters, 340(2):284–295, 2013.
13. A. Zappa, A. Splendiani, and P. Romano. Towards linked open gene mutations
    data. BMC bioinformatics, 13(Suppl 4):S7, 2012.

</pre>