Finding novel associations across domains using linked data: a case study on genetic variants disrupting transcription start sites Eelke van der Horst*1, Rajaram Kaliyaperumal*1, Zuotian Tatum1, Mark Thompson1, Erik Schultes1,2, Eleni Mina1, Ivo F.A.C. Fokkema1, Johan T. den Dunnen1, Marco Roos1, Jeroen F.J. Laros1, Barend Mons1, Kristina M. Hettne##1, Peter A.C. ‘t Hoen##1 1 Leiden University Medical Center, Leiden, the Netherlands 2 Leiden Institute of Advanced Computer Science, Leiden, the Netherlands eelkevanderhorst@gmail.com, r.kaliyaperumal@lumc.nl, ztatum@live.com, {m.thompson, E.A.Schultes, E.Mina, I.F.A.C.Fokkema}@lumc.nl, ddunnen@humgen.nl, {m.roos, J.F.J.Laros, B.Mons, k.m.hettne, P.A.C._t_Hoen}@lumc.nl ## shared last author Abstract. With the widespread use of Next Generation Sequencing technologies, the primary bottleneck of genetic research has shifted from data production to data analysis. However, heterogeneous data sets makes comparisons and integration challenging and time consuming. Here, we apply a data interoperability approach that provides unambiguous (machine readable) description of genomic annotations based on nanopublications. We show that novel associations can be discovered by performing queries that span genomic annotations from the FANTOM5 consortium (catalogue of transcription start site(s) (TSS)) and an instance of the Leiden Open Variation Database containing population-based variant frequency data. A correlation was established between the tissue specificity of a TSS and the frequency and size of genetic variants in the genomic region covering the TSS: TSS that are used in many tissues are less tolerant to disruption by insertions and deletions than those specific to just a few tissues. Keywords: Linked Data, Nanopublication, Transcription start sites, Genetic variants 1 Introduction Next Generation Sequencing (NGS) technologies produce huge amounts of data on genetic variation between organisms and between individuals. NGS data are reported and annotated in different formats that reflect the provenance and use case of the data. For example, RefSeq, Ensembl, and dbSNP each have their own annotation format. Feature annotations (such as genes, transcripts, transcription start sites) can be downloaded as BED files, while variants are usually exported to Variant Call Format (VCF) format and subsequently stored in dedicated databases such as Locus Specific Databases (LSDB). One way to approach the interoperability of genomic data is to homogenize the data before analysis. This often requires downloading data from multiple sources into a central location and then standardizing the data structure (e.g. by importing into a fixed schema relational database). However, there are several limitations to this approach. First, data in the centralized warehouse may not be kept up to date. The process of downloading and normalizing data is usually time consuming and error prone, especially if the data structure from the source changes over time. Secondly, once meaningful results are derived from such a central data warehouse, it may not be possible to trace back the original data used to generate the results. If the large and ever growing amounts of (NGS) data are to be transformed into actionable information for the biologist, they must be represented in a machine readable format that permits automatic interoperation and in silico reasoning even when the data reflect diverse and heterogeneous annotations. To enhance interoperability of diverse and distributed datasets, we introduced a framework to expose genomic annotation data as linked data with nanopublication as the schema backbone. A nanopublication is the smallest unit of publishable information: an assertion with provenance information and metadata about the nanopublication. Nanopublications build upon Linked Data, which provides a suitable platform for the exchange and integration of data [1, 2]. Nanopublications can be cited, offering an incentive for authors to publish their data in interoperable format [3, 4]. Patrinos et al. proposed nanopublications to encourage attribution and dissemination of genetic variant data [3]. Subsequently, we demonstrated how to model and publish genomic annotations, e.g. transcription start site(s) (TSS) from the FANTOM5 project as interoperable nanopublications [5]. In this paper, we demonstrate the power of the improved linking of heterogeneous and distributed variant and variant annotation data. We show how nanopublications can lead to novel discoveries through querying of different linked data resources and how shared ontological models allows can be exploited. In addition, the RDF [8, 9] query language SPARQL [10] allows queries over hierarchical graphs (subsumption queries) such as the FANTOM5 sample ontology. We demonstrate how federated queries across these three linked data sets (of which one is hierarchical) can lead to novel discoveries. For this, we converted variant data from the Leiden Open Variant Database (LOVD), which is a platform for storing high quality, curated variant data [6] adopted by more than 95% of the gene variant databases, into a linked data resource. To better understand the effect of a variant, annotation about the variant region is required. LOVD currently uses reference transcript annotations to determine if a variant is potentially affecting the sequence of transcripts. However, the FANTOM5 greatly extended the catalogue of transcripts and (tissue-specific) TSS. Where previous studies mainly focused at variants affecting the coding exons, we were interested in variants affecting TSS. In particular larger insertions and deletions overlapping TSS may affect transcription levels or the sequence of the 5'-end of transcripts. To understand the relationship between the tissue specificity (FANTOM5, sample ontology) of a TSS and the observed disruption of its genomic region by variants (LOVD), we converted population-based genetic variant data stored in LOVD to nanopublications. We test the hypothesis that the genomic regions of tissue specific TSS are less conserved than those of ubiquitous TSS and therefore contain more and larger genetic variants. The number of tissues in which a TSS controls gene expression indicates the number of sites that may be affected by TSS disruption. Disruption of TSS that exert regulatory control at many sites have a higher probability of causing a detrimental effect than TSS that control expression in only a few tissues. 2 Materials and Methods 2.1 Modelling LOVD. To publish LOVD variant data as interoperable nanopublications, attributes of the variant entity were mapped to a semantic representation. Attributes describing the variant type (insertion, deletion, and substitution), Human Genome Variation Society (HGVS) variant description, and genomic position were selected for conversion, as well as allele frequencies in European American and African American population. A variant was modeled as a genomic annotation that maps to a region on a reference sequence, and was expressed using the reference sequence annotation (RSA)[7] ontology. Apart from covering most of the terms and the predicates to represent the genomic annotation, the ontology also gives us flexibility to represent an annotation across different reference sequence builds. The Sequence Ontology (SO)[8] class was used for representing the variant and the Genome Assemblies [7] to reference an instance of a reference sequence, which maps to a version of the genome assembly. A schematic overview of the variant annotation model is shown in Figure 1. Example LOVD nanopublications can be viewed using the nanopublication viewer that is available at http://rdf.biosemantics.org. Fig. 1. The variant annotation model. The model states that a variant is observed at a genomic region, which has start and end position, and maps to a genomic component (chromosome) of a reference sequence. Instances of these classes (represented by a diamond shape) are the RDF resources from which a nanopublication assertion is constructed. Sequence ontology (SO), reference sequence annotation (RSA) ontology and human genome 19 reference sequence (HG19) RDF resource were used for this assertion model. FANTOM5. The modelling of FANTOM5 data as nanopublications is described in supplemental material of [5]. In short, the core FANTOM5 data was exposed as three sets of nanopublications. The first set reported the genomic region of Cap Analysis of Gene Expression (CAGE) clusters in terms of start/end position and orientation on a reference sequence. The second set expressed that a CAGE cluster was also a TSS region, and connected them, when possible, to a specific gene. A CAGE cluster is observed in tissue with an abundance measured in TPM (tags per million). A schematic overview of all three assertion models and their connection is provided in Figure 2. Note that the genomic region is the same model as the variant annotation assertions (see Figure 1). Property rsa:is_observation_of was added as an update to the RSA ontology. Fig. 2. Combined assertion models of the FANTOM5 nanopublications. Type 1 – cage cluster; type 2 – transcriptional start sites and gene associations; type 3 – cage peak level for muscle cell samples. Instances of these classes (represented by a diamond shape) are the RDF resources from which a nanopublication assertion is constructed. Sequence ontology (SO), reference sequence annotation (RSA) ontology and human genome 19 reference sequence (HG19) RDF resource were used for these assertions models. Conversion. The LOVD 3.0 whole genome dataset was converted into RDF. This set consists of variant data copied from the Exome Variant Server (EVS) and is available at: http://databases.lovd.nl/whole_genome/genes. The EVS is a publicly available collection of genetic variant data from the NHLBI GO Exome Sequencing Project [9], which aims to identify new genes and mechanisms to help diagnosis, management and treatment of heart, lung and blood disorders by screening (near) the protein coding regions of genomes of phenotypically diverse European American and African American individuals. The EVS data consists of a large number (~2M in LOVD) of mostly non-disease associated or recessive disease variants, and can be considered a representative sample of the variations that occur in or near protein coding regions in genomes of the two populations. Conversion performed using a custom script that fetched data directly from the underlying MySQL data dump, producing approximately 46 million triples. The database identifier of a variant consists of the HGNC gene symbol and number, and groups multiple observations of the variant together. The variant description at the DNA level was based on the genomic DNA reference sequence following HGVS recommendations. FANTOM5 nanopublications were generated as described previously [5 - Supplementary Note 1], using the phase 1.3 release and phase 1 prerelease version of the sample ontology. The scripts used for converting these datasets into nanopublications are publicly available and can be downloaded at https://git.lumc.nl/biosemantics/fantom5-lovd-data-integration. Sample classification. Samples are systematically classified using the FANTOM Five (FF) Sample Ontology, which aggregates other ontologies such as the cell ontology [10] for cell types and Uberon [11] for anatomical systems. The sample ontology is a class hierarchy that was used to annotate experimental artifacts of FANTOM5, such as origin and anatomical location of tissue and cell samples. In this study, we focused on experimental results from adult human tissues, excluding experimental data from (carcinoma) cell-lines, developmental. Tissue-specific sample retrieval. To retrieve the samples we used SPARQL to query the FANTOM5 sample ontology, making use of the hierarchical structure of the ontology to select samples that fall under classes that are both a subclass of "human adult sample" and "adult tissue sample" class. We defined expression as follows: a TSS is expressed in a certain tissue if it has TPM>0 in at least one sample belonging to that tissue type. The SPARQL query can be accessed via the link https://git.lumc.nl/biosemantics/fantom5-lovd-data- integration/blob/master/rscript/sparql_queries/get_tissue_specificity_of_tss.sparql Overlap calculation. A SPARQL query was used to find overlaps between variants and TSS regions. For insertion, deletion and duplication variants, the fraction of a variant or the whole variant should overlap with the TSS region. For the substitution variants the variant should occur within the TSS region. This also includes start and end positions of the TSS. The SPARQL query can be accessed via the link https://git.lumc.nl/biosemantics/fantom5-lovd-data- integration/blob/master/rscript/sparql_queries/get_variants_overlapping_with_tss.spar ql Variant disruption calculation. Any type of variant may disrupt a TSS depending on the extent of overlap between them. Disruptions due to deletions, insertions and substitutions are calculated using the formulas given in equations 1-3. For deletions, the number of base pairs of the variant overlapping with the TSS, was divided by the TSS length (number of base pairs) to get a disruption value (equation 1). For insertions, the length of the variant is divided by TSS length (equation 2). Single nucleotide substitutions are less likely to disrupt a TSS. Still, we were interested to use this most common class of genetic variation to estimate the level of sequence conservation in the TSS, as a measure for evolutionary constraint. For this, the number of unique substitutions in the EVS database per TSS was divided by the TSS length (equation 3). Deletion disruption = (variant overlapping length) / (TSS length) (1) Insertion disruption = (variant length) / (TSS length) (2) Substitution rate = (number of unique substitutions in the TSS) / (TSS length) (3) 3 Results and discussion In this work, we focused on the genomic annotations of variants and exposed this information as nanopublications. Population-wide genomic variant data from the EVS project were preprocessed and stored in an installation of LOVD 3.0 was converted to nanopublications consisting of the following type of assertions: a variant of a certain type (insertion, deletion, substitution) exists at a defined genomic position, i.e. has a specific start and end position on a specific chromosome. We used EVS data from LOVD and not the original source so that we could make use of preprocessed EVS data from LOVD. CAGE clusters and inferred TSS from FANTOM5 (184.827 CAGE clusters and 82.094 TSS) and EVS variants nanopublications were made available as a static RDF files dump (http://beehub.nl/biosemantics/fantom5-evs-integration- resources/) through a public repository that was supported by the Dutch national e- infrastructure. To demonstrate the automatic interoperation between these two datasets, we made federated queries to find variants that overlap with TSS. Samples were limited to human adult tissues, resulting in 129 tissue samples, spanning 57 tissue types. Among these, the CAGE cluster expression (TPM value > 0) is observed in 111 tissue samples and these tissue samples span 44 tissue types (according to the FANTOM5 ontology). There are only 4,073 genes with a single TSS, compared to 14,866 with more than one TSS. The tissue specificity of TSS is displayed in Figure 3. The majority of TSS are used in multiple tissues, and it is difficult to draw a line between tissue-specific and ubiquitous TSS. Even more so, because there are many tissue samples with TPM value between 0 and 1 and any cut-off on expression levels is arbitrary. For this paper, we chose to divide TSS in two equal bins, where the 50% TSS observed in the lowest number of tissues are the tissue-specific TSS and the 50% TSS observed in the highest number of tissues are the ubiquitous TSS. Fig. 3. Tissue specificity of transcription start sites. The x-axis shows the number of tissue types in adult human tissue in which a TSS is expressed. The y-axis represents the number of TSS with TPM > 0 in at least one sample of a tissue type. The maximum number of tissue types is 44. The peak at 44 is a bit artefactual because it is more likely that a TSS is present in all tissues studied than in exactly 43 tissues. The set of 1,998,175 unique genetic variants in the LOVD whole genome instance of EVS consists of 33,646 insertions, 84,761 deletions, 6,875 duplications, and 1,872,893 substitutions. The length of deletion and insertion variants range from 1 to 102 and 1 to 61. From the set of 1,998,175 variants (EVS), 18,988 showed overlap with 9,949 TSS (Among them 9896 TSS had expression in adult human tissues). Of those variants there were 479 insertions, 1018 deletions, 72 duplications, and 17,419 substitutions. The maximum number of variants per TSS was 30, this TSS was associated with the RNU11 gene (EntrezGene ID 26824). Association of the fraction of TSS disrupted by insertions and deletions with tissue specificity. Larger insertions and deletion are the classes of variants that are likely most disruptive for TSS. We assumed that the size of the insertion or deletion was an important determinant for its disruptiveness and therefore evaluated the fraction of the total TSS region length overlapping with the insertion or deletion. When a TSS overlapped with multiple insertions and deletions, we only included the variant with the largest overlap. The TSS were ranked according to their tissue specificity and to the fraction of its length overlapping with insertions or deletions. The ordered TSS are divided at the 50th percentile, both for tissue specificity and for disruption. The resulting distributions and p-values according to the chi-square test are displayed in Figure 4. The value in each quarter of the figure represents the percentage of TSS distribution in each quarter. For both deletions (Figure 4A) and insertions (Figure 4B), we observed that tissue-specific TSS tolerate a significantly higher level of disruption (larger fraction deleted or larger insertions) than ubiquitous TSS. For single nucleotide substitutions, we concentrated on the number of unique positions with variants in EVS, normalized by the length of the TSS regions (Figure 4C). Here, we observed more substitutions in tissue-specific TSS than in ubiquitous TSS, indicative for a higher selective pressure on ubiquitous TSS indels. Figure 4. Tissue-specific TSS are more tolerant towards disruption by genetic variants than ubiquitous TSS. TSS are ranked according to the number of tissues in which the TSS is expressed (x-axis). The fraction of the length of the TSS region affected by genetic variants is plotted on the y-axis. The sub plots A, B and C represent the TSS disruptions caused by deletions, insertions and substitutions, respectively. The red horizontal and vertical lines in the graphs represent 50% of the ranked samples along each axis. The values in each quarter of the plot is the percentage of TSS distribution in that quarter. Chi-square statistics and their associated p-values are also indicated in each panel. In summary, variants in coding parts genes are widely studied. However, variants in regulatory domains may also affect gene function. By integrating FANTOM5 data on TSS and LOVD data with population-based variant frequencies, we were able to study variants in TSS. From an evolutionary point of view, important regulatory elements are more conserved. This translates to finding less disruption by variants when a TSS is more important. A TSS is considered more important when it is active in many tissues. Variants in these TSS disrupt a gene's function in many tissues, thus probably having a larger impact on the organism than variants in tissue-specific TSS. Similar studies with large whole genome sequencing studies should corroborate these findings as our current study is limited by the use of exome data that are not covering the full repertoire of TSS annotated in the FANTOM5 project. Future work. In the future we would like to integrate the FANTOM5 and EVS nanopublications with other genomic RDF resources. The clear challenge here would be integrating our nanopublications with other RDF resources where different ontologies were used to describe the genomic annotations. For example, we are aware that some major data providers such as the Database Center for Life Science in Japan are using the FALDO ontology instead of the RSA ontology to expose genomic annotation. The major challenge with mapping these ontologies would be to overcome the ontological groundings. Applying our approach to VCF files might is another topic for future research. One of the challenges here lies in the enormous number of triples that would be generated based on the VCF format. 4 Conclusions Genomic annotations are crucial for interpreting NGS data. Integrating genomic annotations from different sources is a time consuming and non-trivial task. In this study, we apply a framework for genomic annotations based on nanopublications. We demonstrated automatic interoperability between TSS from the FANTOM5 consortium and population-based variant frequency data from LOVD. We showed that novel associations can be discovered by performing queries that span the integrated domains. Specifically, a correlation was established between the tissue specificity of a TSS and the frequency and size of genetic variants in the genomic region covering the TSS: TSS that are expressed in many tissues are less tolerant to disruption by insertion and deletions than those expressed in only a few tissues. We demonstrated that use of an ontology enabled subsumption queries on FANTOM5 dataset, and we also demonstrated reusability of an existing data source by using the Human genome 19 reference sequence RDF resource in the EVS nanopublication model. 5 Acknowledgements We gratefully acknowledge the statistical advice from Erik van Zwet. Funding from ODEX4all (NWO 560 650.002.002) and the European Commission (FP-7 project RD-Connect, grant agreement No. 305444) are gratefully acknowledged. This work was carried out on the Dutch national e-infrastructure with the support of SURF Foundation. References 1. Berners-Lee, T., Bizer, C., Heath, T.: Linked data-the story so far. Int. J. Semantic Web Inf. Syst. 5, 1–22 (2009). 2. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synth. Lect. Semantic Web Theory Technol. 1, 1–136 (2011). 3. Patrinos, G.P., Cooper, D.N., van Mulligen, E., Gkantouna, V., Tzimas, G., Tatum, Z., Schultes, E., Roos, M., Mons, B.: Microattribution and nanopublication as means to incentivize the placement of human genome variation data into the public domain. Hum. Mutat. 33, 1503–1512 (2012). 4. Gibson, A., Van Dam, J.C., Schultes, E.A., Roos, M., Mons, B.: Towards Computational Evaluation of Evidence for Scientific Assertions with Nanopublications and Cardinal Assertions. In: Proceedings of the 5th International Workshop on Semantic Web Applications and Tools for Life Sciences (SWAT4LS) Paris, France. pp. 28–30 (2012). 5. The FANTOM Consortium and the RIKEN PMI and CLST (dgt): A promoter-level mammalian expression atlas. Nature. 507, 462–470 (2014). 6. Fokkema, I.F.A.C., Taschner, P.E.M., Schaafsma, G.C.P., Celli, J., Laros, J.F.J., Dunnen, J.T. den: LOVD v.2.0: the next generation in gene variant databases. Hum. Mutat. 32, 557–563 (2011). 7. Tatum, Z., Roos, M., Gibson, A.P., Taschner, P.E., Thompson, M., Schultes, E.A., Laros, J.F.: Preserving sequence annotations across reference sequences. J. Biomed. Semant. 5, S6 (2014). 8. Eilbeck, K., Lewis, S.E., Mungall, C.J., Yandell, M., Stein, L., Durbin, R., Ashburner, M.: The Sequence Ontology: a tool for the unification of genome annotations. Genome Biol. 6, R44 (2005). 9. NHLBI GO Exome Sequencing Project: NHLBI GO Exome Sequencing Project (ESP) Exome Variant Server, http://evs.gs.washington.edu/EVS/. 10. Bard, J., Rhee, S.Y., Ashburner, M.: An ontology for cell types. Genome Biol. 6, R21 (2005). 11. Mungall, C.J., Torniai, C., Gkoutos, G.V., Lewis, S.E., Haendel, M.A.: Uberon, an integrative multi-species anatomy ontology. Genome Biol. 13, R5 (2012).