Increasing the nanopublication recall with a BridgeDb Identifier Mapping Service

Egon Willighagen [0000-0001-7542-0286]

Dept. Bioinformatics - BiGCaT, NUTRIM, Maastricht University, NL
egon.willighagen@maastrichtuniversity.nl

Abstract. The volume of literature in the life sciences is continuously growing and keeping up with it is a problem. While review articles and databases help us by summarizing vast amounts of research, dissemination of core research outcomes is still mostly restricted to scholarly journals. Nanopublications have been proposed as a solution to capture scientific statements. This led to a 2010 proposal to serialize nanopubs in the Resource Description Framework (RDF) and in 2016 to an international network of nanopublication servers. However, RDF has the limitation that the Internationalized Resource Identifier (IRI) of a resource does not have to be normalized or unique. To overcome this issue, the Open PHACTS project developed an Identifier Mapping Service and an approach called scientific lenses for mapping equivalent IRIs. Here we demonstrate the application of this approach to improve the recall of nanopublications from the international network.

Keywords: nanopublications · identifier · BridgeDb.

1 Introduction

The volume of literature in the life sciences is continuously growing and keeping up with it is a problem [1]. While literature reviews and databases summarize vast amounts of research, dissemination of core research outcomes is still mostly restricted to the written word. Nanopublications have been proposed to capture scientific statements together with provenance pointing to the origin of each statement [2]. This led to a 2010 proposal to serialize nanopubs in the Resource Description Framework (RDF) [3]. In 2016, Kuhn et al. introduced an international network of connected servers to host nanopublications, with initial data sets from DisGeNET [4], neXtProt [5], and others [6].
Nanopublications for WikiPathways were added later [7]. The network currently hosts about 10 million nanopublications [8].

However, RDF has the limitation that IRIs for resources carry no meaning and do not have to be unique. This allows an effectively unlimited number of IRIs for the same resource. Because authors of RDF content are not meant to mint IRIs under domain names they do not control, and often wish to make resource IRIs dereferenceable, this is exactly what we see in practice: different data sets use different IRIs for the same gene, protein, or metabolite.

The Open PHACTS project faced the same problem when it started to link different pharmacology data sets [9], including ChEMBL [10] and WikiPathways [11]. To overcome this issue, it developed an Identifier Mapping Service (IMS) based on BridgeDb [12] and the scientific lenses approach, which allows mapping of equivalent IRIs [13,14]. The alternative, of course, is to normalize IRIs in data sets before integration, e.g. with identifiers.org [15], but this removes the flexibility to change the level of equivalence depending on the data analysis being done [13,14]. The IMS made identifier mapping an integral part of the Open PHACTS Linked Data API, hiding the need to map equivalent IRIs when querying the underlying data sets.

Our hypothesis is therefore that if we use an IMS as outlined here, loaded with appropriate scientific lenses, we will find more nanopubs for a particular biological entity. To test this hypothesis, we searched the international network of nanopublication servers for nanopubs with information about a set of genes.

2 Methods

To implement our workflow, an R Markdown document was developed that performs the steps detailed below. The full document is available at https://github.com/egonw/swat4hcls2018/. It uses a few R packages and introduces a helper function to simplify searching for nanopublications.
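
The core idea of the workflow can be sketched as follows. This is a minimal Python illustration and not the R Markdown code from the repository: the two lookup functions, the toy dictionaries, and the example.org IRIs are hypothetical stand-ins for the IMS and the nanopublication server, and counting distinct nanopublications is an assumption made here for clarity.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Toy stand-ins (hypothetical): in the real workflow these roles are played
# by the BridgeDb IMS and a nanopublication server queried over the network.
TOY_MAPPINGS: Dict[str, List[str]] = {
    "http://example.org/ensembl/G1": [
        "http://example.org/ensembl/G1",
        "http://example.org/ncbigene/1",
    ],
}
TOY_NANOPUBS: Dict[str, List[str]] = {
    "http://example.org/ensembl/G1": ["np:a"],
    "http://example.org/ncbigene/1": ["np:a", "np:b", "np:c"],
}


def toy_map_iris(iri: str) -> List[str]:
    """Return the IRIs regarded as equivalent to `iri` (falls back to `iri`)."""
    return TOY_MAPPINGS.get(iri, [iri])


def toy_find_nanopubs(iri: str) -> List[str]:
    """Return identifiers of the nanopublications that mention `iri`."""
    return TOY_NANOPUBS.get(iri, [])


def recall_gain(
    gene_iri: str,
    map_iris: Callable[[str], Iterable[str]],
    find_nanopubs: Callable[[str], Iterable[str]],
) -> Tuple[int, int]:
    """Count nanopubs for the original IRI and the extra ones found via equivalents."""
    original = set(find_nanopubs(gene_iri))
    expanded = set(original)
    for iri in map_iris(gene_iri):
        expanded.update(find_nanopubs(iri))
    # Assumption: a nanopub reachable through several IRIs is counted once.
    return len(original), len(expanded - original)


if __name__ == "__main__":
    n_original, n_additional = recall_gain(
        "http://example.org/ensembl/G1", toy_map_iris, toy_find_nanopubs
    )
    print(f"original: {n_original}, additional via equivalent IRIs: {n_additional}")
```

Passing the two lookups in as functions keeps the counting logic independent of how the mapping service and the nanopublication server are actually reached.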

Genes. As a representative data set we took two pathways from the list of most viewed WikiPathways (https://www.wikipathways.org/index.php/Special:PopularPathwaysPage). Pathway WP241, about human one carbon metabolism [16], was selected because it uses mostly NCBI Gene identifiers. WP2059 was selected as a second example, with predominantly Ensembl gene identifiers [17]. The rrdf package was used to query all genes in these pathways from the WikiPathways SPARQL endpoint [18].

Nanopublication Server. As source of nanopubs we took a nearby server from the international network of mirroring servers, hosted by the Institute for Data Sciences at Maastricht University (http://graphdb.dumontierlab.com/repositories/nanopubs) [6]. To simplify the interaction with the server, we took advantage of an online running instance of grlc (grlc.io) [19] that wraps the nanopublication server API with an OpenAPI [20]: https://github.com/peta-pico/nanopub-api.

BridgeDb Identifier Mapping Server. For the IRI mapping, we used the Docker image of the IMS developed by the Open PHACTS project (https://hub.docker.com/r/openphacts/identitymappingservice/) [12,13,9]. This image was recently repurposed by Ehrhart et al. for gene-variant mappings [21]. The IMS is started and loaded with IRI mapping data as explained in [21]: 1. the user starts the Docker image, and then 2. loads the identifier mapping files into the IMS instance using a loading script. The actual mapping files were generated by J. Mélius based on data from Ensembl 87 (see http://bridgedb.org/data/linksets/current/HomoSapiens/; the source code can be found at https://github.com/BiGCAT-UM/EnsemblLinksetsCreator). These linksets map Ensembl identifiers to NCBI Gene, HGNC, and others.

Data Analysis. The R Markdown notebook integrates the three aforementioned components into a single analysis. To test the hypothesis, it first retrieves the genes from the two WikiPathways and for each gene it searches for nanopubs.
This is done by looking up equivalent gene IRIs using the IMS server and then searching for nanopublications for each of these IRIs. It then counts the nanopublications found for the original IRI only and for all equivalent IRIs, and reports the difference.

3 Results

With the R Markdown notebook we searched for nanopublications for two popular WikiPathways, WP241 and WP2059. The first has mostly NCBI Gene identifiers and the second mostly Ensembl gene identifiers. The script reports that NCBI Gene identifiers return the most nanopublications when searching the full nanopublication network, indicating that nanopublication data sets prefer NCBI Gene identifier-based IRIs. This observation affects the number of additional nanopubs found via equivalent IRIs: for WP241 we indeed find a high number of nanopublications when using only the IRI returned by the WikiPathways SPARQL endpoint, and a lower number for WP2059. For WP241 we get on average 464 nanopublications (min: 10, max: 1000, median: 288), while for WP2059 we retrieve on average 21 nanopublications (min: 0, max: 1000, median: 1). The count is currently capped at 1000 nanopublications by the grlc API that wraps the nanopublication server, which explains the maximum of exactly 1000.

As hypothesized, using equivalent IRIs, as returned by the IMS, retrieves additional nanopublications. The number of equivalent IRIs returned by the IMS differs between the two pathways. For WP241 it returns on average 9 IRIs (min: 7, max: 23, median: 7) and for WP2059 it returns on average 11 IRIs (min: 7, max: 29, median: 10). The returned IRIs are a mix of mappings to other database sources and different IRI patterns for the same database identifier.

Indeed, with these additional IRIs, more nanopublications are found on the international nanopublication network. Furthermore, when the additional IRIs include NCBI Gene identifier-based IRIs, we find a higher number of additional nanopublications.
This observation is similar to the one made when using only the original NCBI Gene identifier-based IRI, as explained above. Therefore, we find fewer additional nanopublications for WP241, which uses predominantly NCBI Gene identifiers: on average we find 17 additional nanopublications (min: 0, max: 230, median: 3). However, for WP2059, which uses predominantly Ensembl identifiers, we find a much higher number of additional nanopublications, on average 44 (min: 0, max: 1099, median: 0).

4 Discussion

The results show that we indeed find more nanopublications about a given gene when equivalent IRIs are included. However, some aspects must be noted. First, the enrichment is only as good as the completeness and quality of the gene-gene identifier mapping link sets. The link sets used in this study are biased towards equivalent IRIs based on Ensembl and NCBI Gene identifiers. If nanopublication data sets use other gene identifiers, these will still not be found. Besides this completeness issue, the author had the impression that some mappings were missing, something that will be explored further. A more elaborate analysis is planned, involving more pathways and more identifier sources.

Another effect of this completeness aspect is that the link sets used only cover gene-gene identifier mappings. However, the scientific lenses approach also allows gene-RNA and gene-protein mappings, under a lens that equates genes and proteins. This is particularly relevant for WikiPathways, where genes and proteins are frequently treated as equivalent. The ability to load such additional link sets, combined with the IMS feature to turn lenses on and off, would allow returning even more nanopublications.

5 Conclusion

The results show that using an IRI mapping service increases the recall when searching for nanopublications, overcoming the problem that nanopublications do not (and should not need to) normalize IRIs.
This paper demonstrates this by applying a locally installed BridgeDb IMS and an R script in combination with online nanopublication services.

References

1. Pain, E.: How to keep up with the scientific literature. Science (November 2016)
2. Mons, B., Velterop, J.: Nano-Publication in the e-Science Era. In: Proceedings of the Workshop on Semantic Web Applications in Scientific Discourse (SWASD 2009), collocated with the 8th International Semantic Web Conference (ISWC-2009), Washington DC, USA, October 26, 2009. Volume 523, CEUR Workshop Proceedings (October 2009)
3. Groth, P., Gibson, A., Velterop, J.: The anatomy of a nanopublication. Information Services and Use 30(1) (January 2010) 51–56
4. Queralt-Rosinach, N., Kuhn, T., Chichester, C., Dumontier, M., Sanz, F., Furlong, L.I.: Publishing DisGeNET as nanopublications. Semantic Web 7(5) (June 2016) 519–528
5. Chichester, C., Gaudet, P., Karch, O., Groth, P., Lane, L., Bairoch, A., Mons, B., Loizou, A.: Querying neXtProt nanopublications and their value for insights on sequence variants and tissue expression. Web Semantics: Science, Services and Agents on the World Wide Web 29 (December 2014) 3–11
6. Kuhn, T., Chichester, C., Krauthammer, M., Queralt-Rosinach, N., Verborgh, R., Giannakopoulos, G., Ngonga Ngomo, A.C., Viglianti, R., Dumontier, M.: Decentralized provenance-aware publishing with nanopublications. PeerJ Computer Science 2 (August 2016) e78
7. Kuhn, T., Willighagen, E., Evelo, C., Queralt-Rosinach, N., Centeno, E., Furlong, L.I.: Reliable Granular References to Changing Linked Data. In d’Amato, C., Fernandez, M., Tamma, V., Lecue, F., Cudré-Mauroux, P., Sequeda, J., Lange, C., Heflin, J., eds.: The Semantic Web – ISWC 2017. Volume 10587. Springer International Publishing, Cham (2017) 436–451
8. Kuhn, T., Meroño-Peñuela, A., Malix, A., Poelen, J., Hurlbert, A., Centeno, E., Furlong, L.I., Queralt-Rosinach, N., Chichester, C., Banda, J., Willighagen, E., Ehrhart, F., Evelo, C., Malas, T., Dumontier, M.: Nanopublications: A Growing Resource of Provenance-Centric Scientific Linked Data. In: Proceedings of IEEE eScience 2018, arXiv.org (September 2018)
9. Williams, A.J., Harland, L., Groth, P., Pettifer, S., Chichester, C., Willighagen, E.L., Evelo, C.T., Blomberg, N., Ecker, G., Goble, C., Mons, B.: Open PHACTS: semantic interoperability for drug discovery. Drug Discovery Today 17(21-22) (November 2012) 1188–1198
10. Willighagen, E.L., Waagmeester, A., Spjuth, O., Ansell, P., Williams, A.J., Tkachenko, V., Hastings, J., Chen, B., Wild, D.J.: The ChEMBL database as linked open data. Journal of Cheminformatics 5(1) (2013) 23
11. Waagmeester, A., Kutmon, M., Riutta, A., Miller, R., Willighagen, E.L., Evelo, C.T., Pico, A.R.: Using the Semantic Web for Rapid Integration of WikiPathways with Other Biological Online Data Resources. PLOS Computational Biology 12(6) (June 2016) e1004989
12. Van Iersel, M.P., Pico, A.R., Kelder, T., Gao, J., Ho, I., Hanspers, K., Conklin, B.R., Evelo, C.T.: The BridgeDb framework: standardized access to gene, protein and metabolite identifier mapping services. BMC Bioinformatics 11(1) (January 2010) 5
13. Brenninkmeijer, C., Evelo, C., Goble, C., Gray, A.J.G., Groth, P., Pettifer, S., Stevens, R., Williams, A.J., Willighagen, E.L.: Scientific Lenses over Linked Data: An Approach to Support Task Specific Views of the Data. A Vision. In: Linked Science 2012 - Tackling Big Data. (2012)
14. Batchelor, C., Brenninkmeijer, C.Y.A., Chichester, C., Davies, M., Digles, D., Dunlop, I., Evelo, C.T., Gaulton, A., Goble, C., Gray, A.J.G., Groth, P., Harland, L., Karapetyan, K., Loizou, A., Overington, J.P., Pettifer, S., Steele, J., Stevens, R., Tkachenko, V., Waagmeester, A., Williams, A., Willighagen, E.L.: Scientific Lenses to Support Multiple Views over Linked Chemistry Data. In Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C., Vrandečić, D., Groth, P., Noy, N., Janowicz, K., Goble, C., eds.: The Semantic Web – ISWC 2014. Volume 8796. Springer International Publishing, Cham (2014) 98–113
15. Juty, N., Le Novère, N., Laibe, C.: Identifiers.org and MIRIAM Registry: community resources to provide persistent identification. Nucleic Acids Research 40(D1) (January 2012) D580–D586
16. Adriaens, M., Evelo, E., Willighagen, E., Marianthi, F., Kalafati, Pico, A., Kelder, T., Slenter, D., Hanspers, K., Txr24, Egoyenechea, Fehrhart, Thakur, G., Jeff: One Carbon Metabolism (Homo sapiens)
17. Salomonis, N., Hanspers, K., Fehrhart, Pico, A., Willighagen, E., Kelder, T., Mélius, J.: Alzheimers Disease (Homo sapiens)
18. Willighagen, E.: Accessing biological data in R with semantic web technologies. Technical report, PeerJ Inc. (March 2014)
19. Meroño-Peñuela, A., Hoekstra, R.: grlc Makes GitHub Taste Like Linked Data APIs. In Sack, H., Rizzo, G., Steinmetz, N., Mladenić, D., Auer, S., Lange, C., eds.: The Semantic Web. Volume 9989. Springer International Publishing, Cham (2016) 342–353
20. Sferruzza, D., Rocheteau, J., Attiogbé, C., Lanoix, A.: Extending OpenAPI 3.0 to Build Web Services from their Specification. In: Proceedings of the 14th International Conference on Web Information Systems and Technologies, Seville, Spain, SCITEPRESS - Science and Technology Publications (2018) 412–419
21. Ehrhart, F., Melius, J., Cirillo, E., Kutmon, M., Willighagen, E.L., Coort, S.L., Curfs, L.M., Evelo, C.T.: Providing gene-to-variant and variant-to-gene database identifier mappings to use with BridgeDb mapping services. F1000Research 7 (September 2018) 1390