Detecting Inner-Ear Anatomical and Clinical Datasets in the Linked Open Data (LOD) Cloud Muntazir Mehdi, Aftab Iqbal, Yasar Khan, Stefan Decker, and Ratnesh Sahay Insight Centre for Data Analytics, NUI Galway, Ireland {firstname.lastname}@insight-centre.org Abstract. Linked Open Data (LOD) Cloud is a mesh of open datasets coming from different domains. Among these datasets, a notable amount of datasets belong to the life sciences domain linked together forming an interlinked “Life Sciences Linked Open Data (LSLOD) Cloud”. One of the key challenges for data publishers is to identify and establish links between newly generated domain specific datasets and LSLOD Cloud. While a number of publishing tools exist for creating links from new to existing datasets, tools to detect domain-specific relevant datasets for linking purposes are missing. In this paper, we propose an extended technique for automatically identifying relevant datasets in LSLOD Cloud for inner-ear anatomical and clinical terminologies. We validate the proposed technique with experiments over the publicly accessible LSLOD Cloud using real- world terminologies and datasets provided by clinical organizations. 1 Introduction The Linked Open Data (LOD) Cloud in its entirety is majorly composed of datasets published and updated by different publishers coming from academia, government organizations, online communities and companies alike. Most of the datasets within the LOD Cloud are accessible via at-least one SPARQL endpoint. The Datahub1 or Mannheim Linked Data2 catalogue lists such SPARQL endpoints available on the Web (though indeed some are offline or unreliable [1]). The LOD Cloud also comprises of 500 million links across datasets3 , following the fourth Linked Data principle: “links to related data”. From the perspective of a consumer, these links allow for recursively discovering and navigating detailed information about related entities elsewhere on the Web. From the perspective of a publisher, links encourage modularity, where high-quality links (once in place) can reduce the amount of content they need to host. From the perspective of the Web, these links form the mesh upon which the Web of Data is based. However, creating links with external LOD datasets is a challenging task for publishers. Addressing this challenge, a number of linking frameworks, such as 1 http://datahub.io/group/lodcloud; l.a. 2015/08/15. 2 http://linkeddatacatalog.dws.informatik.uni-mannheim.de/dataset; l.a. 2015/08/15. 3 http://lod-cloud.net/state/; l.a. 2015/08/15. Silk [11] and LIMES [7], have been proposed to help publishers link their local datasets to a remote LOD dataset. Given that there are now hundreds of remote datasets and many of them are black-boxes that do not describe their content [1], in practice it becomes a challenging task for a publisher to find potential datasets in order to setup links with the LOD Cloud. Currently, new publishers require knowledge of available LOD datasets. The publishers thus search and manually inspect the LOD datasets. In certain cases, the content of remote datasets are usually described using VoID4 , SPARQL 1.1 Service Descriptions, and specialized vocabulary (or ontology), which may help, but these are not available for many endpoints [1, 10]. This paper is based on our previous work of finding clinical trial datasets in the LOD Cloud [5, 6], specifically in LSLOD Cloud. Our main focus is on Exact Literal Matching: a direct look-up of domain-specific keywords/terms, meaning an exact case-sensitive phrase match. In this paper we extend our proposed “Multi Matching” algorithm [6] by (i) removing duplicate literal matching results and reducing the overall query execution time; and (ii) validating the extended algorithm on a real-life cochlea (inner-ear) datasets and terminologies. The approach presented in this paper will work for any remote dataset accessed via a SPARQL endpoint and will involve efficient look-ups, as opposed to inefficient post-filtering (REGEX Filters) and non-standard SPARQL (Full-text Search). Further, we present comparative evaluation of our previously proposed approaches and approach presented in this paper for a real-world use case. The rest of the paper is as follows: Section 2 presents our methodology and describes the extension of our previous work to query SPARQL endpoints in order to determine their relevance. Section 3 presents evaluation of our proposed approaches using different metrics, seeking relevant LSLOD datasets for interlinking. Section 4 discusses related work and Section 5 concludes this paper. Before moving to the next section, we introduce our motivating scenario involving data providers who wish to publish their corpora as Linked Data. Motivating Scenario: Our work is conducted in the context of the SIFEM EU project5 , which aims at developing a semantic infrastructure interlinking an open source Finite Element Tool [2] with existing data, models and new knowledge for the multi-scale modeling of the inner-ear with regards to the sensorineural hearing loss. One of the core goals of the project is to publish biomedical datasets provided by the consortium partners as high-quality Linked Data. Table 1: Example terms from datasets. Category Example Terms Inner Ear Anatomy afferent neuron, apex of cochlea, auditory nerve, basilar membrane of cochlea, bony labyrinth, hair cell, lower turn of the cochlea etc. Clinical Variables audiometry, bone vibrator, electrical noise, micro ct image, myo- genic noise, rarefaction, rinneTest, surface electrode etc. 4 http://www.w3.org/TR/void/; l.a. 2015/08/15. 5 http://www.sifem-project.eu/; l.a. 2015/08/15. The SIFEM terminologies6 , (Table 1 provides some examples of Terminologies) are used for the identification of potential LSLOD datasets. The life sciences community have been very active within the Linking Open Data movement. Of the 1,014 linked datasets that are contained in LOD Cloud, 83 datasets relate to life sciences, incorporating 3 billion facts and 191 million links in order to form LSLOD Cloud. The proposed method in this paper probe SPARQL endpoints of LSLOD datasets for detecting inner-ear anatomy and clinical datasets for linking. 2 Methodology: Detect Matching The Detect Matching technique, presented in this paper, uses the Query Terms (QTerms) generated using our previously proposed tools given in [5]. The QTerms are used to probe SPARQL endpoints in order to find relevant LSLOD datasets. Each of the approach (Multi-Matching (µMatch) [6], and Detect Matching given in this paper) takes as input a set of QTerms and a list of URLs for different SPARQL endpoints (SEndpoints). We generate SEndpoints by logging URLs of all candidate SPARQL endpoints in a particular domain. For our use-case, we created a list of SEndpoints specifically for the life sciences domain. Therefore, we considered 35 SPARQL endpoints that are made publicly available by Bio2RDF7 . Algorithm 1: DetMatch: Detect Matching Algorithm Input: A QTerm, A language tag (lang-tag) Output: A finite set of SPARQL endpoints (EPSet) 1 SEndpoints := set of cataloged LSLOD SPARQL Endpoints; 2 for EP in SEndpoints do 3 values := ProcessTerm(QT erm, lang-tag); 4 add := QueryAndLog(values, EP); 5 if add = true then 6 EP Set := EP Set ∪ {EP }; 7 end 8 end 9 return EP Set; 10 ProcessTerm(QT erm, lang-tag) 11 P Case := Proper Case QT erm; LCase := lower case QT erm; U Case := UPPER CASE QT erm; 12 T ypedQT erm := Typed QT erm; 13 T ypedP Case := Typed P Case; T ypedLCase := Typed LCase; T ypedU Case := Typed U Case; 14 //Type := http://www.w3.org/2001/XMLSchema#string 15 values := QT erm ||P Case ||LCase ||U Case || 16 QT erm@lang-tag ||P Case@lang-tag ||LCase@lang-tag ||U Case@lang-tag || 17 T ypedQT erm ||T ypedP Case ||T ypedLCase ||T ypedU Case; 18 return values; 19 QueryAndLog(values, EP ) 20 match := false; 21 Query := create SPARQL query; 22 Result := supply Query to EP and retrieve results; 23 if Result not empty then 24 match := true; 25 log Result; 26 end 27 return match; 6 http://www.sifem-project.eu/sites/default/files/sigUploadedFiles/D5.4. 1_Clinical%20Knowledge%20Documentation_final.pdf; l.a. 2015/08/15. 7 http://download.bio2rdf.org/release/3/release.html; l.a. 2015/08/15. In [6], we presented Direct-Matching (DMatch) approach that creates a single SPARQL query per QTerm and Multi-Matching µMatch approach that creates eight different SPARQL queries for each QTerm. Although, (µMatch) covers possible variants for a QTerm that may exist at a specified SPARQL endpoint but it is a costly operation (as we observe later in the Section 3), because each QTerm is queried eight times on a specified SPARQL endpoint. A QTerm can be represented with and without a language tag in a specific LSLOD dataset. For example, consider a QTerm that appears as “cochlea” and “cochlea”@en on a particular LSLOD dataset, therefore, µMatch consider hits twice for the same QTerm. Our third approach is refined to exploit the VALUES clause provided in SPARQL 1.1, which can be used to provide a list of values that are used for query evaluation. By using VALUES keyword, we are able to supply all possible variants (e.g., Cochlea, cochlea, COCHLEA, Cochlea@en,Cochlea@en, cochlea@en, COCHLEA@en) for a QTerm in a single SPARQL query rather than executing multiple queries on a specified SPARQL endpoint (as shown in the Listing 1.1). Moreover, using the VALUES feature of SPARQL 1.1, we are able to eliminate the duplicate matched results for a single QTerm. The pseudo-code of DetMatch is shown in Algorithm 1. SELECT DISTINCT ? s ? p WHERE { ? s ?p ?o . VALUES ? o { ” C o c h l e a ” ” C o c h l e a ” ” c o c h l e a ” ”COCHLEA” ” C o c h l e a ”@en ” C o c h l e a ”@en ” c o c h l e a ”@en ”COCHLEA”@en ” C o c h l e a ”ˆˆ h t t p : / / . . / . . / XMLSchema#s t r i n g ....... } Listing 1.1: Example SPARQL query for DetMatch approach. 3 Evaluation In Section 2, we presented three different approaches to probe public SPARQL endpoints for its relevance to a local dataset. In this section, we will compare the results of our proposed approaches by carrying out a series of experiments on the datasets provided by our clinical partners. All experiments were conducted on a computer running 64-bit Windows7 OS, with 4GB RAM and an Intel Core i5 (2.53 GHz) CPU. We use a MySQL RDBMS to store experimental data, including the endpoints to search and the results of successful queries. For experimentation, we consider a total of 220 clinical terminologies (CTerms) (some terminologies are exemplified in Table 1), which results in a total of 589 Query Terms (QTerms). And as mentioned in Section 2, we consider a total of 35 public endpoints from Bio2RDF. Queries are sent to the public endpoints over HTTP using the standard SPARQL protocol. 3.1 Results We compare the results of our experiments based on three metrics: (i)Matched Results (MR):represents the total number of query results obtained for all terms against all SPARQL endpoints; (ii) Matched Terms (MT): represents the number of distinct terms for which non-empty results were found for at least one SPARQL endpoint; and (iii) Matched Endpoints (ME): represents the number of distinct endpoints for which some term generated a non-empty result 8 . A detailed comparison of MT, MR and ME based on the three approaches (DMatch, µMatch and DetMatch) is presented in Fig. 1–3 respectively. In general, we see a significant increase in the number of matched terms (cf. Fig. 1) when comparing DMatch approach against µMatch and DetMatch approaches. This is obvious due to the usage of different variants (i.e., lowercase, uppercase, proper case etc.) for each QTerm while probing SPARQL endpoints. The reason behind same number of matched terms for µMatch and DetMatch is because of using VALUES clause for DetMatch to supply all possible variants for each QTerm in a single SPARQL query whereas in µMatch, we executed multiple queries to cover all possible variants for each QTerm. Fig. 1: Matched Terms Fig. 2: Matched Results Fig. 3: Matched End- (MT) comparison. (MR) comparison. points (ME) comparison. The significant increase in the number of matched terms (MT) subsequently increases the number of matched results (MR) for µMatch and DetMatch in contrast to DMatch approach (cf. Fig. 2). Comparing the matched results of µMatch against DetMatch (cf. Fig. 2), we see a decline in the later approach. The reason is the elimination of duplicate matched results for a QTerm using DetMatch approach, which is not handled in µMatch due to the usage of multiple queries. For example, consider an example term oxytocic that is contained in the BioPortal SPARQL endpoint9 . This term is defined using dct:title and obo:exact synonym predicates with the values “oxytocic”@en and “oxytocic”∧∧ http://www.w3.org/2001/XMLSchema#string respectively. In the case of DetMatch, the term oxytocic is matched only once but in the case of µMatch, it is matched twice due to the usage of multiple variants for a QTerm. With respect to the number of SPARQL endpoints that are matched (cf. Fig. 3), 8 To illustrate these metrics, consider an example where a set of two QT erms = { “afferent”, “neuron” } is searched using either DMatch, µMatch or DetMatch on SEndpoints = { “BioPortal”, “ClinicalTrials”, “MeSH” }, where the returned results contained 3 matches of “neuron” in “BioPortal”, 2 matches in “ClinicalTrials” and 1 match in “MeSH”. Similarly, the returned results for “afferent” had 0 matches in “BioPortal”, “ClinicalTrials” and “MeSH”. Then MR = 6, MT = 1 and ME = 3. 9 http://bioportal.bio2rdf.org/sparql; l.a. 2015/08/15. we observe that µMatch and DetMatch produce better results than DMatch approach because of the usage of different possible variants for each QTerm. Finally, we compare the query execution time for the three approaches (DMatch, µMatch and DetMatch). The time (in seconds) is calculated based on the execution time of all QTerms (that are present in one dataset) on a specific SPARQL endpoint. Fig. 4 shows the average query execution time (based on 35 SPARQL endpoints that are considered) for the three different approaches on our datasets. Fig. 4: Query execution time. From Fig. 1 and Fig. 3, we see that µMatch and DetMatch have identified same number of matched terms as well as SPARQL endpoints but the overall execution time of µMatch is much higher than that of DetMatch (cf. Fig. 4). The reason is, µMatch executes eight SPARQL queries per QTerm, however, DetMatch uses all possible combinations for a QTerm in a single SPARQL query, thus reducing the overall query execution time. Similarly, it can also be noticed that the query execution time of DetMatch (cf. Fig. 4) is slightly higher than that of DMatch, with the amount of returned results as better as µMatch (cf. Fig. 1 and Fig. 3). In this section, we have presented results of three different approaches (DMatch, µMatch and DetMatch) to probe SPARQL endpoints for its relevance to a local dataset. By evaluating the results based on three different metrics (MT, MR and ME) along with query execution time, we see that DetMatch approach turns out to be the best option for discovering external LSLOD datasets that are potential targets for interlinking with local dataset. The reason is, DetMatch has the ability to state all possible variants for each QTerm in a single SPARQL query, which lacks in the DMatch approach and not cost effective in the case of µMatch approach due to the 8× query-load. From the publisher’s perspective, if completeness of results is primarily important with less importance to the efficiency of discovery process (execution time), then we can say that µMatch and DMatch approaches are best suited for discovering public SPARQL endpoints that are potential targets for interlinking to a local dataset. 4 Related Work This work aims to find relevant datasets on the LOD Cloud, LSLOD in particular. In past the majority of approaches taken in exploiting the generic and noisy LOD Cloud are performed in two categories (i) identify schema-level links (using rdfs:subClassOf, owl:equivalentClass) between class hierarchies among different RDF and OWL datasets; and (ii) identify instance-level links (using owl:sameAs) among different RDF and OWL datasets. Instance-level links: Leme at al. [3] identifies datasets for interlinking and ranks them using probability measures based on a set of analyzed features. The proposed approach suggests links amongst different data sources using high level information, while schema- or instance-level information is not taken into account. Nikolov et al. [8, 9] propose an approach to identify relevant datasets for interlinking consisting of two steps: (1) searching relevant entities in other datasets using keywords; and (2) filtering irrelevant datasets based on semantic concept similarities using ontology matching techniques. We previously mentioned that there are a number of linking frameworks available for Linked Data, including Silk [11] and LIMES [7]. Both of these works provide a declarative language for guiding the creation of links between datasets based on predicates. However, both of these tools presume that a SPARQL endpoint for the target dataset is specified in the input. Our work addresses the prior step of identifying public endpoints of LSLOD datasets that are interesting to link to. Maali et al. [4] propose an extension of the Google Refine tool to curate and RDFize local datasets. The extension can help find legacy URIs for entities from target endpoints specified by the user. The authors propose using custom full-text search over SPARQL endpoints to find relevant URIs for keyword terms in the local dataset; e.g., using bif:contains over Virtuoso endpoints. However, they presume that the endpoints of interest are manually specified by the user. Other works have discussed the difficulties of discovering relevant SPARQL endpoints on the Web. For example, Buil Aranda et al. [1] note that few structured descriptions are available for endpoints. Paulheim and Hertling [10] also note that the discovery of endpoints is a difficult problem and propose methods to find endpoints given a URI of interest. 5 Conclusion In this paper, we extend our approach of direct look-up of domain-specific keywords/terms in the LSLOD Cloud. We have been able to remove – compared to previous approach [5, 6] – duplicate literal matching results and reduce the overall query execution time. The extended approach is validated using the domain specific anatomical and clinical cochlea (inner-ear) terminologies. We envision that our approach will complement the state-of-the-arts in schema and instance linking tools for the LOD Cloud. Our approach will support linking frameworks to generate automated links (without requiring prior knowledge of LSLOD datasets), this in turn will help domain experts and scientists to explore domain-specific information contained in publicly available infrastructure like the LOD Cloud. The three algorithms (DMatch, µMatch and DetMatch) have certain advantages and trade-offs in context of query execution time, number of results, and results replication, therefore, we intentionally avoid any fit-for-all suggestion. Depending on different clinical contexts, i.e., set of input clinical terminologies, user may select one of the three algorithms or their combination targeting specific clinical scenario. Acknowledgements: This publication has emanated from research supported in part by the research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 and EU project SIFEM (contract Number 600933). References 1. C. B. Aranda, A. Hogan, J. Umbrich, and P.-Y. Vandenbussche. SPARQL Web- querying infrastructure: Ready for action? In International Semantic Web Confer- ence (2), pages 277–293. Springer, 2013. 2. V. Isailovic, M. Obradovic, D. Nikolic, I. Saveljic, and N. D. Filipovic. SIFEM project: Finite element modeling of the cochlea. In 13th IEEE International Conference on BioInformatics and BioEngineering, BIBE 2013, Chania, Greece, November 10-13, 2013, pages 1–4, 2013. 3. L. A. P. P. Leme, G. R. Lopes, B. P. Nunes, M. A. Casanova, and S. Dietze. Identifying candidate datasets for data interlinking. In Web Engineering, pages 354–366. Springer, 2013. 4. F. Maali, R. Cyganiak, and V. Peristeras. Re-using cool URIs: Entity reconciliation against LOD hubs. In Linked Data On the Web (LDOW) Workshop. CEUR, 2011. 5. M. Mehdi, A. Iqbal, A. Hasnain, Y. Khan, S. Decker, and R. Sahay. Utilizing domain-specific keywords for discovering public SPARQL endpoints: a life-sciences use-case. In Symposium on Applied Computing, SAC 2014, Gyeongju, Republic of Korea - March 24 - 28, 2014, pages 333–335. ACM, 2014. 6. M. Mehdi, A. Iqbal, A. Hogan, A. Hasnain, Y. Khan, S. Decker, and R. Sahay. Discovering domain-specific public SPARQL endpoints: a life-sciences use-case. In 18th International Database Engineering & Applications Symposium, IDEAS 2014, Porto, Portugal, July 7-9, 2014, pages 39–45. ACM, 2014. 7. A.-C. N. Ngomo and S. Auer. LIMES – a time-efficient approach for large-scale link discovery on the Web of Data. In T. Walsh, editor, IJCAI, pages 2312–2317. IJCAI/AAAI, 2011. 8. A. Nikolov and M. d’Aquin. Identifying relevant sources for data linking using a Semantic Web index. In Linked Data On the Web (LDOW) Workshop. CEUR, 2011. 9. A. Nikolov, M. d’Aquin, and E. Motta. What should I link to? Identifying relevant sources and classes for data linking. In Joint International Semantic Technology Conference (JIST), pages 284–299. Springer, 2011. 10. H. Paulheim and S. Hertling. Discoverability of SPARQL endpoints in Linked Open Data. In ISWC (Posters & Demos), pages 245–248. CEUR, 2013. 11. J. Volz, C. Bizer, M. Gaedke, and G. Kobilarov. Silk - a link discovery framework for the Web of Data. In Linked Data On the Web (LDOW) Workshop. CEUR, 2009.