A new database for drug-discovery addresses key issues in mining of knowledge

Ole Kristian Ekseth and Svein-Olav Hvasshovd
Department of Computer Science (IDI), NTNU, Trondheim, Norway

Copyright held by the author(s). NOBIDS 2017

Abstract. The lives of individuals are strongly influenced by their health. An example concerns salinity-resistant plants, an invention which may alleviate issues of climate change and rising sea levels. A different issue concerns drug discovery for humans, such as accurate and inexpensive cures available for the poor, personalized drugs, etc. In drug discovery the applied strategy is to combine domain experts with data made accessible through off-the-shelf software, and from the latter expect to identify new drugs. While computational drug-discovery is known to work when the number of candidate-factors is sufficiently small, the established methods and software are unfeasible for mining in big-data knowledge bases.

In this paper we address the above issue. We present a holistic approach for searches in big-data with complex relations. We demonstrate how our novel strategies for integration of large heterogeneous data-sets result in knowledge discovery. In our work we address issues of: semantics, entity similarity, clustering, data-engine, hypothesis testing, and user-interfaces. To verify our approach we integrate data from 37 external data-resources, resulting in a database with more than 30 million bio-medical relationships. When we compare our findings with existing literature we observe how our holistic approach for big-data mining discovers 1000+ novel candidates for drug interaction. To address key issues in knowledge discovery we have constructed 10+ new software-approaches for data-mining, tools which enable the development of a new method for mining of big-data. To enable reuse of our approaches, they are available from: http://www.knittingTools.org/, http://www.knittingTools.org/gui_lib_mine.cgi, https://bitbucket.org/oekseth/mine-data-analysis/downloads/, and https://bitbucket.org/oekseth/hplysis-cluster-analysis-software.

1 Introduction

In life-science a recurring task is to understand how and why entities relate: to construct a hypothesis which translates discrete observations into a conceptual figure capturing core-traits of an evaluated subject, as exemplified in Fig. 3 for the research of [1]. An example concerns the effects of the Cytoplasmic Phospholipase A2 (cPLA2) enzyme, which is associated with a number of diseases, such as Alzheimer [2] and Rheumatism [3]. In knowledge-discovery researchers use manual approaches to identify candidate interactions, as exemplified in [4] where the authors use literature to manually construct a “heterogeneous network with 351 node” [4].

In contrast to established approaches for data-mining, an understanding of drug-interactions requires the analysis of possible interactions, as exemplified in Fig. 1. While the “PubMed” database [5] contains “more than 27 million citations for biomedical literature” [5], the “Unified Protein Resources (UniProt)” describes more than 47 million protein sequences [6]. “The Economist” asserts that 50 per cent of published research-literature is erroneous [7], hence the established use of manually selected research findings to identify new drug candidates is challenging.

The high cost of drug development discourages the development of drugs for the poor [8]. The cost of developing a single drug varies from $802 million to $2.2 billion [9]. The drug-company “AstraZeneca” spends on average $11+ billion ([10,11]) on each accepted drug. The main cost of drug development is the number of failed drugs [12], e.g., as observed by [13]: “only, one in 5,000 medicines makes it to the market” [14].
It is important to address the above issues in drug discovery, as “today’s pharmaceutical industry cannot sustain sufficient innovation” [15] with today’s cost of drug-development. Hence the importance of accurate tools for knowledge discovery.

In this paper we relate the above perspectives, demonstrating how a new holistic approach for mining of big data enables user-interactive drug discovery. In the method and associated software we unify user-centric and software-centric approaches for data-mining, as depicted in Fig. 5. What we assert is that a holistic approach which increases the accuracy and performance of data from disparate sources, software for mining, and tacit understanding, is sufficient to address major issues in drug discovery, a view supported by [15]. “R&D efficiency represents the ability of an R&D system to translate inputs (for example, ideas, investments, effort) into defined outputs (for example, internal milestones that represent resolved uncertainty for a given project or product launches), generally over a defined period of time” [15]. The ensemble of methods and software, summarized in Fig. 5, addresses challenges which have prevented established semantic data-bases from knowledge discovery, e.g., as observed with respect to the issues encountered by [17,?,19]. In the work we have identified and addressed the issues of:

1. Disparate data: automatic approaches to unify distinctively different data, where results are exemplified in Fig. 1;
2. Execution-time: high-performance software for accurate and large-scale data-mining, as exemplified in Fig. 4;
3. User searches: interactive real-time data-mining which stimulates use of tacit knowledge, as exemplified in Fig. 3 and Fig. 6.

The remainder of the paper is organised as follows. In section 2 we briefly survey related approaches, before we in section 3 describe the approach.
In the result-section 4 we evaluate and discuss how the holistic approach addresses current issues in big-data mining. This paper ends with a brief summary of observations in section 6.

[Figure 1 plot: “The semantic inference distance for predicates”. x-axis: semantic inference distance (1 to 8); y-axis: count of (predicate, distance=X) pairs (0 to 10,000). The legend lists the evaluated predicates.]

Fig. 1: Semantic inference and knowledge discovery. The above figure counts each inferred relationship for a subset of the predicates in our database. In the figure, predicates at a semantic inference distance ≥ 2 capture inferences not known in public drug-discovery databases, hence there are 1000+ identified candidates for knowledge discovery. To increase accuracy of predictions the predicates, depicted as legends, are constructed from a unification of data from “UniProt” [6] and “BioPax” [16].

2 Related Work

A challenge in data-mining concerns the slow performance of software, as observed for [20] in Fig. 4. A possible explanation of the latter is an unawareness of high-performance software implementation strategies [21]. To exemplify, the major efforts in “systems biology is on developing fundamental computational and informatics tools” [22], an assertion motivated by how “a concerted effort to bring all the useful tools for pathway analysis in a common platform is still missing” [23]. When combining the observations of ([22,?]) and Fig. 4 we realize how poor-performing software represents a hurdle in knowledge discovery. To summarize, we observe that established approaches for data-mining suffer from:

1. Disparate data: insufficient data-coverage and prediction, e.g., in [23,24,25,26,17];
2. Execution-time: high query response-time, e.g., in [20,?];
3. User searches: user-interfaces which limit domain-experts from accurate data-searches and result-interpretation, e.g., in [18,28,29,30].

In the following we briefly examine the above issues, focusing on issues concerned with disparate data and execution-time.

[Figure 2 diagram. Layers, from top to bottom: “web-interface” (user asks a question/query; view result), “web-server” (parse: apply ‘syntactic sugar’ and ‘validation of syntax’; parse: concatenate ‘provenance’ and ‘relationship’), “data-server::client” (request query to be computed; receive result, and handle warnings, if any), “data-server::computer” (interpret/parse query; compute query; construct result-representation). Legend: Logical Complexity.

Text embedded in the figure: The figure illustrates the ‘flow of data’ initiated when a user requests the computation of a query: how a large number of code/logical modules are involved in an operation which from the outside (i.e., by a non-expert) may be seen as a trivial task. Of interest is to observe the modules involved in the handling of the user-request. For the “web-interface” the CSS and HTML are responsible for the layout, while JQuery, JavaScript and Perl-CGI handle the user-input. The user-input is received by the Perl-CGI-based “web-server” (currently running on an “Apache” framework), which constructs a valid KNIT-TSV query. The KNIT-TSV query is received by the “data-server::client” written in “C” (where the programming-language “C” is chosen to simplify the interaction with the “Apache” web-server), which initiates a communication with the Knitting-Tools SKB-server and thereafter sends the query in question to the SKB-server. The Knitting-Tools “data-server::computer” (which is written in the programming-languages “C” and “C++”) accesses the already loaded SKB-data-object, before evaluating the query and returning the result to the “data-server::client”, which again returns the result to the Perl-CGI-based “web-server”. The “web-server” (a) relates the provenance-relationships to each of the relationships (thereby simplifying the viewing of the result), (b) constructs “JavaScript” and HTML data-structures (for simplifying and ‘beautifying’ the user experience and user-interaction), before (c) constructing different data-representations/layouts (e.g., tables, heat-maps, radial views, etc.). The result is then seen by the user (through the web-interface), where the JQuery and JavaScript support allows user-interaction (i.e., without the data being reloaded for every request).]

Fig. 2: Computational complexity versus user-interaction in knowledge searches. The above figure relates user-searches to the computational inference process. This reflects the flow of information, and complexity of software, for user-queries supported at www.knittingTools.org. While major research efforts are invested in improving performance of software categorized in the upper part of the figure, their return on investment is limited, i.e., as captured from the figure. In contrast the majority of our efforts are invested in improving the software modules depicted in the bottommost part of the above figure, as exemplified in Fig. 5.

2.1 Execution-Time: Tools for similarity, feature-selection and clustering

There are more than 10^6 research-works concerned with data-mining (see footnote 1). An example concerns the k-means cluster-algorithm, where new permutations are published every year, e.g., with respect to [31]. The work of [32] observes how existing software for data-analysis under-utilizes computer-hardware. A popular software-tool for cluster-analysis is the “cluster C” software [20]. From Fig. 4 we observe how the approach manages to outperform the software of [20] by a factor of 100,000x+.
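
To make the measured operation concrete, the following is a minimal sketch of a pairwise “Kendall’s Tau” computation over the rows of a matrix. It is a textbook Tau-a written for illustration; it is not the implementation used in hpLysis or in the “cluster C” library [20], and the function names `kendall_tau_a` and `pairwise_kendall` are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Naive O(n^2) Kendall's Tau-a for two equal-length numeric sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether the pair (i, j) is ordered
        # the same way in both sequences; tied pairs (product 0) count as neither.
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def pairwise_kendall(rows):
    """Symmetric matrix holding Tau-a for every pair of rows (illustrative helper)."""
    m = len(rows)
    matrix = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i, m):
            tau = kendall_tau_a(rows[i], rows[j])
            matrix[i][j] = tau
            matrix[j][i] = tau
    return matrix


if __name__ == "__main__":
    # Made-up sample rows, used only to exercise the functions.
    data = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
    for row in pairwise_kendall(data):
        print(row)
```

For m rows of length n this naive formulation performs on the order of m^2 * n^2 comparisons, which suggests why the execution-time of this metric grows quickly with matrix size.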
While [33] provide a GPU-optimized implementation of “DB-SCAN” [34], the GPU implementation's limited support for user-defined parameters results in inaccurate cluster-predictions for numerous data-sets [35], e.g., with respect to issues in missing data and similarity-metrics. In order to evaluate accuracy of cluster-algorithms, application of feature-selection, and many-dimensional hypothesis-testing, metrics for cluster-consistency are used [36]. Examples of cluster-consistency metrics are “Silhouette” [37], “Sum of Squared Error (SSE)” [38], “Rand’s Index” [39] and “Rand’s Adjusted Index” [40], etc. Therefore, accurate software for data mining needs to be optimized both with respect to the number and the execution-time of integrated metrics.

Footnote 1: Observation from searching on “Google Scholar” for terms such as PCA, k-means, Sum of Squared Error (SSE), spearman, Euclid, correlation, similarity, etc.

2.2 Disparate data and Execution-Time: Engines for data-access

A major challenge in big-data analytics concerns the slow performance of database-engines [41,23,22,42]. To exemplify, the authors of [23] assert that there is no sound computational framework for database-management. The work of [43] observes how “big data analytics requires technologies to efficiently process large quantities of data” [43]. To address the performance lag in database-engines current approaches seek to pre-compute queries [17,44,45], reduce RDF-datasets through slicing [46], etc. However, a prevailing issue concerns the high time-cost of queries: the search-engine of [47] uses more than 13 minutes to evaluate a simple query. What may be argued is that the choice of accurate data-engines may address the performance issue. There is a large number of different data-engines for high-performance querying of semantic data [48,49,50,?,?]. One of these is the “Sesame” data-engine [53], a data-engine which is unable to provide real-time query-answer-time to simple queries [54]. Our earlier work [55] identifies how the established B-tree ([56,41]) data-structure results in a 10,000x+ performance-delay when compared to accessing data stored in an in-memory 2d-sparse data-structure. In [55] we demonstrate how a 2d sparse data-structure may be used as an alternative to established data-engines, a work which observes how application of a 2d sparse data-structure outperforms MySQL by 10,000,000x+ for important bio-medical queries.

3 Method: A holistic method for knowledge discovery

In the integration of real-time user access to 30 million bio-medical relationships we have faced the challenges described in research, as exemplified in section 2. From the works of others we realize that it is not feasible to follow the established strategies. To exemplify, major efforts by [17,?,19] are placed on translating data-formats into RDF. However, their approaches have not resulted in knowledge discovery. In the unification of data-resources we have addressed issues in assimilation of the graph-structured “BioPax” [16] formats and evidence-annotations in “OBO” [60], i.e., where the latter by definition is not supported by the “SPARQL” query-language. An example of erroneous name-mappings is seen for an entity with name “HDR” asserted by [61,?] to be an exact synonym of the “gata3” gene. In contrast the established view is that “HDR” describes a mechanism in cells [63]. The latter example is one of many fallacies observed in multi-origin databases, hence integration of data needs to take care when using assertions from unreliable sources.

Fig. 3: Our support for knowledge-inferences in filtered data. The above figure exemplifies how our approach enables knowledge-discovery, as described in [57]. Each of the sub-figures represents distinct data-sets capturing different hypotheses in [1]. The differences and similarities between the sub-figures provide clues of how guinea pigs develop. Importantly, the above separation between entities reflects the findings in [1], hence our interactive data-mining approach provides support for accurate and fast data-filtering of user-defined data-sets.

A different aspect concerns the execution-time of user-queries, exemplified in Fig. 2. To address the high time-cost of translating external data-bases into RDF, and searching RDF data-stores, we have designed a data-engine which accepts semantic relationships. When measuring the response-time of queries we observe how our new data-engine addresses issues in execution-time, as described in [55]. Our semantic data-engine addresses issues such as:

1. Disparate data: integration of evidence annotations, hence fewer relationships to investigate during evidence-centered user-queries;
2. Execution-time: memory-cache aware data-searches, effective use of SSE [64], memory-tiling [65], etc.;
3. User searches: pre-computation of statistics, and ranks, for database-vertices enables accurate suggestions when users type names of entities, as exemplified in Fig. 6.

The above described strategy exemplifies approaches which reduce search-time without introducing erroneous heuristics, e.g., in contrast to [18]. Fig. 5 presents a summary of the approaches undertaken to optimize the performance aspects of bio-medical knowledge discovery, hence a holistic method for data-mining. To exemplify, from Fig. 5 we observe how the holistic approach addresses issues in disparate data through a combination of manually curated rules (to address quality issues in data-resources, e.g., the “HDR” use-case), application of clustering to unify entities both with respect to their database-resource (e.g., “uniprot”), etc.
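
The idea of an in-memory, row-indexed sparse store for semantic relationships, with provenance kept on every relationship, can be sketched as follows. This is only an illustration under stated assumptions: it is not the data-engine described in [55], and the class name `SparseRelationStore`, its methods, and the sample entity and source names are hypothetical.

```python
class SparseRelationStore:
    """Hypothetical in-memory store: one sparse row of relationships per subject."""

    def __init__(self):
        self._id_of = {}     # entity name -> integer id
        self._name_of = []   # integer id -> entity name
        self._rows = []      # integer id -> list of (predicate, object id, source)

    def _intern(self, name):
        # Map each entity name to a dense integer id, so that a row is
        # reached by list indexing rather than by a tree traversal.
        idx = self._id_of.get(name)
        if idx is None:
            idx = len(self._name_of)
            self._id_of[name] = idx
            self._name_of.append(name)
            self._rows.append([])
        return idx

    def add(self, subject, predicate, obj, source):
        """Store one relationship together with its provenance (source)."""
        subject_id = self._intern(subject)
        object_id = self._intern(obj)
        self._rows[subject_id].append((predicate, object_id, source))

    def relations(self, subject):
        """Return (predicate, object name, source) for every stored relationship."""
        idx = self._id_of.get(subject)
        if idx is None:
            return []
        return [(p, self._name_of[o], src) for p, o, src in self._rows[idx]]


if __name__ == "__main__":
    # Made-up relationships; the names carry no biological meaning.
    store = SparseRelationStore()
    store.add("entityA", "regulates", "entityB", "sourceX")
    store.add("entityA", "cites", "entityC", "sourceY")
    print(store.relations("entityA"))
```

Only the relationships that exist are stored, and a lookup touches a single row, which is the property that separates this kind of layout from a general-purpose index.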
[Figure 4 plot: “Time-cost of pairwise similarity-metrics in established software-libraries”. y-axis: User Time [s]; x-axis label as extracted: “Time to compute pairwise similarity for a squared matrix”. Legend: Kendall: “cluster-C”; Kendall: “hpLysis”.]

Fig. 4: Time-difference of our hpLysis software versus established approaches. The above figure captures the performance-difference of different strategies to compute the pairwise similarity metric of “Kendall’s Tau” [58,59]. While the bottommost legend represents the time-cost of our hpLysis software, the topmost legend captures the time-cost of the popular “Cluster C” library [20]. The figure demonstrates how the hpLysis approach out-performs [20] by a factor of 100,000x+.

4 Result

The holistic approach for drug-discovery, introduced in this paper, is constructed to relate domain-experts to accurate interactions mined from big data. From the sub-sections below we assert that the approach manages to correctly address the issues described in section 2.

4.1 Disparate data: Data-access in the bio-medical domain

There are more than 220 different knowledge sources in the bio-medical domain [66]. Correctness and usability are characteristics which describe the most popular tools for knowledge integration. An example tool is cPath, written by [66], which is built around a MySQL database [67]. From the performance measurements of data-structures in sub-section 4.4 we observe how semantic searches through MySQL result in a 10,000,000x+ performance-delay. The best tools provide access to data which has been manually curated by field experts, such as the Reactome tool [26] or the BioGrid tool [25]. The back-bone of the tools is often the Gene Ontology (GO) [68], which is used to define the lexicographic order of the genes and proteins. Translating compartmentalized knowledge into an ontology for reasoning, such as the RDF format, is seen in [24,45]. The problem with both approaches is the performance and quality issues in knowledge discovery.

4.2 Execution-Time: Relationship between implementation, execution-time, and their influence

In data-mining the execution-time of software may render high-quality analytical approaches useless, e.g., as inferred for large data-sets in Fig. 4. The application of established implementation-strategies results in under-performing code because compilers struggle to identify strategies for performance tuning (there would otherwise not have been a time-difference between different software implementations). Below we list a subset of observations from the holistic optimization of approaches for data-mining:

1. Search-time: the test-cases listed in sub-section 4.4 relate the time-cost of semantic searches in our novel database-engine to the established use of B-trees ([56,41]), observing a time-difference of 10,000x+;
2. Data-mining: Fig. 4 compares the time-cost of strategies for computing “Kendall’s Tau”. When our approach is compared to the popular “cluster-C” software [20], a time-difference of 100,000x+ is observed;
3. Software complexity: Fig. 5 exemplifies how accurate and fast user-searches involve steps in data-curation, analysis of semantic similarity, and construction of user-interfaces.

The above observations indicate that the application of low-level optimization strategies is a central part in efforts for mining of big-data, reflecting observations in section 2, hence the importance of a holistic optimization strategy.

4.3 User searches: application centered perspective

A sound interaction between software-tools and human domain-experts is seen as an essential part of knowledge discovery. Below we exemplify a subset of the strategies we have applied in the holistic approach (Fig. 5):

1. Semantic user-interactions: Fig. 6 exemplifies how users are provided with support for semantic queries (topmost sub-figures), interactive exploration (sub-figures in the middle) and signature queries (bottom-right sub-figure);
2. Testing hypotheses: in Fig. 3 we observe how a combination of the “MINE” metric [69] and our web-based framework for data-mining facilitates knowledge discovery on filtered query-subsets.

A weakness of our approach is its complexity, a complexity which maintains several (??) isolated parts. A strength of our layout is that the (relative) disjointness of the components makes it easier to verify correctness (and, to some extent, to automatically fix issues in naming of entities).

[Figure 5 diagram: labelled boxes grouped by background rectangles. Legend: Disparate data; Execution-time; User-searches. Box labels include: Pool of KBs and DBs; Standardization (sanitation); Data unification; Knitting-Tools-CORE (KT); Knitting-Tools-SKB; Curated axioms; Semantic Network; Reasoning; Feature Selection; Similarity; Clustering; Test hypothesis; GUI-query-API access; Visualisation through www.knittingTools.org.]

Fig. 5: A holistic approach for data-mining. The above figure depicts a collection of labelled boxes, such as Pools of KBs and DBs and Visualization. The legend-text (top-right) describes the classification of the different background rectangles, as discussed in section 3. An example of a classification concerns the process of handling disparate data, where tasks for data-parsing, entity optimization, and format-unification are combined into an automated approach. The size of the background-boxes reflects their computational complexity, as discussed in Fig. 2. The uniqueness of our approach concerns how we relate the existing strategies into one unified model, thereby avoiding overheads associated with generalised approaches (such as RDF centered integration-strategies). To exemplify, when approaches such as [19] apply standardization they use standardized rules for all of the integrated data-resources, hence entities from different data-resources are syntactically correct while semantically inaccurate.

4.4 Reproducibility: interfaces to validate and elaborate our approaches for data-mining

The results, summarized in this paper, may be reproduced through application of our software, as listed below:

1. Semantics and data: http://www.knittingTools.org/;
2. MINE data-mining: http://www.knittingTools.org/gui_lib_mine.cgi;
3. MINE high-performance software: https://bitbucket.org/oekseth/mine-data-analysis/downloads/;
4. Software for data-analysis: https://bitbucket.org/oekseth/hplysis-cluster-analysis-software.

5 Use-cases: how the holistic approach improves drug discovery

The holistic approach presented in this paper manages to address issues in big-data analysis. The term big data depends on the complexity of data, algorithms, and use-cases to be evaluated, e.g., where [70] asserts that a study of 857 proteins implies a large-scale analysis.
This section therefore seeks to address strategies for:

1. Disparate data: algorithms and data which may be accurately queried;
2. Execution-time: why the enabled performance-increase is important in drug discovery;
3. User searches: how domain-experts may identify accurate predictions from complex data.

Fig. 6: A user-interface for semantic queries. The above sub-figures depict the user-interface for semantic searches, a user-interface designed for domain-experts without interest in programming. The two uppermost sub-figures depict the query-form for submitting conditional questions. The drop-down menu observed in the top-left figure exemplifies the support for auto-complete. In the middle-right sub-figure the query-results are seen. To handle the frequent issue of 1000+ identified relationships, the table includes filter-options. When a user filters a subset of the query-result, it is visualized in the middle-left sub-figure.

5.1 Disparate data and user-searches

The “Knitting-Tools” web-server includes a number of pre-computed use-cases (http://knittingtools.org/examples.cgi). The results have been manually investigated and verified. Accuracy of predictions depends on data (Fig. 1) and accessibility (Fig. 6). The use-cases below provide a brief introduction to how complex searches may be applied on big-data.

Use-case(1): What is known for “notch2”? (http://knittingtools.org/query.cgi?queryID=get_allRelations_forA_vertex). The use-case illustrates an exploratory search to identify all relationships, synonyms and provenance (e.g., the set of databases) describing a vertex of interest, e.g., “notch2”. It exemplifies the use of basic search functionality to fetch relationships in Knitting-Tools, both for visual evaluation (Fig. 6), and as a prior data-gathering step before application of software for pattern identification (sub-section 5.2 and sub-section 5.3).

Use-case(2): What is known for “cdk4”, and how was this known?
(http://knittingtools.org/query.cgi?queryID=get_allRelations_forA_vertex_evidence_and_synonyms). Extends use-case(1) with logic to fetch the synonymous vertices (for each vertex in the set of identified relations) and the provenance for each relationship. Provides insight into why the identified relationships were predicted, i.e., their provenance.

Use-case(3): Identify the regulations associated with the important event of apoptosis (i.e., ‘controlled cell death’). (http://knittingtools.org/query.cgi?queryID=intro_basic_bio_2). The query identifies relationships associated with pathways and regulations for chemical entities, proteins, genes, and pathways. In the result provenance is associated with each relationship, a provenance which becomes visible when clicking the green-plus button in the result-table (Fig. 6).

5.2 Pattern identification and usability

The MINE software combines a highly accurate algorithm for pattern identification [69] with a web-interface for interactive data-exploration (Fig. 3). To evaluate the applicability of the MINE web-based software (http://www.knittingTools.org/gui_lib_mine.cgi) the data-sets of [71], [72], and [1] are evaluated. While the data-sets by [1,71] provide explanation factors for growth of guinea pigs, [72] analyses the variation in the guinea pigs' coat-spots. The conclusions presented by the authors are supported through application of the MINE web-interface.

5.3 Execution-Time and knowledge discovery

For large data-sets the prediction accuracy relates directly to the execution-time of data-mining software, i.e., users are otherwise forced to explore smaller data-samples and test fewer hypotheses. The paragraphs below exemplify how our proposed approach increases accuracy of large-scale data-analysis.

Application(1). Large-scale ontology engineering (https://bitbucket.org/oekseth/hplysis-ontology/).
We have developed a new hpLysis-onto software for high-performance engineering of bio-medical ontologies. Ontologies are used in a large number of applications, e.g., to identify similarities of gene products from experimental outcomes [73]. The hpLysis-onto software addresses performance issues in computation of transitive closures and transitive reductions, an issue hampering analysis of large and complex data-sets. For the task of computing transitive closures for all vertices, the software of [74] consumes more than one day on the 24 MB “Gene Ontology” [68]. In contrast, the hpLysis-onto software manages to answer the latter query in less than one second, hence a significant improvement in performance.

Application(2). Large-scale semantic similarity (https://bitbucket.org/oekseth/hplysis-cluster-analysis-software). The hpLysis software is updated with a new high-performance library for computation of 20+ semantic similarity metrics. “Generally speaking, semantic similarity measures involve the GO tree topology, information content of GO terms, or a combination of both” [75]. The software proposed by [75] takes several hours to complete. The hpLysis-semantic software improves the performance of established software approaches by 1000x+, i.e., without reducing the prediction accuracy. The latter is enabled through increased utilization of computer memory hardware. Semantic similarity-metrics are used to identify important traits in data-sets [76], e.g., to (1) relate hypothetical assumptions to gene-expression-levels [3] and (2) with respect to “Word Sense Disambiguation” (WSD) for automated analysis of text-corpuses [77].

Application(3). Many-dimensional data-analysis ([36]). The hpLysis software provides an API for high-performance computation of 20+ cluster-algorithms, 320+ pairwise similarity-metrics, 10+ metrics for string-similarity, and 20+ metrics for cluster-validity.
The work enables a performance-improvement of 600x+ for pairwise similarity-metrics such as “Canberra” and “Cosine”, and a 100,000x+ performance-improvement for “Kendall’s Tau” (Fig. 4). In large-scale data-analysis the execution-time severely hampers the types of relationship which may be explored, e.g., when analysing gene-expression data-sets for possible interactions, when using pathway-relationships (sub-section 5.1), application of ontology annotations for similarity-assessment, mining of bibliometric data-bases (e.g., in [78]), etc. Therefore, hpLysis improves both accuracy of data-generation and analysis of user-defined data (Fig. 6).

6 Conclusion and Future Work

We have presented a method, a database, and 10+ software tools for data-mining. This paper argues that the holistic approach, which captures an ensemble of approaches and high performance software, manages to overcome the current hurdles in big-data drug-discovery. Fig. 6 exemplifies how domain-experts may interact with our real-time support for querying 30+ million bio-medical relationships. In order to evaluate the quality of the approach we investigate the number of accurate and unique relationships identified in our approach, exemplified in Fig. 1. The 1000+ novel candidate interactions which are identified highlight the ability of our approach to automatically identify relationships which are not known in literature. Through an optimized data-engine the relationships are accessible for users in real-time, exemplified in Fig. 3.

In this paper we have described an approach to unify our semantic interface (www.knittingTools.org) with our high-performance software application (e.g., https://bitbucket.org/oekseth/hplysis-cluster-analysis-software). Through concrete use-cases we have exemplified how the approach addresses issues in disparate data, execution-time, and user searches, i.e., parameters which are critical in discovery of knowledge.
From the examples we observe how our method and 10+ novel software approaches address issues in big-data drug-discovery. Therefore, we assert that our novel holistic approach may influence strategies for mining of big-data.

6.1 Future Work

We plan to address the weaknesses of the user-interfaces and the unknown quality of our knowledge inferences. To improve our user-interfaces we are initiating usability testing for different target groups. Similarly, we have initiated efforts to evaluate the drug-impact of our putative knowledge discoveries. Both issues require year-long lab-experiments, hence the importance of the quality and performance enabled through our novel method and software.

Acknowledgements

The authors would like to thank MD K.I. Ekseth at UIO, Dr. O.V. Solberg at SINTEF, Dr. S.A. Aase at GE Healthcare, MD B.H. Helleberg at NTNU–medical, Dr. Y. Dahl, Dr. T. Aalberg, Dr. J.C. Meyer, and K.T. Dragland at NTNU, and the High Performance Computing Group at NTNU for their support.

References

1. McPhee, H.C., et al.: Genetic growth differentiation in guinea pigs. Technical report, United States Department of Agriculture, Economic Research Service (1931)
2. Stephenson, D.T., Lemere, C.A., Selkoe, D.J., Clemens, J.A.: Cytosolic phospholipase A2 (cPLA2) immunoreactivity is elevated in Alzheimer’s disease brain. Neurobiology of disease 3(1), 51–63 (1996)
3. Feuerherm, A.J., Johansen, B.: Rheumatoid arthritis treatment. US Patent App. 13/783,088 (2013)
4. Zhao, M., Yang, C.C.: Mining online heterogeneous healthcare networks for drug repositioning. In: Healthcare Informatics (ICHI), 2016 IEEE International Conference On, pp. 106–112 (2016). IEEE
5. National Center for Biotechnology Information: PubMed data-base for biomedical literature. https://www.ncbi.nlm.nih.gov/pubmed/ (2017)
6. UniProt Consortium: UniProt: the universal protein knowledgebase. Nucleic acids research 45(D1), 158–169 (2017)
7. The Economist: How science goes wrong: Trouble at the lab. The Economist 409(8858), 21–24 (2013)
8. Cuatrecasas, P.: Drug discovery in jeopardy. Journal of Clinical Investigation 116(11), 2837 (2006)
9. DiMasi, J.A., Grabowski, H.G., Hansen, R.W.: Innovation in the pharmaceutical industry: new estimates of r&d costs. Journal of health economics 47, 20–33 (2016)
10. Al-Huniti, N.: Quantitative Decision-Making in Drug Development. http://www.phuse.eu/download.aspx?type=cms&docID=5334 (2013)
11. Herper, M.: The Truly Staggering Cost Of Inventing New Drugs. https://www.forbes.com/sites/matthewherper/2012/02/10/the-truly-staggering-cost-of-inventing-new-drugs/#1a0129244a94 (2012)
12. DiMasi, J.A., Grabowski, H.G., Hansen, R.W.: The cost of drug development. New England Journal of Medicine 372(20), 1972–1972 (2015)
13. Association, C.B.R., et al.: Fact sheet: New drug development process. FDA Special Consumer Report
14. Thomas, K.: The price of health: the cost of developing new medicines. https://www.theguardian.com/healthcare-network/2016/mar/30/new-drugs-development-costs-pharma (2016)
15. Paul, S.M., Mytelka, D.S., Dunwiddie, C.T., Persinger, C.C., Munos, B.H., Lindborg, S.R., Schacht, A.L.: How to improve r&d productivity: the pharmaceutical industry’s grand challenge. Nature reviews. Drug discovery 9(3), 203 (2010)
16. Demir, E., Cary, M.P., Paley, S., Fukuda, K., Lemer, C., Vastrik, I., Wu, G., D’Eustachio, P., Schaefer, C., Luciano, J., et al.: The biopax community standard for pathway data sharing. Nature biotechnology 28(9), 935–942 (2010)
17. Blonde, W.: Metarel, an ontology facilitating advanced querying of biomedical knowledge. PhD thesis, Department of Mathematical Modelling, Statistics and Bioinformatics, Ghent University, Ghent, Belgium (2012)
18. Antezana, E., Blond, W., Egaña, M., Rutherford, A., Stevens, R., De Baets, B., Mironov, V., Kuiper, M.: Biogateway: a semantic systems biology tool for the life sciences. BMC bioinformatics 10(10), 11 (2009)
19. Belleau, F., Nolin, M.-A., Tourigny, N., Rigault, P., Morissette, J.: Bio2rdf: towards a mashup to build bioinformatics knowledge systems. Journal of biomedical informatics 41(5), 706–716 (2008)
20. de Hoon, M.J., Imoto, S., Nolan, J., Miyano, S.: Open source clustering software. Bioinformatics 20(9), 1453–1454 (2004)
21. Hucka, M., Finney, A., Sauro, H.M., Bolouri, H., Doyle, J.C., Kitano, H.: The erato systems biology workbench: enabling interaction and exchange between software tools for computational biology (2002)
22. Butcher, E.C., Berg, E.L., Kunkel, E.J.: Systems biology in drug discovery. Nature biotechnology 22(10), 1253 (2004)
23. Chowdhury, S., Sarkar, R.R.: Comparison of human cell signaling pathway databases: evolution, drawbacks and challenges. Database 2015 (2015)
24. Beisswanger, E., Lee, V., Kim, J.-J., Rebholz-Schuhmann, D., Splendiani, A., Dameron, O., Schulz, S., Hahn, U., et al.: Gene regulation ontology (gro): design principles and use cases. Studies in health technology and informatics 136, 9 (2008)
25. Stark, C., Breitkreutz, B.-J., Chatr-Aryamontri, A., Boucher, L., Oughtred, R., Livstone, M.S., Nixon, J., Van Auken, K., Wang, X., Shi, X., et al.: The biogrid interaction database: 2011 update. Nucleic acids research 39(suppl 1), 698–704 (2011)
26. Croft, D., O’Kelly, G., Wu, G., Haw, R., Gillespie, M., Matthews, L., Caudy, M., Garapati, P., Gopinath, G., Jassal, B., et al.: Reactome: a database of reactions, pathways and biological processes. Nucleic acids research 39(suppl 1), 691–697 (2011)
27. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12(Oct), 2825–2830 (2011)
28. Clustergrammer, J.: Clustergrammer heatmap visualization. http://amp.pharm.mssm.edu/clustergrammer/
29. Metsalu, T., Vilo, J.: Clustvis: a web tool for visualizing clustering of multivariate data using principal component analysis and heatmap. Nucleic Acids Research 43(W1), 566–570 (2015). doi:10.1093/nar/gkv468
30. Tan, C.M., Chen, E.Y., Dannenfelser, R., Clark, N.R., Maayan, A.: Network2canvas: network visualization on a canvas with enrichment analysis. Bioinformatics 29(15), 1872–1878 (2013). doi:10.1093/bioinformatics/btt319
31. Chen, Y., Zeng, Y., Luo, F., Yuan, Z.: A new algorithm to optimize maximal information coefficient. PloS one 11(6), 0157567 (2016)
32. Mekkat, V., Natarajan, R., Hsu, W.-C., Zhai, A.: Performance characterization of data mining benchmarks. In: Proceedings of the 2010 Workshop on Interaction Between Compilers and Computer Architecture, p. 11 (2010). ACM
33. Andrade, G., Ramos, G., Madeira, D., Sachetto, R., Ferreira, R., Rocha, L.: G-dbscan: A gpu accelerated algorithm for density-based clustering. Procedia Computer Science 18, 369–378 (2013)
34. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Kdd, vol. 96, pp. 226–231 (1996)
35. Ekseth, O.K., Hvasshovd, S.-O.: How an optimized DB-SCAN implementation reduce execution-time and memory-requirements for large data-sets. Accepted for publication (2017)
36. Ekseth, O.K.: hpLysis: a high-performance software-library for big-data machine-learning. https://bitbucket.org/oekseth/hplysis-cluster-analysis-software/. Online; accessed 06. June 2017
37. Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics 20, 53–65 (1987)
38. Lloyd, S.: Least squares quantization in pcm. IEEE transactions on information theory 28(2), 129–137 (1982)
39. Rand, W.M.: Objective criteria for the evaluation of clustering methods. Journal of the American Statistical association 66(336), 846–850 (1971)
40. Yeung, K.Y., Ruzzo, W.L.: Details of the adjusted rand index and clustering algorithms, supplement to the paper an empirical study on principal component analysis for clustering gene expression data. Bioinformatics 17(9), 763–774 (2001)
41. Jagadish, H., Olken, F.: Database management for life sciences research. ACM SIGMOD Record 33(2), 15–20 (2004)
42. Eltabakh, M.Y., Ouzzani, M., Aref, W.G., Elmagarmid, A.K., Laura-Silva, Y., Arshad, M.U., Salt, D., Baxter, I.: Managing biological data using bdbms. In: Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference On, pp. 1600–1603 (2008). IEEE
43. Lau, L., Yang-Turner, F., Karacapilidis, N.: Requirements for big data analytics supporting decision making: A sensemaking perspective. In: Mastering Data-Intensive Collaboration and Decision Making, pp. 49–70. Springer (2014)
44. Blond, W., Antezana, E., Mironov, V., Schulz, S., Kuiper, M., Baets, B.D.: Using the relation ontology Metarel for modelling Linked Data as multi-digraphs (2012)
45. Blond, W., Mironov, V., Antezana, E., Venkatesan, A., Baets, B.D., Kuiper, M.: Reasoning with bio-ontologies: using relational closure rules to enable practical querying. Oxford Bioinformatics 27, 1562–1568 (2011). doi:10.1093/bioinformatics/btr164
46. Marx, E., Shekarpour, S., Soru, T., Braşoveanu, A.M., Saleem, M., Baron, C., Weichselbraun, A., Lehmann, J., Ngomo, A.-C.N., Auer, S.: Torpedo: Improving the state-of-the-art rdf dataset slicing. In: Semantic Computing (ICSC), 2017 IEEE 11th International Conference On, pp. 149–156 (2017). IEEE
47. Papanikolaou, N., Pavlopoulos, G.A., Pafilis, E., Theodosiou, T., Schneider, R., Satagopam, V.P., Ouzounis, C.A., Eliopoulos, A.G., Promponas, V.J., Iliopoulos, I.: Biotextquest+: a knowledge integration platform for literature mining and concept discovery. Bioinformatics 30(22), 3249–3256 (2014)
48. Kolpakov, F., Poroikov, V., Sharipov, R., Kondrakhin, Y., Zakharov, A., Lagunin, A., Milanesi, L., Kel, A.: Cyclonet: an integrated database on cell cycle regulation and carcinogenesis. Nucleic acids research 35(suppl 1), 550–556 (2007)
49. Demir, E., Babur, Ö., Rodchenkov, I., Aksoy, B.A., Fukuda, K.I., Gross, B., Sümer, O.S., Bader, G.D., Sander, C.: Using biological pathway data with paxtools. PLoS computational biology 9(9), 1003194 (2013)
50. Masseroli, M., Pinoli, P., Venco, F., Kaitoua, A., Jalili, V., Palluzzi, F., Muller, H., Ceri, S.: Genometric query language: a novel approach to large-scale genomic data management. Bioinformatics 31(12), 1881–1888 (2015)
51. Mironov, V., Seethappan, N., Blond, W., Antezana, E., Splendiani, A., Kuiper, M.: Gauging triple stores with actual biological data. BMC bioinformatics 13(1), 3 (2012)
52. Wylot, M., Cudré-Mauroux, P.: Diplocloud: Efficient and scalable management of rdf data in the cloud. IEEE Transactions on Knowledge and Data Engineering 28(3), 659–674 (2016)
53. Huysmans, M., Richelle, J., Wodak, S.J.: Sesam: a relational database for structure and sequence of macromolecules. Proteins: Structure, Function, and Bioinformatics 11(1), 59–76 (1991)
54. Guo, Y., Pan, Z., Heflin, J.: Lubm: A benchmark for owl knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web 3(2), 158–182 (2005)
55. Ekseth, O.K., Hvasshovd, S.-O.: hpLysis database-engine: A new data-scheme for fast semantic queries in biomedical databases. Under review: Provides details of the in-memory data-engine: contact oekseth@gmail.com for the paper. (2017)
56. Bayer, R.: Symmetric binary b-trees: Data structure and maintenance algorithms. Acta Informatica 1, 290–306 (1972). doi:10.1007/BF00289509
57. Ekseth, K., Hvasshovd, S.: hpLysis MINE: A high-performance approach for computation of the accurate MINE similarity-metric. http://www.knittingtools.org/gui_lib_mine.cgi. Online; accessed 06. June 2017
58. Knight, W.R.: A computer method for calculating kendall’s tau with ungrouped data. Journal of the American Statistical Association 61(314), 436–439 (1966)
59. Kendall, M.G.: Rank correlation methods. (1948)
60. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., Ireland, A., Mungall, C.J., et al.: The obo foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology 25(11), 1251–1255 (2007)
61. Hoffmann, R.: A wiki for the life sciences where authorship matters. Nature genetics 40(9), 1047–1051 (2008)
62. Chawla, K., Tripathi, S., Thommesen, L., Lægreid, A., Kuiper, M.: Tfcheckpoint: a curated compendium of specific dna-binding rna polymerase ii transcription factors. Bioinformatics 29(19), 2519–2520 (2013)
63. Davis, L., Maizels, N.: Homology-directed repair of dna nicks via pathways distinct from canonical double-strand break repair. Proceedings of the National Academy of Sciences 111(10), 924–932 (2014)
64. Intel: SSE computer-hardware-low-level parallelism. https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Online; accessed 06. June 2017
65. Drepper, U.: What every programmer should know about memory. Red Hat, Inc 11, 2007 (2007)
66. Cerami, E., Bader, G., Gross, B., Sander, C.: cpath: open source software for collecting, storing, and querying biological pathways. BMC bioinformatics 7(1), 497 (2006)
67. MySQL: MySQL database engine. https://www.mysql.com/ (2017)
68. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of biology. Nature genetics 25(1), 25–29 (2000)
69. Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M., Sabeti, P.C.: Detecting novel associations in large data sets. Science 334(6062), 1518–1524 (2011)
70. Butland, G., Peregrín-Alvarez, J.M., Li, J., Yang, W., Yang, X., Canadien, V., Starostine, A., Richards, D., Beattie, B., Krogan, N., et al.: Interaction network containing conserved and essential protein complexes in escherichia coli. Nature 433(7025), 531–537 (2005)
71. Peaker, M., Taylor, E.: Sex ratio and litter size in the guinea-pig. Journal of reproduction and fertility 108(1), 63–67 (1996)
72. Wright, S., Chase, H.B.: On the genetics of the spotted pattern of the guinea pig. Genetics 21(6), 758 (1936)
73. Pesquita, C., Faria, D., Falcao, A.O., Lord, P., Couto, F.M.: Semantic similarity in biomedical ontologies. PLoS computational biology 5(7), 1000443 (2009)
74. Antezana, E., Egana, M., Baets, B., Kuiper, M., Mironov, V.: Onto-perl: An api for supporting the development and analysis of bio-ontologies. Bioinformatics (2008)
75. Ehsani, R., Drabløs, F.: Topoicsim: a new semantic similarity measure based on gene ontology. BMC bioinformatics 17(1), 296 (2016)
76. Rada, R., Mili, H., Bicknell, E., Blettner, M.: Development and application of a metric on semantic nets. IEEE transactions on systems, man, and cybernetics 19(1), 17–30 (1989)
77. McInnes, B.T., Pedersen, T.: Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. Journal of biomedical informatics 46(6), 1116–1124 (2013)
78. Aalberg, T., Žumer, M.: The value of marc data, or, challenges of frbrisation. Journal of Documentation 69(6), 851–872 (2013)