An Analytic Approach for Discovery

Eric Dull and Steven P. Reinhardt
Cray, Inc.
{edull,spr}@cray.com

Abstract—With the widespread awareness of Big Data, mission leaders now expect that the data available to their organizations will be immediately relevant to their missions. However, the continuing onslaught of the "data tsunami", with data becoming more diverse, changing nature more quickly, and growing in volume and speed, confounds all but the simplest analysis and the most capable organizations. The core challenge faced by an analyst is to discover the most important knowledge in the data. She must overcome potential errors and inaccuracies in the data and explore it, even though she understands it incompletely, guided by her knowledge of the data's domain. We have solved customer problems by quickly analyzing numerous dimensions of data to check its sanity and to compare it to expected values. Guided by what we find initially, we quickly move on to further (unanticipated) dimensions of the data; discovery depends on this ability. This approach vitally brings the analyst into direct interaction with the data. We implement this approach by exploiting the ability of graphs (vertices and edges) to represent highly heterogeneous data. We use RDF as the data representation and SPARQL for queries. We use non-parametric tests, which are readily targeted to any type of data. This graph-analytic approach, using proven techniques with widely diverse data, represents a guidepost for delivering greatly improved analysis on much more complex data.

I. INTRODUCTION

Analyst groups are being strongly challenged to understand quickly the insights latent in their organization's data, despite its diversity, changing nature, volume, and speed. We focus on the discovery aspect of analysis, where the analyst cannot rely on techniques that have run previously on the given datasets, but rather must explore within the data in a way that cannot be predicted. The organization expects that the analyst will quickly uncover the most important knowledge in the data. Using her knowledge of the data's domain, she must explore the data despite its potential imperfections.

If we define discovery as finding a connection in the data of which the analyst was previously unaware, it requires more than just delivering existing capabilities a little faster or more easily used. Rather, it requires enabling a subject-matter expert (SME), i.e., the person with the most knowledge and intuition about the data, to explore the data quickly, and even, by appropriate pre-analysis, pulling the SME's eye to aspects of the data that are likely to be most fruitful for further exploration. We have evolved an analytic approach to discovery, implemented in the semantic technologies RDF and SPARQL, that enables rapid progress through analytic questions toward an analytic goal. We have solved high-value customer problems by quickly analyzing numerous dimensions of data to check its sanity and to compare it to our expectations, then moving on to further dimensions of the data guided by what we find initially. Sometimes we analyze data and compare it to our mental expectations. Other times we compare data from one subset (place, time, etc.) to that from another subset, to see if they differ in unexpected ways. Sometimes we analyze the values of the data and other times the connectivity of the data. Sometimes we create a synthetic grid or discretization, say of geospace and time, to represent the data for fast comparison. Discovery depends on being able to compare dimensions quickly, without knowing in advance the dimensions to compare. Bringing the SME into direct interaction with the data is essential to accelerating discovery.

We implement this analytic approach by exploiting the ability of graphs (vertices and edges) to represent the richness of highly heterogeneous data and enable discovery within it. To date, we use RDF as the data representation and SPARQL for queries, queries that can build on each other to focus rapidly on the highest-value knowledge (in the estimation of the subject-matter expert) in the data. (We use heterogeneous rather than semantic, because this approach is not limited to natural language.) RDF supports complex, dynamic ontologies, though that adds a burden of discovering the current ontology, which we often achieve by summary graphs of vertex-types and the edge-types that connect them. We use Jaccard scoring [6] and non-parametric tests (typically Kolmogorov-Smirnov [7]), which are readily targeted to any type of data when guided by a SME. Other non-parametric tests could easily be used in place of these. RDF and SPARQL are one data format and language that support implementation of this approach, but it may be implemented with other technologies, such as Spark/GraphX from the Berkeley Data Analytics Stack [2].

The graph-analytic approach, using proven techniques with widely diverse data, represents a guidepost for delivering greatly better analysis on much more complex data.
II. ADDRESSING THE CORE CHALLENGE

[Fig. 1. High-level Workflow]

The core challenge facing many analytic organizations, illustrated in Figure 1, has at its center a data repository, with both new instances of existing types of data and instances of new types of data flowing into it. Results from existing analytics must flow out to meet existing production needs, while new analytics must be created to discover further knowledge in the data. This paper focuses on this discovery process, based on our experience working with a customer needing to understand traffic and vulnerabilities on its corporate network. In that context, some key entities are IP addresses, ports, protocols, and Internet domain names. Section III.B provides an example of these techniques applied to flow data.

III. ANALYTIC APPROACHES FOR DISCOVERY

We start from the point of view of an analyst with little knowledge about a body of data and proceed to the point of deep knowledge about the data. Many of the techniques we describe are useful at multiple points along this continuum.

Analysis is an iterative process that consists of framing the analytic problem, defining an approach, gathering the applicable data, understanding the biases present in the data (via sanity-checking and sense-making), applying analytic techniques consistent with the approach, determining the analytic results that answer the problem, and documenting the answer to the analytic problem.

A. Sense-making

When an analyst first gets a new corpus of data, even new instances of an existing corpus, the first task is to ascertain the sanity of the data, and then understand it deeply enough to enable further analysis. In practice, an initial sanity-check on data is often necessary, but even then errors in the data may still surface as analysis becomes more precise. For instance, if the data claims to cover the 24 hours of a day, but only has data in the hours 0 through 9, there may have been an error in the software that generates the data. In a heterogeneous context, if flow-data records use IP addresses that are not found in the firewall records, it may be that the data is truly disjoint, or it may be that the IP addresses in the flow data and firewall data have not been constructed the same and hence do not match. In either case, the anomaly needs to be understood before continuing with the data.

[Fig. 2. Discovered Ontology]

With heterogeneous data, sense-making includes discovering the ontology (or schema, in relational database terms) in which the data is represented, as that ontology is not explicitly defined anywhere [1]. A summary graph depiction of the ontology is shown in Fig. 2, with vertices representing the types of objects in the data. The edges summarize the edges between instances of the types of objects that are the subject and object of triples in the data; edge thickness represents the number of edges between the two types. While this figure is simpler than many real-world ontologies, it illustrates how the structure of the data can be readily represented to aid an analyst's understanding. Note that the visualization of the ontology as a graph is natural; whether the underlying representation is graph-based or not is immaterial at this level.
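As an illustration of how such a summary graph can be derived, the sketch below (ours, not the authors' Urika-GD implementation; the triples, the type map, and all names are hypothetical) counts how many edges of each predicate connect each pair of vertex types:

from collections import Counter

# Hypothetical RDF-style triples and a vertex -> type map (in practice these
# would come from the data set and its type assertions).
triples = [
    ("ip:10.0.0.1", "flow:contacted", "dom:example.com"),
    ("ip:10.0.0.1", "flow:usedPort",  "port:443"),
    ("ip:10.0.0.2", "flow:contacted", "dom:example.org"),
]
vertex_type = {
    "ip:10.0.0.1": "IPAddress", "ip:10.0.0.2": "IPAddress",
    "dom:example.com": "Domain", "dom:example.org": "Domain",
    "port:443": "Port",
}

# Summary graph: (subject type, predicate, object type) -> number of edges.
# Rendering these counts as edge thickness gives a picture like Fig. 2.
summary = Counter((vertex_type[s], p, vertex_type[o]) for s, p, o in triples)

for (s_type, pred, o_type), count in sorted(summary.items()):
    print(f"{s_type} -[{pred} x{count}]-> {o_type}")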
A next level of sense-making analyzes the values of a given type of edge. For instance, flow data is often discretized such that a flow cannot last more than an hour. Similarly, TCP/UDP ports are limited to 0-65535. If the data has values outside those ranges, it is suspect. (Additionally, the analyst must ensure that the right units (e.g., English v. metric) are being used.) Not all types of edges will have values that will be readily known by the analyst, but for those that are, this can be a simple way of surfacing the subject-matter knowledge of the expert, by showing the values in the data and letting the analyst respond if the data looks suspect. Beyond correctness, understanding the semantic meaning of a field, e.g., a timestamp, is important. What is the format of the timestamp (absolute, relative)? What is its precision? When was it collected? What specific event is denoted by the timestamp?

Once the basic structure of the data and its values are understood, other questions arise, such as the quantity of the data and whether that indicates that we have all the data we expected or not. E.g., if the data purports to capture all flow data from a 24-hour period, and we know there are about 30,000 flows per second, we should have about 2.6 billion flows. If we do not, we may be missing data, the data may be summarized, or there is some other effect; in any case, the analyst needs to understand the anomaly before proceeding.

Understanding the data values in more detail is another part of sense-making. What are the dominant moments, and, often as importantly, what are the outliers? Analysts know that real-world data is noisy and messy, so they will want to avoid actual noise but at the same time want to understand rare but real events accurately. In addition to looking at the total data, we often gain insight by comparing data from entities selected from different ranges in some important dimension, like place or time, to see whether the values in a given dimension also vary; e.g., are the top N most frequent external domains in Internet flow data from the day before yesterday and yesterday similar? Or, is the distribution of bytes per packet the same for systems in Europe as in North America?

This phase of the workflow is critical to an analytic process with meaningful, repeatable results. Performing it requires tools, methods, and (most significantly for time-sensitive analysis) time at the beginning of the analytic process.
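The value-range and quantity checks described above are simple to mechanize. A minimal sketch, assuming a hypothetical record layout of (hour of day, source port, destination port) and the 30,000-flows-per-second rate quoted above:

# Hypothetical flow records: (hour_of_day, src_port, dst_port).
flows = [(0, 52344, 443), (9, 34001, 80)]   # in practice, billions of records

# Value-range check: TCP/UDP ports must lie in 0-65535.
bad_ports = [f for f in flows
             if not (0 <= f[1] <= 65535 and 0 <= f[2] <= 65535)]

# Coverage check: a 24-hour capture should touch every hour, not just 0-9.
missing_hours = set(range(24)) - {f[0] for f in flows}

# Quantity check: ~30,000 flows/second over 24 hours is about 2.6 billion flows.
expected = 30_000 * 24 * 60 * 60            # 2,592,000,000
print(len(bad_ports), sorted(missing_hours), expected - len(flows))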
B. Techniques

1) Jaccard scoring: Jaccard similarity for comparing two entities is defined as the cardinality of the intersection of the values of the two entities divided by the cardinality of the union of their values. An example is shown in Fig. 3, where V1 and V2 each represent the flow data from one hour, with the circles between them representing the eight most commonly visited Internet domain names, showing five entities in the intersection from a total of eleven entities in the union. The SME must judge whether that level of variability merits further investigation.

[Fig. 3. Example of Jaccard Scoring]

When analyzing Internet flow data, Jaccard scoring can be used to calculate the similarity between two time periods of the top 20 visited Internet domain names. The Jaccard calculation can be done either unweighted or weighted; e.g., for the top-20-domains case, weighting would mean that the large number of visits to (say) google.com is weighted more heavily than the small number of visits to another domain. Conversely, the weighting can be for rarity; e.g., if we want to know whether two people are similar, the fact that they both visit a domain that is rarely visited population-wide means more than the fact that they both visit a very common domain.
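For concreteness, a small sketch of both variants (hypothetical domain counts; one common weighted form is assumed, since the paper does not spell out its weighting formula):

def jaccard(a, b):
    """Unweighted Jaccard: |intersection| / |union| of the two domain sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def weighted_jaccard(counts_a, counts_b):
    """One common weighted form: sum of per-domain min counts over max counts."""
    domains = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(d, 0), counts_b.get(d, 0)) for d in domains)
    den = sum(max(counts_a.get(d, 0), counts_b.get(d, 0)) for d in domains)
    return num / den

# Hypothetical visit counts for the top domains in two hours of flow data.
hour1 = {"google.com": 900, "example.com": 40, "news.site": 25}
hour2 = {"google.com": 850, "example.org": 30, "news.site": 20}

print(jaccard(hour1, hour2))           # set overlap only
print(weighted_jaccard(hour1, hour2))  # heavily visited domains dominate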
2) Synthetic discretization with Jaccard scoring: The analytic applications discussed above are all examples where the analyst has subject matter expertise. Similarity scoring can also be used when the data is known but the meaning of the data is unknown; i.e., to tell us if two data sets are similar enough that sense-making can be skipped, thus semi-automating this critical step at the beginning of the analytic workflow. Within a semantic context, we can apply Jaccard scoring to the predicate and object types found in a data set. This requires two extensions of the canonical algorithm in section IV.A below. These extensions correct for very popular nodes and handle different types of nodes (categorical similarity).

3) Kolmogorov-Smirnov: Once similarity scoring demonstrates a high degree of similarity in the predicate- and object-type composition of two data sets, another approach that can be applied is non-parametric goodness-of-fit statistical comparison between degree distributions of subjects with identical semantic types, using tests such as Kolmogorov-Smirnov [7] or Mann-Whitney U. For example, for each semantic type, the degree distribution for each predicate type is generated for the old, analytically-proven dataset and the new, believed-similar dataset. These degree distributions are compared via non-parametric goodness-of-fit statistical tests. The success or failure for each semantic-type/predicate-type pair is noted and then the aggregate success and failure counts are presented to the analyst at the end of the evaluation.

[Fig. 4. Example of Kolmogorov-Smirnov Test]

In Fig. 4, we use Kolmogorov-Smirnov to compare two sets of flow data, revealing that the time (measured from GMT) of the activity is distinctly different, with the blue data reflecting the normal work day of eastern Europe and the red data reflecting that of the US East coast.
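A sketch of the per-predicate degree-distribution comparison described above (simplified to a single semantic type; SciPy's ks_2samp is used here as one readily available two-sample test, and the pass/fail threshold is an assumption, not the authors' choice):

from collections import defaultdict
from scipy.stats import ks_2samp

def degree_distributions(triples):
    """Predicate -> list of subject out-degrees, from (s, p, o) triples."""
    per_pred = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        per_pred[p][s] += 1
    return {p: list(degrees.values()) for p, degrees in per_pred.items()}

def compare(old_triples, new_triples, alpha=0.05):
    """Tally, per predicate type, whether the two degree distributions agree."""
    old_d = degree_distributions(old_triples)
    new_d = degree_distributions(new_triples)
    passed, failed = 0, 0
    for pred in set(old_d) & set(new_d):
        result = ks_2samp(old_d[pred], new_d[pred])
        if result.pvalue >= alpha:   # cannot reject "drawn from same distribution"
            passed += 1
        else:
            failed += 1
    return passed, failed            # aggregate counts shown to the analyst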
4) Graph Algorithms: In many cases further insight can be gained from semantic data by directly analyzing it as a graph. For instance, knowing a set of IP addresses and the volume of data between each pair of addresses, the betweenness centrality algorithm will calculate which IP addresses are most central, giving insight into which might be most important for covert communication, or most disruptive were they to be disabled. Other graph metrics that may be valuable are the length of the shortest paths between entities, the communities that emerge by analyzing the connectivity of the graph, and the most likely path between entities (vertices). Such analysis of social networks is well known, but any transactional network lends itself to similar analysis. In our work we have used community detection [10], betweenness centrality [3], and BadRank [4].

IV. IMPLEMENTATION WITH RDF/SPARQL

Much of our work to date has focused on the RDF data representation [9] and SPARQL query language [11] implemented on Cray's Urika-GD™ graph appliance.

A. Jaccard scoring

Both the intersection and union steps in Jaccard scoring map trivially to SPARQL constructs. In the source listing below, an almost-identical subquery is repeated 4 times, once for calculating the cardinality of each set in the intersection and once for calculating the cardinality of each set in the union. The first instance of that subquery is in lines 9-14. It focuses on the first set of data (L11) residing in a named graph g1, defined by the PREFIX statement (L2), and specifically the data denoted by relationship (edge-type) :myPred. Those instances are grouped by the subject, counted within each group, ordered by descending count, and then only the 20 highest-count subjects are retained. Those partial results are joined with the same from the second set by the enclosing query (L7-21), which counts the number of distinct resulting subjects. The other subquery (L22-37) similarly calculates the union, with the sole substantive difference (besides variable names) being the UNION keyword inserted (L30) between the two subsubqueries to denote that results that appear in either subsubquery should be retained. Finally the outer query (L5-38) uses the numerator and denominator values to calculate the final Jaccard score.

 1  PREFIX :    <...>
 2  PREFIX g1:  <...>
 3  PREFIX g2:  <...>
 4
 5  SELECT (?num/?denom AS ?jaccard)
 6  WHERE {
 7    { SELECT (COUNT(DISTINCT ?s) AS ?num)
 8      WHERE {
 9        { SELECT ?s (COUNT(?s) AS ?s1Ct)
10          WHERE {
11            GRAPH g1: { ?s :myPred ?o1 }
12          } GROUP BY ?s ORDER BY DESC(?s1Ct)
13          LIMIT 20
14        }
15        { SELECT ?s (COUNT(?s) AS ?s2Ct)
16          WHERE {
17            GRAPH g2: { ?s :myPred ?o2 }
18          } GROUP BY ?s ORDER BY DESC(?s2Ct)
19          LIMIT 20
20        }
21    } }
22    { SELECT (COUNT(DISTINCT ?s) AS ?denom)
23      WHERE {
24        { SELECT ?s (COUNT(?s) AS ?s1Ct)
25          WHERE {
26            GRAPH g1: { ?s :myPred ?o1 }
27          } GROUP BY ?s ORDER BY DESC(?s1Ct)
28          LIMIT 20
29        }
30        UNION
31        { SELECT ?s (COUNT(?s) AS ?s2Ct)
32          WHERE {
33            GRAPH g2: { ?s :myPred ?o2 }
34          } GROUP BY ?s ORDER BY DESC(?s2Ct)
35          LIMIT 20
36        }
37    } }
38  }

B. Graph Algorithms

Graph algorithms can be executed via SPARQL in two ways. The first is to implement the algorithm in SPARQL, often by a succession of queries [12], with the benefit of having great flexibility to implement the algorithm and the downside of needing to implement the algorithm oneself. The second, for a small set of algorithms on Urika-GD™, is to call a built-in graph function (BGF) in Summer 2014 or later releases [8], with the benefit of ease of execution and optimized performance, for the supported functions. The query below illustrates a simple call to the community-detection function. The INVOKE function calls the designated BGF and the PRODUCING keyword assigns the results to SPARQL variables.

 1  CONSTRUCT {
 2    ?v1 ?count ?v2
 3  } WHERE {
 4    SELECT ?v1 ?v2 (COUNT(?v2) as ?count)
 5    WHERE {
 6      ?v1 :myPred ?v2
 7    } GROUP BY ?v1 ?v2
 8  } INVOKE yda:community()
 9  PRODUCING ?v ?communityID

V. CONCLUSION

We have used an analytic approach successfully with several different customers requiring discovery in highly diverse Big Data. The approach grows from initially no knowledge of the data to eventual deep knowledge, by enabling an analyst to interact directly with the data and apply her domain knowledge and intuition. We have implemented this approach with RDF and SPARQL on Cray's Urika-GD™ graph-analytic appliance. We believe this approach represents one way of addressing the critical challenge of quickly discovering deep knowledge in highly diverse Big Data.

REFERENCES

[1] S. Al-Saffar et al., Structure Discovery in Large Semantic Graphs using Extant Ontological Scaling and Descriptive Semantics, Int'l Conferences on Web Intelligence and Intelligent Agent Technology, IEEE, 2011.
[2] M. J. Franklin, Making Sense of Big Data with the Berkeley Data Analytics Stack, Proceedings of the 25th International Conference on Scientific and Statistical Database Management, ACM, 2013.
[3] L. C. Freeman, A set of measures of centrality based on betweenness, Sociometry, 40, pp. 35-41, 1977.
[4] T. G. Kolda et al., Generalized BadRank with graduated trust, Technical Report SAND2009-6670, Sandia National Labs, Albuquerque, NM, 2009.
[5] T. Mattson et al., Standards for graph algorithm primitives, High Performance Extreme Computing Conference, IEEE, 2013, pp. 1-2.
[6] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Impr. Corbaz, 1901.
[7] F. J. Massey, The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association, 1951.
[8] D. Mizell et al., Extending SPARQL with Graph Functions, High Performance Big Graph Data Management, Analysis, and Mining, IEEE, 2014.
[9] RDF Working Group, Resource Description Framework (RDF), 2004.
[10] E. J. Riedy et al., Parallel Community Detection for Massive Graphs, Lecture Notes in Computer Science, Volume 7203, pp. 286-296, 2012.
[11] S. Harris et al., SPARQL 1.1 Query Language (W3C Recommendation 21 March 2013), http://www.w3.org/TR/sparql11-query/, 2013.
[12] R. Techentin et al., Implementing Iterative Algorithms with SPARQL, Third International Workshop on Querying Graph Structured Data, 2014.