An Analytic Approach for Discovery

Eric Dull and Steven P. Reinhardt
Cray, Inc.
{edull,spr}@cray.com

Abstract—With the widespread awareness of Big Data, mission leaders now expect that the data available to their organizations will be immediately relevant to their missions. However, the continuing onslaught of the "data tsunami", with data becoming more diverse, changing nature more quickly, and growing in volume and speed, confounds all but the simplest analysis and the most capable organizations. The core challenge faced by an analyst is to discover the most important knowledge in the data. She must overcome potential errors and inaccuracies in the data and explore it, even though she understands it incompletely, guided by her knowledge of the data's domain. We have solved customer problems by quickly analyzing numerous dimensions of data to check its sanity and to compare it to expected values. Guided by what we find initially, we quickly move on to further (unanticipated) dimensions of the data; discovery depends on this ability. This approach vitally brings the analyst into direct interaction with the data. We implement this approach by exploiting the ability of graphs (vertices and edges) to represent highly heterogeneous data. We use RDF as the data representation and SPARQL for queries. We use non-parametric tests, which are readily targeted to any type of data. This graph-analytic approach, using proven techniques with widely diverse data, represents a guidepost for delivering greatly improved analysis on much more complex data.

I. INTRODUCTION

Analyst groups are being strongly challenged to understand quickly the insights latent in their organization's data, despite its diversity, changing nature, volume, and speed. We focus on the discovery aspect of analysis, where the analyst cannot rely on techniques that have run previously on the given datasets, but rather must explore within the data in a way that cannot be predicted. The organization expects that the analyst will quickly uncover the most important knowledge in the data. Using her knowledge of the data's domain, she must explore the data despite its potential imperfections.

If we define discovery as finding a connection in the data of which the analyst was previously unaware, it requires more than just delivering existing capabilities a little faster or more easily used. Rather, it requires enabling a subject-matter expert (SME), i.e., the person with the most knowledge and intuition about the data, to explore the data quickly, and even, by appropriate pre-analysis, pulling the SME's eye to aspects of the data that are likely to be most fruitful for further exploration. We have evolved an analytic approach to discovery, implemented in the semantic technologies RDF and SPARQL, that enables rapid progress through analytic questions toward an analytic goal. We have solved high-value customer problems by quickly analyzing numerous dimensions of data to check its sanity and to compare it to our expectations, then moving on to further dimensions of the data guided by what we find initially. Sometimes we analyze data and compare it to our mental expectations. Other times we compare data from one subset (place, time, etc.) to that from another subset, to see if they differ in unexpected ways. Sometimes we analyze the values of the data and other times the connectivity of the data. Sometimes we create a synthetic grid or discretization, say of geospace and time, to represent the data for fast comparison. Discovery depends on being able to compare dimensions quickly, without knowing in advance the dimensions to compare. Bringing the SME into direct interaction with the data is essential to accelerating discovery.

We implement this analytic approach by exploiting the ability of graphs (vertices and edges) to represent the richness of highly heterogeneous data and enable discovery within it. To date, we use RDF as the data representation and SPARQL for queries, queries that can build on each other to focus rapidly on the highest-value knowledge (in the estimation of the subject-matter expert) in the data. (We use heterogeneous rather than semantic, because this approach is not limited to natural language.) RDF supports complex, dynamic ontologies, though that adds a burden of discovering the current ontology, which we often achieve by summary graphs of vertex-types and the edge-types that connect them. We use Jaccard scoring [6] and non-parametric tests (typically Kolmogorov-Smirnov [7]), which are readily targeted to any type of data when guided by a SME. Other non-parametric tests could easily be used in place of these. RDF and SPARQL are one data format and language that support implementation of this approach, but it may be implemented with other technologies, such as Spark/GraphX from the Berkeley Data Analytics Stack [2].

The graph-analytic approach, using proven techniques with widely diverse data, represents a guidepost for delivering greatly better analysis on much more complex data.
II. ADDRESSING THE CORE CHALLENGE

[Fig. 1. High-level Workflow]

The core challenge facing many analytic organizations, illustrated in Figure 1, has at its center a data repository, with both new instances of existing types of data and instances of new types of data flowing into it. Results from existing analytics must flow out to meet existing production needs, while new analytics must be created to discover further knowledge in the data. This paper focuses on this discovery process, based on our experience working with a customer needing to understand traffic and vulnerabilities on its corporate network. In that context, some key entities are IP addresses, ports, protocols, and Internet domain names. Section III.B provides an example of these techniques applied to flow data.

III. ANALYTIC APPROACHES FOR DISCOVERY

We start from the point of view of an analyst with little knowledge about a body of data and proceed to the point of deep knowledge about the data. Many of the techniques we describe are useful at multiple points along this continuum.

Analysis is an iterative process that consists of framing the analytic problem, defining an approach, gathering the applicable data, understanding the biases present in the data (via sanity-checking and sense-making), applying analytic techniques consistent with the approach, determining the analytic results that answer the problem, and documenting the answer to the analytic problem.

A. Sense-making

When an analyst first gets a new corpus of data, even new instances of an existing corpus, the first task is to ascertain the sanity of the data, and then understand it deeply enough to enable further analysis. In practice, an initial sanity-check on data is often necessary, but even then errors in the data may still surface as analysis becomes more precise. For instance, if the data claims to cover the 24 hours of a day, but only has data in the hours 0 through 9, there may have been an error in the software that generates the data. In a heterogeneous context, if flow-data records use IP addresses that are not found in the firewall records, it may be that the data is truly disjoint, or it may be that the IP addresses in the flow data and firewall data have not been constructed the same and hence do not match. In either case, the anomaly needs to be understood before continuing with the data.

[Fig. 2. Discovered Ontology]

With heterogeneous data, sense-making includes discovering the ontology (or schema, in relational database terms) in which the data is represented, as that ontology is not explicitly defined anywhere [1]. A summary graph depiction of the ontology is shown in Fig. 2, with vertices representing the types of objects in the data. The edges summarize the edges between instances of the types of objects that are the subject and object of triples in the data; edge thickness represents the number of edges between the two types. While this figure is simpler than many real-world ontologies, it illustrates how the structure of the data can be readily represented to aid an analyst's understanding. Note that the visualization of the ontology as a graph is natural; whether the underlying representation is graph-based or not is immaterial at this level.
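As an illustration of how such a summary graph can be derived, the sketch below (ours, not the authors' Urika-GD implementation; the triples, the type map, and all names are hypothetical) counts how many edges of each predicate connect each pair of vertex types:

from collections import Counter

# Hypothetical RDF-style triples and a vertex -> type map (in practice these
# would come from the data set and its type assertions).
triples = [
    ("ip:10.0.0.1", "flow:contacted", "dom:example.com"),
    ("ip:10.0.0.1", "flow:usedPort",  "port:443"),
    ("ip:10.0.0.2", "flow:contacted", "dom:example.org"),
]
vertex_type = {
    "ip:10.0.0.1": "IPAddress", "ip:10.0.0.2": "IPAddress",
    "dom:example.com": "Domain", "dom:example.org": "Domain",
    "port:443": "Port",
}

# Summary graph: (subject type, predicate, object type) -> number of edges.
# Rendering these counts as edge thickness gives a picture like Fig. 2.
summary = Counter((vertex_type[s], p, vertex_type[o]) for s, p, o in triples)

for (s_type, pred, o_type), count in sorted(summary.items()):
    print(f"{s_type} -[{pred} x{count}]-> {o_type}")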
A next level of sense-making analyzes the values of a given type of edge. For instance, flow data is often discretized such that a flow cannot last more than an hour. Similarly, TCP/UDP ports are limited to 0-65535. If the data has values outside those ranges, it is suspect. (Additionally, the analyst must ensure that the right units (e.g., English v. metric) are being used.) Not all types of edges will have values that will be readily known by the analyst, but for those that are, this can be a simple way of surfacing the subject-matter knowledge of the expert, by showing the values in the data and letting the analyst respond if the data looks suspect. Beyond correctness, understanding the semantic meaning of a field, e.g., a timestamp, is important. What is the format of the timestamp (absolute, relative)? What is its precision? When was it collected? What specific event is denoted by the timestamp?

Once the basic structure of the data and its values are understood, other questions arise, such as the quantity of the data and whether that indicates that we have all the data we expected or not. E.g., if the data purports to capture all flow data from a 24-hour period, and we know there are about 30,000 flows per second, we should have about 2.6 billion flows. If we do not, we may be missing data, the data may be summarized, or there is some other effect; in any case, the analyst needs to understand the anomaly before proceeding.

Understanding the data values in more detail is another part of sense-making. What are the dominant moments, and, often as importantly, what are the outliers? Analysts know that real-world data is noisy and messy, so they will want to avoid actual noise but at the same time want to understand rare but real events accurately. In addition to looking at the total data, we often gain insight by comparing data from entities selected from different ranges in some important dimension, like place or time, to see whether the values in a given dimension also vary; e.g., are the top N most frequent external domains in Internet flow data from the day before yesterday and yesterday similar? Or, is the distribution of bytes per packet the same for systems in Europe as in North America?

This phase of the workflow is critical to an analytic process with meaningful, repeatable results. Performing it requires tools, methods, and (most significantly for time-sensitive analysis) time at the beginning of the analytic process.
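The value-range and quantity checks described above are simple to mechanize. A minimal sketch, assuming a hypothetical record layout of (hour of day, source port, destination port) and the 30,000-flows-per-second rate quoted above:

# Hypothetical flow records: (hour_of_day, src_port, dst_port).
flows = [(0, 52344, 443), (9, 34001, 80)]   # in practice, billions of records

# Value-range check: TCP/UDP ports must lie in 0-65535.
bad_ports = [f for f in flows
             if not (0 <= f[1] <= 65535 and 0 <= f[2] <= 65535)]

# Coverage check: a 24-hour capture should touch every hour, not just 0-9.
missing_hours = set(range(24)) - {f[0] for f in flows}

# Quantity check: ~30,000 flows/second over 24 hours is about 2.6 billion flows.
expected = 30_000 * 24 * 60 * 60            # 2,592,000,000
print(len(bad_ports), sorted(missing_hours), expected - len(flows))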
B. Techniques

1) Jaccard scoring: Jaccard similarity for comparing two entities is defined as the cardinality of the intersection of the values of the two entities divided by the cardinality of the union of their values. An example is shown in Fig. 3, where V1 and V2 each represent the flow data from one hour, with the circles between them representing the eight most commonly visited Internet domain names, showing five entities in the intersection from a total of eleven entities in the union. The SME must judge whether that level of variability merits further investigation.

[Fig. 3. Example of Jaccard Scoring]

When analyzing Internet flow data, Jaccard scoring can be used to calculate the similarity between two time periods of the top 20 visited Internet domain names. The Jaccard calculation can be done either unweighted or weighted; e.g., for the top-20-domains case, weighting would mean that the large number of visits to (say) google.com is weighted more heavily than the small number of visits to another domain. Conversely, the weighting can be for rarity; e.g., if we want to know whether two people are similar, the fact that they both visit a domain that is rarely visited population-wide means more than the fact that they both visit a very common domain.
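For concreteness, a small sketch of both variants (hypothetical domain counts; one common weighted form is assumed, since the paper does not spell out its weighting formula):

def jaccard(a, b):
    """Unweighted Jaccard: |intersection| / |union| of the two domain sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def weighted_jaccard(counts_a, counts_b):
    """One common weighted form: sum of per-domain min counts over max counts."""
    domains = set(counts_a) | set(counts_b)
    num = sum(min(counts_a.get(d, 0), counts_b.get(d, 0)) for d in domains)
    den = sum(max(counts_a.get(d, 0), counts_b.get(d, 0)) for d in domains)
    return num / den

# Hypothetical visit counts for the top domains in two hours of flow data.
hour1 = {"google.com": 900, "example.com": 40, "news.site": 25}
hour2 = {"google.com": 850, "example.org": 30, "news.site": 20}

print(jaccard(hour1, hour2))           # set overlap only
print(weighted_jaccard(hour1, hour2))  # heavily visited domains dominate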
2) Synthetic discretization with Jaccard scoring: The analytic applications discussed above are all examples where the analyst has subject matter expertise. Similarity scoring can also be used when the data is known but the meaning of the data is unknown; i.e., to tell us if two data sets are similar enough that sense-making can be skipped, thus semi-automating this critical step at the beginning of the analytic workflow. Within a semantic context, we can apply Jaccard scoring to the predicate and object types found in a data set. This requires two extensions of the canonical algorithm in section IV.A below. These extensions correct for very popular nodes and handle different types of nodes (categorical similarity).

3) Kolmogorov-Smirnov: Once similarity scoring demonstrates a high degree of similarity in the predicate- and object-type composition of two data sets, another approach that can be applied is non-parametric goodness-of-fit statistical comparison between degree distributions of subjects with identical semantic types, using tests such as Kolmogorov-Smirnov [7] or Mann-Whitney U. For example, for each semantic type, the degree distribution for each predicate type is generated for the old, analytically-proven dataset and the new, believed-similar dataset. These degree distributions are compared via non-parametric goodness-of-fit statistical tests. The success or failure for each semantic-type/predicate-type pair is noted and then the aggregate success and failure counts are presented to the analyst at the end of the evaluation.

[Fig. 4. Example of Kolmogorov-Smirnov Test]

In Fig. 4, we use Kolmogorov-Smirnov to compare two sets of flow data, revealing that the time (measured from GMT) of the activity is distinctly different, with the blue data reflecting the normal work day of eastern Europe and the red data reflecting that of the US East coast.
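A sketch of the per-predicate degree-distribution comparison described above (simplified to a single semantic type; SciPy's ks_2samp is used here as one readily available two-sample test, and the pass/fail threshold is an assumption, not the authors' choice):

from collections import defaultdict
from scipy.stats import ks_2samp

def degree_distributions(triples):
    """Predicate -> list of subject out-degrees, from (s, p, o) triples."""
    per_pred = defaultdict(lambda: defaultdict(int))
    for s, p, o in triples:
        per_pred[p][s] += 1
    return {p: list(degrees.values()) for p, degrees in per_pred.items()}

def compare(old_triples, new_triples, alpha=0.05):
    """Tally, per predicate type, whether the two degree distributions agree."""
    old_d = degree_distributions(old_triples)
    new_d = degree_distributions(new_triples)
    passed, failed = 0, 0
    for pred in set(old_d) & set(new_d):
        result = ks_2samp(old_d[pred], new_d[pred])
        if result.pvalue >= alpha:   # cannot reject "drawn from same distribution"
            passed += 1
        else:
            failed += 1
    return passed, failed            # aggregate counts shown to the analyst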
4) Graph Algorithms: In many cases further insight can be gained from semantic data by directly analyzing it as a graph. For instance, knowing a set of IP addresses and the volume of data between each pair of addresses, the betweenness centrality algorithm will calculate which IP addresses are most central, giving insight into which might be most important for covert communication, or most disruptive were they to be disabled. Other graph metrics that may be valuable are the length of the shortest paths between entities, the communities that emerge by analyzing the connectivity of the graph, and the most likely path between entities (vertices). Such analysis of social networks is well known, but any transactional network lends itself to similar analysis. In our work we have used community detection [10], betweenness centrality [3], and BadRank [4].

IV. IMPLEMENTATION WITH RDF/SPARQL

Much of our work to date has focused on the RDF data representation [9] and SPARQL query language [11] implemented on Cray's Urika-GD™ graph appliance.

A. Jaccard scoring

Both the intersection and union steps in Jaccard scoring map trivially to SPARQL constructs. In the source listing below, an almost-identical subquery is repeated 4 times, once for calculating the cardinality of each set in the intersection and once for calculating the cardinality of each set in the union. The first instance of that subquery is in lines 9-14. It focuses on the first set of data (L11) residing in a named graph g1, defined by the PREFIX statement (L2), and specifically the data denoted by relationship (edge-type) :myPred. Those instances are grouped by the subject, counted within each group, ordered by descending count, and then only the 20 highest-count subjects are retained. Those partial results are joined with the same from the second set by the enclosing query (L7-21), which counts the number of distinct resulting subjects. The other subquery (L22-37) similarly calculates the union, with the sole substantive difference (besides variable names) being the UNION keyword inserted (L30) between the two subsubqueries to denote that results that appear in either subsubquery should be retained. Finally the outer query (L5-38) uses the numerator and denominator values to calculate the final Jaccard score.

 1  PREFIX :    <...>
 2  PREFIX g1:  <...>
 3  PREFIX g2:  <...>
 4
 5  SELECT (?num/?denom AS ?jaccard)
 6  WHERE {
 7    { SELECT (COUNT(DISTINCT ?s) AS ?num)
 8      WHERE {
 9        { SELECT ?s (COUNT(?s) AS ?s1Ct)
10          WHERE {
11            GRAPH g1: { ?s :myPred ?o1 }
12          } GROUP BY ?s ORDER BY DESC(?s1Ct)
13          LIMIT 20
14        }
15        { SELECT ?s (COUNT(?s) AS ?s2Ct)
16          WHERE {
17            GRAPH g2: { ?s :myPred ?o2 }
18          } GROUP BY ?s ORDER BY DESC(?s2Ct)
19          LIMIT 20
20        }
21    } }
22    { SELECT (COUNT(DISTINCT ?s) AS ?denom)
23      WHERE {
24        { SELECT ?s (COUNT(?s) AS ?s1Ct)
25          WHERE {
26            GRAPH g1: { ?s :myPred ?o1 }
27          } GROUP BY ?s ORDER BY DESC(?s1Ct)
28          LIMIT 20
29        }
30        UNION
31        { SELECT ?s (COUNT(?s) AS ?s2Ct)
32          WHERE {
33            GRAPH g2: { ?s :myPred ?o2 }
34          } GROUP BY ?s ORDER BY DESC(?s2Ct)
35          LIMIT 20
36        }
37    } }
38  }

B. Graph Algorithms

Graph algorithms can be executed via SPARQL in two ways. The first is to implement the algorithm in SPARQL, often by a succession of queries [12], with the benefit of having great flexibility to implement the algorithm and the downside of needing to implement the algorithm oneself. The second, for a small set of algorithms on Urika-GD™, is to call a built-in graph function (BGF) in Summer 2014 or later releases [8], with the benefit of ease of execution and optimized performance, for the supported functions. The query below illustrates a simple call to the community-detection function. The INVOKE function calls the designated BGF and the PRODUCING keyword assigns the results to SPARQL variables.

 1  CONSTRUCT {
 2    ?v1 ?count ?v2
 3  } WHERE {
 4    SELECT ?v1 ?v2 (COUNT(?v2) as ?count)
 5    WHERE {
 6      ?v1 :myPred ?v2
 7    } GROUP BY ?v1 ?v2
 8  } INVOKE yda:community()
 9  PRODUCING ?v ?communityID

V. CONCLUSION

We have used an analytic approach successfully with several different customers requiring discovery in highly diverse Big Data. The approach grows from initially no knowledge of the data to eventual deep knowledge, by enabling an analyst to interact directly with the data and apply her domain knowledge and intuition. We have implemented this approach with RDF and SPARQL on Cray's Urika-GD™ graph-analytic appliance. We believe this approach represents one way of addressing the critical challenge of quickly discovering deep knowledge in highly diverse Big Data.

REFERENCES

[1] S. Al-Saffar et al., Structure Discovery in Large Semantic Graphs using Extant Ontological Scaling and Descriptive Semantics, Int'l Conferences on Web Intelligence and Intelligent Agent Technology, IEEE, 2011.
[2] M. J. Franklin, Making Sense of Big Data with the Berkeley Data Analytics Stack, Proceedings of the 25th International Conference on Scientific and Statistical Database Management, ACM, 2013.
[3] L. C. Freeman, A set of measures of centrality based on betweenness, Sociometry, 40, pp. 35-41, 1977.
[4] T. G. Kolda et al., Generalized BadRank with graduated trust, Technical Report SAND2009-6670, Sandia National Labs, Albuquerque, NM, 2009.
[5] T. Mattson et al., Standards for graph algorithm primitives, High Performance Extreme Computing Conference, IEEE, 2013, pp. 1-2.
[6] P. Jaccard, Étude comparative de la distribution florale dans une portion des Alpes et des Jura, Impr. Corbaz, 1901.
[7] F. J. Massey, The Kolmogorov-Smirnov test for goodness of fit, Journal of the American Statistical Association, 1951.
[8] D. Mizell et al., Extending SPARQL with Graph Functions, High Performance Big Graph Data Management, Analysis, and Mining, IEEE, 2014.
[9] RDF Working Group, Resource Description Framework (RDF), 2004.
[10] E. J. Riedy et al., Parallel Community Detection for Massive Graphs, Lecture Notes in Computer Science, Volume 7203, pp. 286-296, 2012.
[11] S. Harris et al., SPARQL 1.1 Query Language (W3C Recommendation 21 March 2013), http://www.w3.org/TR/sparql11-query/, 2013.
[12] R. Techentin et al., Implementing Iterative Algorithms with SPARQL, Third International Workshop on Querying Graph Structured Data, 2014.