Proceedings of the 11th International Conference on Applied Informatics
Eger, Hungary, January 29–31, 2020, published at http://ceur-ws.org

Are Graph Databases Fast Enough for Static P4 Code Analysis?*

Dániel Lukács^a, Gergely Pongrácz^b, Máté Tejfel^c

a Faculty of Informatics, Eötvös Loránd University, Budapest, Hungary
  dlukacs@inf.elte.hu, ORCID: 0000-0001-9738-1134
b Ericsson Hungary Ltd., Budapest, Hungary
  Gergely.Pongracz@ericsson.com, ORCID: 0000-0002-5115-9973
c 3in Research Group, Eötvös Loránd University, Martonvásár, Hungary
  matej@inf.elte.hu, ORCID: 0000-0002-5115-9973

Abstract

P4 is a new protocol-programming language for smart switches and virtual network functions linked up in vast software defined networks (SDNs). Our current work in progress is focused on analysing the execution cost of P4 programs using hierarchical control flow graphs (CFGs). We found that cost analysis of P4 requires storing and working with diverse information about various execution environments. In theory, versatile graph backends, such as graph databases (GDBs), could provide a unifying representation in which we can store, access and traverse the CFGs and the environment information together. On the other hand, analysis efficiency is a requirement both for large-scale testing and end-user application. Unfortunately, we cannot conclusively assess – without trying it out in practice – whether GDB disk and memory reads will ruin our performance or not.

* The research has been supported by the European Union, co-financed by the European Social Fund (EFOP-3.6.2-16-2017-00013, Thematic Fundamental Research Collaborations Grounding Innovation in Informatics and Infocommunications).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
In this work, we briefly detail how we realized P4 control flow analysis based on GDBs, and present our measurements to conclude whether the overhead inherent in GDBs is justified for our purposes.

Keywords: static analysis, P4, cost analysis, graph database, software defined networking

1. Introduction

Softwarization of modern networks introduces new challenges for tools supporting agile development and analysis of new protocols. P4 is a relatively new programming language targeting network switches [2]. As the overall functioning of the network requires switches to process packets with an acceptable latency, it is important for switches, and for the P4 programs (protocols) executed by the switches, to execute as fast as possible. The overall goal of our research is to develop a static cost analysis framework for P4 to assist network engineers in estimating the execution cost of a P4 program during development, well before deployment.

To perform cost analysis, we take as input the P4 program code and an appropriate model of the target environment. Then, we transform this information into an abstract cost formula. Missing runtime information is represented in the formula as variables. In turn, the derived cost formula can be used in many ways: the cost of a P4 program can be inferred given an existing hardware configuration, or, inversely, we can infer the set of configurations that satisfy a given cost constraint.

This paper has two contributions. First, we reformulated our earlier cost analysis algorithm (see below) as a graph query to enable the reuse of established graph database (GDB) services (Section 3). Second, we implemented this query, along with our earlier algorithm in Java, and performed experiments to measure the performance of the two cost analyses on real-world P4 programs (Section 4). Here, our intention is to establish the feasibility and limits of performing P4 cost analysis real-time, i.e.
during development, or possibly, at runtime of P4 programs. The results suggest that – despite the overhead – graph databases are viable platforms for implementing cost analysis.

1.1. Motivation

The goals of this paper are motivated by practical – instead of theoretical – concerns. Modern software engineering involves both transactional and analytical operations: multiple developers armed with their IDEs asynchronously work together on a large code base, which is cyclically inspected (augmented with static analysis and verification tools) for testing and code review purposes. In the hope that we can innovate industrial-scale software – not “just” an algorithm –, we intend to assess whether we can reuse common database features (instead of reinventing the wheel) without significant losses in runtime efficiency. Among these features are scalability (through ease of multi-core or multi-machine parallelization), multi-user concurrency (transaction processing, rollback, etc.), fault tolerance, and security features.

Cost analysis has exponential complexity. For this reason, we rely on empirical measurements: we expect our algorithm to be near-instantaneous for programs with a small number of execution paths, but we also test the analysis on programs with a large number of execution paths to illustrate usability limits.

1.2. Earlier work

Our approach to cost analysis of P4 programs is based on control flow graph (CFG) path analysis [3]. In our earlier work, we introduced an approach that, in essence, enumerates all execution paths π in a CFG g, and derives its average execution cost as

E(g) = Σ_{π ∈ paths(g)} P(π) · cost(π).

That is, the average execution cost is the sum of the execution costs of the individual paths, weighted by the probability of those paths occurring. In the main section of that work, we managed to decompose path costs into compositions of conditional expectations that encode the dependence of a program statement on the effect of previous statements.
With this, it became possible to analyse the components separately. Moreover, this step formalized the input requirements of our analysis either as conditional cost formulas, or as models from which these formulas can be derived.

In that work, we also presented a procedural algorithm to calculate the above formula given a CFG. The algorithm, starting from the root, traverses the nodes of each CFG path in a breadth-first manner, collecting both the cost and probability information and the conditions node-by-node on each path. The algorithm also features an optimization: many CFG paths share the same prefix, and the traversal avoids processing these shared segments multiple times.

In Section 3, we reformulate this algorithm as a GDB query in Gremlin. Precisely comparing the two notations is difficult, since in Gremlin we only program one traverser, which is automatically split by Gremlin when it meets a branching point in the CFG. Without actually looking behind the abstraction and inspecting the GDB implementation, we can only guess that the query denotes the same behavior. On the other hand, this abstract formulation has advantages in itself, beyond the GDB features listed earlier. It generalizes our original algorithm to CFGs represented with different schemas, and it is ready to be parallelized, specifically at those points where the traverser meets a branching point. In Section 4 of this work, we use empirical measurements to argue that these benefits can be achieved without a significant loss in efficiency.

2. Related work

A question similar to our title was also asked by [7] about the applicability of GDBs for static analysis (SA) in general. The authors emphasize the advantage of schema-less GDBs over RDBMSs in addressing cross-cutting concerns in SA queries.
By extending vertices and edges with new properties, data flow, CFG, call graph, type hierarchy, etc. information can be overlaid on the same graph structure (usually the syntax tree). In turn, these overlays can be queried using one universal graph query. The authors also tested the practical feasibility of GDBs for SA: they used Neo4J to store overlays on 2 million lines of Java source code and formulated graph queries in the declarative Cypher language. They found that loading took around 10 times longer than javac compilation, that queries took 3–5 times longer with a cold (as opposed to warm) cache, and that query response time scaled linearly with program size. As even their slowest query took around 10 seconds on this large code base, they conclude that GDBs are indeed feasible for representing static analysis data structures.

In this paper, we implemented our algorithm as a graph query in Gremlin [6]. Gremlin is a compositional programming language embedded in Java (and many other languages) for describing graph traversals. In Gremlin, complex traversals are realized as compositions of elementary traversals, called steps (vertex and edge selections, conditions, repetitions, etc.). The language-specific Gremlin traversal is compiled to the language-agnostic bytecode of the Gremlin traversal machine. This traversal machine interacts with the graph database backends (providers) through a common graph API. One important feature of Gremlin is that it supports many graph providers: very different providers can be exchanged without any modification of the Gremlin queries. The survey in [1] provides a comprehensive classification of the most important graph backends as of 2019.

Name                       | In-memory | Data model  | Lightweight edges | Single query parallelization
---------------------------|-----------|-------------|-------------------|-----------------------------
TinkerGraph-3.4.4          | Yes       | Native      | Yes               | No
Neo4J-0.7-3.2.3            | No        | Native      | Yes               | No
OrientDB-3.1.0M3           | No        | Document    | Yes               | Yes
JanusGraph-Cassandra-0.4.0 | No        | Wide-column | No                | Yes

Table 1: Characterization of GDB providers
In this work, we only utilize a fraction of these. (TinkerGraph – intended mostly for testing purposes – is not in the survey.) Table 1 lists some of the features of the GDB providers we tested. Apache TinkerGraph is the only in-memory GDB; the other GDBs have to cache persisted data as part of the queries. All providers are NoSQL databases: unfortunately, we could not find RDBMS-based GDBs that are compatible with modern Gremlin APIs. Native GDBs are systems that were built specifically to store graphs, as opposed to systems that “retrofit” other NoSQL data models, such as document-based or wide-column DBs. In practice, this means that native GDBs use efficient, close-to-machine implementations of the typical graph representations (adjacency lists, edge lists, etc.). This does not mean non-native data models cannot represent graphs efficiently. OrientDB, for example, supports lightweight edges: this means that vertices may store pointers to each other’s location (as opposed to storing a vertex index, on which a costly lookup is performed). While some GDBs support utilization of multiple cores for a single query, we used each GDB in single-core mode to aid comparability.

3. Experiment design

To answer the presented question, we selected one small and one large example P4 CFG, ran the algorithm in various configurations, and measured runtime. In this section, we describe these examples and algorithms.

3.1. P4 use cases

For profiling the algorithm, we selected two publicly available P4 programs, and then selected one CFG from each. In this section, we describe the selected components, as well as our motivation for selecting these instead of other components. We collected the graph characteristics of the program components influencing our choices into Table 2.

The small example – chosen for testing and illustration purposes – is the ingress component from basic_routing-bmv2.p4 [4], depicted in Figure 1.
This switch program essentially describes a simple L3 IP router, and was found among the test cases of the official P4 reference compiler, P4C. The simplicity of P4 is well illustrated by the fact that ingress is the most complicated component of this L3 switch program.

This component is executed on a packet that was previously parsed into the structure named hdr, which stores the IPv4 header in its ipv4 field. First, it declares various match-action tables (packet matching algorithms) and defines actions (packet transformation algorithms) that are executed on matching packets. Tables basically map packets to actions. For terseness, we omitted the full declarations and definitions. Note that tables are usually not defined in the P4 programs: these are received from the SDN controller during runtime. Then, the program describes the actual control flow. The packet is to be matched by tables port_mapping, bd, and ipv4_fib. In case that last table mapped the packet to action on_miss, ipv4_fib_lpm is applied as well. Finally, table nexthop is applied.

As a real-world, large-size use case for conducting plausible measurements, we chose the egress component of switch.p4 [5]. This switch program is capable of handling all of the most important protocols used in L2/L3 data planes. Unfortunately, it was written in version 14 of the P4 language (P4_14), so we had to use P4C to transpile it to the current version (P4_16). It is also unmaintained, so an older version of P4C is required for this process.

Name         | |V| | |E| | Choice | Path prefixes | Full paths   | Min | Max | Avg   | Cost^1
basic_routing-bmv2.p4
  ParserImpl  |   4 |   3 |      1 |             3 |            2 |   1 |   2 | 1.5   | 1.5
  ingress     |   8 |   9 |      2 |            11 |            3 |   2 |   7 | 5     | 4.25
  egress      |   3 |   2 |      0 |             2 |            1 |   2 |   2 | 2     | 2
switch.p4
  ParserImpl  |  50 | 133 |     24 |         15860 |         7477 |   2 | 20* | 12.16 | 4.15
  ingress     | 123 | 180 |     58 |             ? | 204750249984 |  30 |   ? | ?     | ?
  egress      |  67 |  90 |     24 |       1015381 |       374547 |   3 |  62 | 44.77 | 18.0625

Table 2: Component CFG characteristics of two P4 programs

^1 For these measurements, we used a simplified cost model, and so the cost column displays the average number of basic blocks visited when traversing from entry to exit.

control ingress(inout headers hdr,
                inout metadata meta,
                inout standard_metadata_t stm) {
    action set_vrf(bit<12> vrf) { ... }
    action on_miss() { ... }
    action fib_hit_nexthop(bit<16> nexthop_index) { ... }
    action set_egress_details(bit<9> egress_spec) { ... }
    action set_bd(bit<16> bd) { ... }
    table bd { ... }
    table ipv4_fib { ... }
    table ipv4_fib_lpm { ... }
    table nexthop { ... }
    table port_mapping { ... }

    apply {
        if (hdr.ipv4.isValid()) {
            port_mapping.apply();
            bd.apply();
            switch (ipv4_fib.apply().action_run) {
                on_miss: {
                    ipv4_fib_lpm.apply();
                }
            }
            nexthop.apply();
        }
    }
}

[CFG of ingress: __START__ → hdr.ipv4.isValid(); on TRUE → ingress.port_mapping → ingress.bd → ingress.ipv4_fib → (on_miss_1/on_miss) ingress.ipv4_fib_lpm → ingress.nexthop, or (default) directly → ingress.nexthop → __EXIT__; on FALSE → __EXIT__]

Figure 1: Source code and CFG of ingress control from basic_routing-bmv2.p4 (created using P4C)

Another candidate from switch.p4 was ingress. Unfortunately, this component has so many execution paths that our algorithm ran out of RAM, and thus it could not finish in a reasonable time. In practice, we consider such large controls as extreme cases though: these were specific to P4_14, as P4_16 introduced custom controls that enabled better separation of concerns (resulting in smaller CFGs). We also decided not to measure ParserImpl. It has the same size as egress, yet it is very flat, promising very fast executions. Unfortunately, it contains a self-loop, and cycles are deprecated in current versions of the P4 specification.
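The path-based cost formula of Section 1.2 can be made concrete on the ingress CFG of Figure 1. The following self-contained Java sketch is our own illustration (not one of the measured implementations): it enumerates the three root-to-exit paths by depth-first search, under the simplified model used for the measurements in this paper (unit vertex costs, uniform sibling-edge probabilities).

```java
import java.util.*;

// Illustration only: exhaustive path enumeration computing
// E(g) = sum over paths pi of P(pi) * cost(pi) on the ingress CFG of
// Figure 1, with unit vertex costs and uniform sibling-edge probabilities.
public class IngressCostSketch {

    static final Map<String, List<String>> CFG = new LinkedHashMap<>();
    static {
        CFG.put("__START__",    Arrays.asList("isValid"));
        CFG.put("isValid",      Arrays.asList("port_mapping", "__EXIT__")); // TRUE / FALSE
        CFG.put("port_mapping", Arrays.asList("bd"));
        CFG.put("bd",           Arrays.asList("ipv4_fib"));
        CFG.put("ipv4_fib",     Arrays.asList("ipv4_fib_lpm", "nexthop")); // on_miss / default
        CFG.put("ipv4_fib_lpm", Arrays.asList("nexthop"));
        CFG.put("nexthop",      Arrays.asList("__EXIT__"));
        CFG.put("__EXIT__",     Collections.<String>emptyList());
    }

    /** Average cost: probability-weighted cost over all root-to-exit paths. */
    public static double averageCost() {
        return dfs("__START__", 1.0, 0.0);
    }

    private static double dfs(String v, double prob, double cost) {
        List<String> succ = CFG.get(v);
        if (succ.isEmpty()) return prob * cost;                 // path finished at __EXIT__
        double expected = 0.0;
        for (String w : succ)                                   // uniform branch probability
            expected += dfs(w, prob / succ.size(), cost + 1.0); // unit cost per block entered
        return expected;
    }

    public static void main(String[] args) {
        System.out.println(averageCost()); // 4.25, matching the ingress row of Table 2
    }
}
```

The three full paths have probabilities 1/2, 1/4, 1/4 and costs 2, 6, 7, so E(g) = 0.5·2 + 0.25·6 + 0.25·7 = 4.25, consistent with the ingress row of Table 2.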
Double costAnalysis(GraphTraversalSource g,
                    Consumer<Traverser<Edge>> f,
                    Object sid,
                    Object[] eids) {
    CostState cs = g.withSack(new CostState())
                    .V(sid)
                    .repeat(__.outE().as("e")
                              .sideEffect(f)
                              .inV())
                    .until(__.hasId(P.within(eids)))
                    .sack()
                    .fold(new CostState(),
                          (s, o) -> s.merge((CostState) o))
                    .next();
    return cs.getCost();
}

Figure 2: Executable cost analysis query in Gremlin-Java 3.4.4

3.2. CFG-based cost analysis algorithm

In this paper, we profile a plain Java implementation of the algorithm presented in our earlier work. Beyond that – considering our motivation, stated in the Motivation section, to extend this software with application-centered features –, we adapt this algorithm to GDBs as well. We chose the Gremlin query language to formulate the algorithm, as, according to [1], it is well supported across GDBs utilizing different data models.

Figure 2 depicts the query we constructed in the Java 8 language variant of Gremlin 3.4.4. Only CostState is a custom class; the rest can be understood using the documentation of the Java and Gremlin standard libraries. CostState is basically a pair that keeps account of the probability product and cost sum on the path being traversed. The traversal starts from the vertex with the provided ID, and in each step creates a new copy of its local CostState, with its information updated by function f. This function accesses the current path history in the Traverser (assumed to include only named elements, which are the edges in this case), queries the probability of the current edge (conditional on the path) and the cost of the destination vertex of that edge (conditional on the path), and updates the probability product and cost sum belonging to this path in the sack. The copying ensures that parallel paths will not affect each other's state. When the exit nodes are reached, the states from all paths are collected, weighted, and summarized into one number: the average cost.
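The paper treats CostState as a black box. The following plain-Java sketch is one plausible reconstruction of the behaviour just described; only merge() and getCost() appear in Figure 2, so the field names and the step() helper are our own guesses, not the authors' actual implementation.

```java
// Hypothetical reconstruction of the CostState sack described above: a
// (probability product, cost sum) pair per traversed path, folded into a
// probability-weighted sum at the end.
public class CostState {
    private double prob = 1.0; // product of edge probabilities along this path
    private double cost = 0.0; // sum of vertex costs along this path

    /** One traversal step: copy the state (so sibling paths stay independent),
     *  multiply in the edge probability, add the destination vertex cost. */
    public CostState step(double edgeProb, double vertexCost) {
        CostState next = new CostState();
        next.prob = this.prob * edgeProb;
        next.cost = this.cost + vertexCost;
        return next;
    }

    /** Fold step: fold one finished path into the seed, weighted by P(pi). */
    public CostState merge(CostState path) {
        this.cost += path.prob * path.cost;
        return this;
    }

    public double getCost() { return cost; }
}
```

Folding the three ingress paths of Figure 1 (probabilities 1/2, 1/4, 1/4; costs 2, 6, 7 under the unit-cost model) into a fresh seed yields getCost() = 4.25, consistent with Table 2.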
While the high-level nature of GDB query languages makes it difficult to map the Java implementation to the query language exactly, we believe this naive query approximates the runtime characteristics of the original algorithm reasonably well.

In comparison to our earlier paper [3], here we applied an extremely oversimplified environment model (and used the predicted value only for checking consistency). Namely, all vertices have a cost of 1 unit, and sibling edges have uniform probability (that is, if the source vertex of an edge e has n children, then e has probability 1/n).

4. Evaluation

Here, we describe the machine environment in which we conducted the experiment, and the measured results regarding the two P4 code samples.

4.1. Configuration

Table 3 collects the specifications of the machine we performed the experiments on. We should note that all providers were running the query on a single JVM thread, so these numbers characterize the execution environment only to a limited extent. All our experiments are warm-cache measurements: each provider was tested in its own JVM instance, but all tests were run directly one after the other, without restarting the JVM.

CPU                 | Clockspeed | Cores^2 | L1     | L2     | L3   | RAM
Intel Core i7-7500U | 3.5 GHz    | 4       | 128KiB | 512KiB | 4MiB | 7872 MiB

Table 3: Specifications of the processing environment

4.2. Case basic_routing-bmv2.p4/ingress

We measured algorithm runtime w.r.t. the two cases differently. For the small case, we simply repeated the experiment 100 times to average out random noise. The results are depicted in Table 4.

Provider     | Java   | TinkerGraph | Neo4J   | OrientDB | JanusGraph
Runtime (μs) | 172.16 | 1964.64     | 2517.68 | 5253.09  | 6728052.78

Table 4: Algorithm runtime on a small CFG input

In this small case, the GDB overhead results in performance at least one order of magnitude worse than that of the Java implementation, but the CFG was still handled almost instantaneously by most backends.
(The large, 6-second overhead of JanusGraph was probably caused by improper configuration on our part.)

Not all cost analysis approaches can provide this performance: traditional syntax-tree-transformation-based cost analysis methods are expected to have performance that is worse by many orders of magnitude.

We can expect this performance to be stable even if new environment information is added to the model. Expanding nodes (as described in [3]) into sub-CFGs should only result in a linear increase (unless we have no information about the input distribution of the expanded component, in which case this must be inferred from the incoming paths).

^2 All queries were executed on a single core.

4.3. Case switch.p4/egress

For the large case, we examined what happens if we only process an initial segment, or prefix, of the graph. For each possible path length n, we selected the subgraph induced by the vertices within distance n from the start vertex. We then applied the algorithm to this subgraph and measured its runtime. We picked 10 path lengths as percentages of the longest path π in the CFG: vertices in the prefixes were at least (1/10)|π|, (2/10)|π|, …, |π| distant from the start vertex (possibly more, because of joining roundabout paths). For each length, we repeated the experiment 25 times and averaged the samples. The results are depicted in Figure 3.

Figure 3: Algorithm runtime on a large CFG input

Almost all charts show a similar curve: presumably, the 40%-length prefix introduces a large number of paths, but this number does not grow significantly in longer prefixes. We speculate that this CFG has a diamond shape, where the diamond resides around distance (4/10)|π| from the start vertex. Interestingly, all charts also consistently show a convex valley at the 60% mark; that is, CFGs with longer and possibly more numerous paths were processed faster.
This phenomenon possibly indicates cache warming, or it may be an interplay between the traversal implementations and the CFG topology.

Among all the providers, the Java algorithm was the fastest, closely followed by the query propelled by the in-memory TinkerGraph. This is a positive result, implying that the query and the original algorithm indeed possess similar runtime characteristics. But, since Gremlin is a bytecode-compiled language, the difference is not necessarily explained by overhead alone, and instead hints at an actual deviation between the two formulations.

Somewhat surprisingly, Neo4J (a persistent database) performed quite close to the in-memory solutions. Neo4J’s advantage over the other persistent databases is most likely due to the combination of efficient caching and it being a native GDB: the space-efficient representation requires less file-level access, which enables Neo4J to behave similarly to the in-memory graphs.

In this experiment, non-native persistent GDBs showed weaker performance characteristics, but we should note that this could be an effect of the design of our graph query. It concerns only the traversal itself, while data access was implemented on the heap. Adding this data to the graph itself, and including more features (such as the processing of sub-CFGs, described in [3]), may change these characteristics. Using dedicated configurations on these databases (instead of the default ones) may also introduce significant improvements.

5. Conclusion

In this paper, we reformulated our earlier procedural P4 cost analysis algorithm as a graph database query. Our motivation is to leverage built-in GDB features, such as scalability and multi-user concurrency, but it was not clear whether the GDB overhead would force us to trade off analysis efficiency. To substantiate that using GDBs is viable for P4 cost analysis, we measured how our GDB query – backed by various backends – compares against our earlier algorithm, over large, real-world P4 use cases.
To answer the question in the title: our results imply that, yes, we can utilize GDBs without losing much efficiency. While GDB backends must be chosen and configured properly for the purpose, the portability of the query language will presumably make these efforts feasible as well.

We should note that these measurements are far from conclusive yet. Both our original algorithm and the query presented here can be considered naive: while they can analyse small CFGs in real time, future optimizations (e.g. parallelization) may enable faster processing even for very large CFGs (such as those in switch.p4) as well. We treated all GDBs as black boxes: precise configuration of these may drastically improve their runtime behaviour. Devising further graph queries in languages other than Gremlin would enable us to investigate more graph backends (e.g. to compare various in-memory GDBs against each other). We leave the in-depth exploration of these topics to future work.

References

[1] Besta, M., Peter, E., Gerstenberger, R., Fischer, M., Podstawski, M., Barthels, C., Alonso, G., and Hoefler, T. Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries, 2019.

[2] Bosshart, P., Daly, D., Gibb, G., Izzard, M., McKeown, N., Rexford, J., Schlesinger, C., Talayco, D., Vahdat, A., Varghese, G., and Walker, D. P4: Programming Protocol-independent Packet Processors. SIGCOMM Comput. Commun. Rev. 44, 3 (July 2014), 87–95.

[3] Lukács, D., Pongrácz, G., and Tejfel, M. Performance guarantees for P4 through cost analysis. In Informatics’2019, 2019 IEEE 15th International Scientific Conference on Informatics, Poprad, Slovakia, November 20–22 (2019), W. Steingartner, Š. Korecko, and A. Szakál, Eds., IEEE, pp. 242–247.

[4] P4 Language Consortium. basic_routing-bmv2.p4, a small test case for the official P4 reference compiler, P4C, 2018.

[5] P4 Language Consortium. switch.p4, a large P4 project handling protocols of L2/L3 data planes, 2016.

[6] Rodriguez, M. A. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages (DBPL 2015), 2015.

[7] Urma, R.-G., and Mycroft, A. Source-code queries with graph databases – with application to programming language usage and evolution. Science of Computer Programming 97 (2015), 127–134. Special Issue on New Ideas and Emerging Results in Understanding Software.