An Empirical Analysis of GraphQL API Schemas in Open Code Repositories and Package Registries Yun Wan Kim1 , Mariano P. Consens1 , and Olaf Hartig2 1 University of Toronto, Canada timyun.kim@mail.utoronto.ca consens@mie.utoronto.ca 2 Linköping University, Sweden olaf.hartig@liu.se Abstract. GraphQL is a query language for APIs that has been increas- ingly adopted by Web developers since its specification was open sourced in 2015. The GraphQL framework lets API clients tailor data requests by using queries that return JSON objects described using GraphQL Schema. We present initial results of an exploratory empirical study with the goal of characterizing GraphQL Schemas in open code repositories and package registries. Our first approach identifies over 20 thousand GraphQL-related projects in publicly accessible repositories hosted by GitHub. Our second, and complementary, approach uses package reg- istries to find over 37 thousand dependent packages and repositories. In addition, over 2 thousand schema files were loaded into the GraphQL-JS reference implementation to conduct a detailed analysis of the schema in- formation. Our study provides insights into the usage of different schema constructs, the number of distinct types and the most popular types in schemas, as well as the presence of cycles in schemas. 1 Motivation and Approach The schema of a GraphQL API describes the data and the types of queries supported by the API. An empirical study of the GraphQL schemas used by open source projects, therefore, provides useful information about the characteristics of data interfaces. Currently, there is no comprehensive collection of such schemas or a tool that helps gather schemas from GraphQL APIs. The goal of the work presented in this paper is i) to establish a method to extract schemas into a single collection for analyses and ii) to conduct an empirical analysis of the schemas. 1.1 Data Collection Method APIs-guru has the most comprehensive list of public GraphQL APIs with links to endpoints and their documentation. By using APIs-guru, combined with man- ual effort through keyword searching, we collected 67 schemas of distinct APIs. Authentication requirements for most publicly available APIs hindered the effi- ciency and possible automation of schema extraction. Hence, we decided to take a different approach by extracting schemas from open source repositories from GitHub and used three sources to identify GraphQL repositories. 2 Y. Kim et al. GitHub API As of June, 2018, there were more than 20,000 repositories on GitHub matching the keyword “graphql” and 2,000 repositories matching the keywords “graphql api”. Libraries.io API Decan et al. [1] explored security vulnerabilities of NPM pack- ages that were dependent on vulnerable packages. Following a similar method, we identified over 37,000 repositories dependent on GraphQL reference imple- mentations. GHTorrent Archived data of GHTorrent is hosted on Google’s Big Query plat- form. We identified over 5,000 repositories matching the keyword “graphql”. Table 1. Summary of GraphQL Table 2. Number of dependent repositories repositories identified. for the most popular implementations. Method NumRepositories Package language Count GitHub API 20,635 NPM/graphql JavaScript 12,700 Libraries.io API 37,588 Pypi/graphene Python 310 GHTorrent 5,188 Rubygems/graphql Ruby 470 By using string search for schema for every repository file’s full file-path, it was possible to identify exact path of potential schema files and their repository data. Our assumption is that this method returns a considerable portion of actual schemas available such that this portion is representative for the entire population of GraphQL schemas publicly availables. We found that schema files are most often named schema.json, schema.js, and schema.graphql for single-file schemas. For modular schemas, the files are most often separated by types, queries, mutations, and subscriptions but are contained in directories with the name schema or schemas. After downloading all potential schema files, we tried to load each of them via GraphQL-JS. A successful attempt indicated a valid schema and a failure indicated an invalid schema or an irrelevant file. We identified duplicates through several methods including Levenshtein distance and cosine similarity. 2 Analysis Results We identified a total of 2,777 valid but non-distinct schemas using the proposed method. 1,880 files were unique JSON-formatted schemas. We also conducted an exhaustive search excluding the “schema” keyword on all GraphQL-related repositories to collect a larger list of 3,949 schemas. The union of the two methods resulted in 4,095 schemas and, by using cosine similarity to filter duplicates, 2,081 schemas were unique. Figure 1 illustrates the number of schemas per source and the overlap of sources. This illustration shows that the different approaches to collect GraphQL schemas are non-redundant. An Empirical Analysis of GraphQL API Schemas 3 Fig. 1. Number of schemas by sources. Fig. 2. Number of cycles per schema. To estimate our recall, we downloaded all .json and .graphql files from all repositories found with the keyword “graphql”. By using the 3,949 valid schema counts, the estimated recall of our method is ca. 70% and the precision is 1.8%. There are five major components of GraphQL schemas that describes the supported operations: Query, Object, Mutation, Subscription, and Directive. While every GraphQL server needs to support queries, which fetch information about data objects, other operations are not necessarily required. Only about 20% of the schemas have the Subscription type that can push information, while about 70% have the Mutation type via which the stored data can be changed. Table 4. The ten most common object types. Table 3. Number of non-empty Object type Frequency components in the 2,081 schemas. Node 1,009 Schema components Frequency PageInfo 922 User 879 object types 2,079 UserConnection 336 query type 2,079 UserEdge 307 directives 2,059 BatchPayload 220 mutation type 1,440 Viewer 215 subscription type 414 UserPreviousValues 190 Post 182 Object types dictate what information is exchanged between the users and the servers. We find that even after excluding scalar types and type definitions such as Query and Mutation, the most common types are generic types affiliated with reference implementations as shown in Table 4. Node is a reserved interface type for reference implementations such as Apollo and Relay with an identifier field and is the most common. We traversed each schema in its JSON format recursively to identify their levels of nesting. We find that the median number of levels is 9 and the median number of levels only considering object types is 6. Excluding introspection and scalar type definitions, most schemas have only one level of nesting. 3 Cycles in GraphQL Schemas Another interesting question is whether the relationships between the types in the schemas form directed cycles, because only if such cycles exist, the data 4 Y. Kim et al. exposed via a GraphQL API may contain directed cycles and these, in turn, may cause an undesired overhead during query processing [2]. Hence, we analyze GraphQL schemas as directed graphs. The vertices in such a graph for a given schema correspond to the object types, the interface types, and the union types in the schema. For every field definition whose value type is based on one such type, the graph contains an edge from the vertex that repre- sents the type in which the field definition appears to the vertex that represents the value type of the field definition. Additionally, there are edges from interface types to their implementing object types and, similarly, from union types to their participating object types. In this paper we focus only on simple cycles; that is directed cycles in which repetition of vertices is not allowed. For the analysis we use a program3 that loads a schema, generates the cor- responding graph representation of this schema, and then enumerates the sim- ple cycles in the generated graph. For the latter step, the program applies a combination of Johnson’s algorithm [3] to enumerate the cycles and Tarjan’s algorithm [4] to first divide the graph into its strongly connected components, which is a prerequisite of Johnson’s algorithm. To run the program for each of the 2,094 schemas we use an ordinary desktop computer with 8 GB of RAM. We find that 832 of the 2,094 schemas (39.7%) contain at least one simple cycle. For a more detailed analysis of these cycles we can, unfortunately, focus only on 788 of the 832 schemas; the other 44 schemas contain so many simple cycles (at least 10M in each of them) that enumerating these cycles causes the program to crash with an out-of-memory exception. The distribution of the number of cycles in the remaining 788 schemas is illustrated in Figure 2. As can be observed, the distribution resembles a power law. In more detail, 2 schemas contain more than 100K cycles (that is 0.3% of the 788 schemas), where the maximum is 256,348 cycles; 9 schemas contain more than 10K cycles (that is 1.1%); 41 schemas contain more than 1K cycles (5.2%); 73 contain more than 100 cycles (9.3%); 152 contain at least 10 cycles (19.3%), and 543 contain more than one cycle (68.9%). Hence, 31.1% contain exactly one cycle only. Moreover, the average length of all cycles within each schema ranges from 2.0 to 20.5, but there is no correlation between this average length and the number of cycles. Similarly, we do not find a correlation between the number of cycles and the number of vertices or edges. 4 Concluding Remarks This preliminary report describes our approach to collect and analyze thousands of GraphQL schemas from open project repositories. Initial descriptive and struc- tural properties of the collected schemas were presented. The collection has also enabled additional analysis (not included in this contribution) such as temporal characteristics of repository commits and co-committer relationships. 3 https://github.com/LiUGraphQL/graphql-schema-cycles An Empirical Analysis of GraphQL API Schemas 5 Acknowledgements. The authors thank Jonas Lind and Kieron Soames who, as part of their thesis project at Linköping University, have developed the cycle enumeration program and applied it to our collection of schemas. Olaf Hartig’s work on this paper has been funded by the CENIIT program at Linköping University (project no. 17.05). References 1. Decan, A., Mens, T., Constantinou, E.: On the impact of security vulnerabilities in the npm package dependency network. In: Proceedings of the 15th Interna- tional Conference on Mining Software Repositories. pp. 181–191. MSR ’18 (2018), http://doi.acm.org/10.1145/3196398.3196401 2. Hartig, O., Pérez, J.: Semantics and Complexity of GraphQL. In: Proceedings of the 2018 World Wide Web Conference. pp. 1155–1164. WWW ’18, Republic and Canton of Geneva, Switzerland (2018), https://doi.org/10.1145/3178876.3186014 3. Johnson, D.B.: Finding all the elementary circuits of a directed graph. SIAM J. Comput. 4(1), 77–84 (1975). https://doi.org/10.1137/0204007, https://doi.org/10.1137/0204007 4. Tarjan, R.E.: Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160 (1972), https://doi.org/10.1137/0201010