=Paper=
{{Paper
|id=Vol-2369/short04
|storemode=property
|title=An Empirical Analysis of GraphQL API Schemas in Open Code Repositories and Package Registries
|pdfUrl=https://ceur-ws.org/Vol-2369/short04.pdf
|volume=Vol-2369
|authors=Yun Wan Kim,Mariano Consens,Olaf Hartig
|dblpUrl=https://dblp.org/rec/conf/amw/KimCH19
}}
==An Empirical Analysis of GraphQL API Schemas in Open Code Repositories and Package Registries==
Yun Wan Kim¹, Mariano P. Consens¹, and Olaf Hartig²
¹ University of Toronto, Canada
timyun.kim@mail.utoronto.ca, consens@mie.utoronto.ca
² Linköping University, Sweden
olaf.hartig@liu.se
Abstract. GraphQL is a query language for APIs that has been increasingly
adopted by Web developers since its specification was open sourced in 2015.
The GraphQL framework lets API clients tailor data requests by using queries
that return JSON objects described using GraphQL Schema. We present initial
results of an exploratory empirical study with the goal of characterizing
GraphQL Schemas in open code repositories and package registries. Our first
approach identifies over 20 thousand GraphQL-related projects in publicly
accessible repositories hosted by GitHub. Our second, and complementary,
approach uses package registries to find over 37 thousand dependent packages
and repositories. In addition, over 2 thousand schema files were loaded into
the GraphQL-JS reference implementation to conduct a detailed analysis of the
schema information. Our study provides insights into the usage of different
schema constructs, the number of distinct types and the most popular types in
schemas, as well as the presence of cycles in schemas.
1 Motivation and Approach
The schema of a GraphQL API describes the data and the types of queries
supported by the API. An empirical study of the GraphQL schemas used by open
source projects, therefore, provides useful information about the characteristics
of data interfaces. Currently, there is no comprehensive collection of such schemas
or a tool that helps gather schemas from GraphQL APIs. The goal of the work
presented in this paper is i) to establish a method to extract schemas into a single
collection for analyses and ii) to conduct an empirical analysis of the schemas.
1.1 Data Collection Method
APIs-guru has the most comprehensive list of public GraphQL APIs, with links
to endpoints and their documentation. By using APIs-guru, combined with manual
keyword searching, we collected 67 schemas of distinct APIs. Authentication
requirements for most publicly available APIs hindered the efficiency and
possible automation of schema extraction. Hence, we took a different approach:
we extracted schemas from open source repositories on GitHub, using three
sources to identify GraphQL repositories.
GitHub API. As of June 2018, there were more than 20,000 repositories on
GitHub matching the keyword “graphql” and 2,000 repositories matching the
keywords “graphql api”.
Libraries.io API. Decan et al. [1] explored security vulnerabilities of NPM
packages that were dependent on vulnerable packages. Following a similar
method, we identified over 37,000 repositories dependent on GraphQL reference
implementations.
GHTorrent. Archived data of GHTorrent is hosted on Google’s BigQuery platform.
We identified over 5,000 repositories matching the keyword “graphql”.
Table 1. Summary of GraphQL repositories identified.
Method            NumRepositories
GitHub API        20,635
Libraries.io API  37,588
GHTorrent         5,188

Table 2. Number of dependent repositories for the most popular implementations.
Package           Language    Count
NPM/graphql       JavaScript  12,700
Pypi/graphene     Python      310
Rubygems/graphql  Ruby        470
By searching the full file path of every repository file for the string
“schema”, we could identify the exact paths of potential schema files together
with their repository data. Our assumption is that this method returns a
considerable portion of the schemas actually available, such that this portion
is representative of the entire population of publicly available GraphQL
schemas. We found that single-file schemas are most often named schema.json,
schema.js, or schema.graphql. Modular schemas are most often split into files
by types, queries, mutations, and subscriptions, and these files are contained
in directories named schema or schemas.
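The path-based search described above can be sketched as follows. This is a minimal illustration, assuming the repository file paths are already available as plain strings; the function name classify_path and the labels it returns are hypothetical and are not taken from the tooling used in the study.

```python
from pathlib import PurePosixPath
from typing import Optional

# Single-file schema names reported in the text.
SINGLE_FILE_NAMES = {"schema.json", "schema.js", "schema.graphql"}
# Directory names reported for modular schemas.
SCHEMA_DIRS = {"schema", "schemas"}


def classify_path(path: str) -> Optional[str]:
    """Classify a repository file path (hypothetical helper).

    Returns None when the path does not contain the string "schema",
    otherwise one of "single-file", "modular", or "other".
    """
    lowered = path.lower()
    if "schema" not in lowered:
        return None  # not a candidate schema file
    parsed = PurePosixPath(lowered)
    if parsed.name in SINGLE_FILE_NAMES:
        return "single-file"
    # Any parent directory called "schema" or "schemas" suggests a modular schema.
    if any(part in SCHEMA_DIRS for part in parsed.parts[:-1]):
        return "modular"
    return "other"
```

A path such as server/schema.graphql is labelled "single-file", whereas src/schemas/types/user.js is labelled "modular". Candidates labelled "other" still contain the search string and would be downloaded and validated like the rest.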
After downloading all potential schema files, we tried to load each of them
via GraphQL-JS. A successful attempt indicated a valid schema and a failure
indicated an invalid schema or an irrelevant file. We identified duplicates through
several methods including Levenshtein distance and cosine similarity.
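The two similarity measures named above can be sketched in a few lines. This is an illustration only: the tokenization by word characters and the bag-of-words vectors are assumptions, since the text does not state how the schema files were vectorized or which thresholds were applied.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed with the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words vectors (tokenization is an assumption)."""
    va = Counter(re.findall(r"\w+", a))
    vb = Counter(re.findall(r"\w+", b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0
```

Two files whose similarity exceeds a chosen threshold would then be treated as duplicates. Cosine similarity over token counts ignores token order and whitespace, so it tolerates reformatted copies of the same schema better than a character-level edit distance does.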
2 Analysis Results
We identified a total of 2,777 valid but non-distinct schemas using the proposed
method. 1,880 files were unique JSON-formatted schemas. We also conducted
an exhaustive search excluding the “schema” keyword on all GraphQL-related
repositories to collect a larger list of 3,949 schemas. The union of the two methods
resulted in 4,095 schemas and, by using cosine similarity to filter duplicates, 2,081
schemas were unique. Figure 1 illustrates the number of schemas per source and
the overlap of sources. This illustration shows that the different approaches to
collect GraphQL schemas are non-redundant.
Fig. 1. Number of schemas by sources.
Fig. 2. Number of cycles per schema.
To estimate our recall, we downloaded all .json and .graphql files from all
repositories found with the keyword “graphql”. Taking the 3,949 valid schemas
found in this way as the reference, the estimated recall of our method is
ca. 70% and the precision is 1.8%.
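The recall figure is consistent with a simple ratio of the two counts reported in this section. The sketch below assumes that recall is computed as the 2,777 valid schemas found by the path-based method divided by the 3,949 valid schemas found by the exhaustive search; the text does not spell out the formula, so this reading is an assumption. The precision of 1.8% cannot be reproduced in the same way because the number of downloaded candidate files is not reported.

```python
def estimated_recall(found_by_method: int, found_by_reference: int) -> float:
    """Share of the reference set that the method also found (assumed definition)."""
    return found_by_method / found_by_reference


# Counts reported in the text.
recall = estimated_recall(2_777, 3_949)
print(f"{recall:.1%}")
```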
There are five major components of GraphQL schemas that describe the
supported operations: Query, Object, Mutation, Subscription, and Directive.
While every GraphQL server needs to support queries, which fetch information
about data objects, other operations are not necessarily required. Only about
20% of the schemas have the Subscription type that can push information, while
about 70% have the Mutation type via which the stored data can be changed.
Table 3. Number of non-empty components in the 2,081 schemas.
Schema components   Frequency
object types        2,079
query type          2,079
directives          2,059
mutation type       1,440
subscription type   414

Table 4. The ten most common object types.
Object type         Frequency
Node                1,009
PageInfo            922
User                879
UserConnection      336
UserEdge            307
BatchPayload        220
Viewer              215
UserPreviousValues  190
Post                182
Object types dictate what information is exchanged between the users and
the servers. We find that even after excluding scalar types and type definitions
such as Query and Mutation, the most common types are generic types affiliated
with reference implementations as shown in Table 4. Node is a reserved interface
type for reference implementations such as Apollo and Relay with an identifier
field and is the most common.
We traversed each schema in its JSON format recursively to identify its
levels of nesting. We find that the median number of levels is 9 and the median
number of levels only considering object types is 6. Excluding introspection and
scalar type definitions, most schemas have only one level of nesting.
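A recursive traversal of this kind can be sketched as below. The counting convention is an assumption: here every enclosing JSON object or array adds one level and scalar values add none. The study may have counted levels differently, for example by counting only object types.

```python
import json


def nesting_depth(value) -> int:
    """Number of nested containers around the deepest scalar (assumed convention)."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0  # strings, numbers, booleans, null


# Hypothetical, heavily shortened introspection-style document.
doc = json.loads('{"data": {"__schema": {"types": [{"name": "User"}]}}}')
print(nesting_depth(doc))
```

Computing this value for every schema and taking the median would give a figure comparable to the one reported, subject to the counting convention chosen.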
3 Cycles in GraphQL Schemas
Another interesting question is whether the relationships between the types in
the schemas form directed cycles, because only if such cycles exist, the data
exposed via a GraphQL API may contain directed cycles and these, in turn,
may cause an undesired overhead during query processing [2].
Hence, we analyze GraphQL schemas as directed graphs. The vertices in such
a graph for a given schema correspond to the object types, the interface types,
and the union types in the schema. For every field definition whose value type is
based on one such type, the graph contains an edge from the vertex that repre-
sents the type in which the field definition appears to the vertex that represents
the value type of the field definition. Additionally, there are edges from interface
types to their implementing object types and, similarly, from union types to
their participating object types. In this paper we focus only on simple cycles,
that is, directed cycles in which no vertex is repeated.
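The graph construction described above can be sketched as follows. The input format is a hypothetical simplification, not the introspection format of GraphQL-JS: each type is a dictionary with optional "fields" (field name to value type name, with list and non-null wrappers already stripped), "interfaces" (for object types), and "members" (for union types).

```python
from typing import Dict, Set


def build_type_graph(schema: Dict[str, dict]) -> Dict[str, Set[str]]:
    """Build the directed type graph; vertices are the keys of `schema`."""
    graph: Dict[str, Set[str]] = {name: set() for name in schema}
    for name, type_def in schema.items():
        # Field edges: from the containing type to the field's value type,
        # kept only if the value type is itself a vertex (scalars are dropped).
        for target in type_def.get("fields", {}).values():
            if target in graph:
                graph[name].add(target)
        # Edges from each interface to the object type implementing it.
        for interface in type_def.get("interfaces", []):
            if interface in graph:
                graph[interface].add(name)
        # Edges from a union type to its participating object types.
        for member in type_def.get("members", []):
            if member in graph:
                graph[name].add(member)
    return graph


# Hypothetical example schema.
example = {
    "Node": {"kind": "INTERFACE", "fields": {"id": "ID"}},
    "User": {"kind": "OBJECT", "fields": {"id": "ID", "posts": "Post"},
             "interfaces": ["Node"]},
    "Post": {"kind": "OBJECT", "fields": {"author": "User"}},
    "SearchResult": {"kind": "UNION", "members": ["User", "Post"]},
}
graph = build_type_graph(example)
```

In this example, User and Post reference each other through their fields, so the resulting graph contains the simple cycle User, Post.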
For the analysis we use a program (see footnote 3) that loads a schema,
generates the corresponding graph representation of this schema, and then
enumerates the simple cycles in the generated graph. For the latter step, the
program applies a
combination of Johnson’s algorithm [3] to enumerate the cycles and Tarjan’s
algorithm [4] to first divide the graph into its strongly connected components,
which is a prerequisite of Johnson’s algorithm. To run the program for each of
the 2,094 schemas we use an ordinary desktop computer with 8 GB of RAM.
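The two-step procedure can be illustrated with the sketch below. It is not the program referenced above. The first function follows Tarjan's algorithm for strongly connected components. The second function is a plain backtracking search inside each component, swapped in for Johnson's algorithm because it is shorter; it reports the same set of simple cycles but lacks Johnson's blocking mechanism and is therefore slower on large graphs. The recursive form also assumes graphs small enough to stay within Python's recursion limit.

```python
from typing import Dict, List, Set


def strongly_connected_components(graph: Dict[str, Set[str]]) -> List[Set[str]]:
    """Tarjan's algorithm, recursive form."""
    index: Dict[str, int] = {}
    low: Dict[str, int] = {}
    stack: List[str] = []
    on_stack: Set[str] = set()
    components: List[Set[str]] = []

    def visit(v: str) -> None:
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of a component
            component: Set[str] = set()
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.add(w)
                if w == v:
                    break
            components.append(component)

    for v in graph:
        if v not in index:
            visit(v)
    return components


def simple_cycles(graph: Dict[str, Set[str]]) -> List[List[str]]:
    """Enumerate each simple cycle once, starting at its smallest vertex."""
    cycles: List[List[str]] = []
    for component in strongly_connected_components(graph):
        order = sorted(component)
        rank = {v: i for i, v in enumerate(order)}
        for start in order:
            path = [start]
            on_path = {start}

            def extend(v: str) -> None:
                for w in sorted(graph.get(v, ())):
                    # Stay inside the component and above the start vertex.
                    if w not in rank or rank[w] < rank[start]:
                        continue
                    if w == start:
                        cycles.append(list(path))
                    elif w not in on_path:
                        path.append(w)
                        on_path.add(w)
                        extend(w)
                        path.pop()
                        on_path.discard(w)

            extend(start)
    return cycles
```

Restricting the search to one strongly connected component at a time avoids exploring vertices that cannot lie on any cycle. Because the number of simple cycles can grow exponentially with the size of a component, collecting all of them in a list, as done here, can exhaust memory on schemas with very many cycles.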
We find that 832 of the 2,094 schemas (39.7%) contain at least one simple
cycle. For a more detailed analysis of these cycles we can, unfortunately, focus
only on 788 of the 832 schemas; the other 44 schemas contain so many simple
cycles (at least 10M in each of them) that enumerating these cycles causes the
program to crash with an out-of-memory exception.
The distribution of the number of cycles in the remaining 788 schemas is
illustrated in Figure 2. As can be observed, the distribution resembles a power
law. In more detail, 2 schemas contain more than 100K cycles (that is 0.3% of
the 788 schemas), where the maximum is 256,348 cycles; 9 schemas contain more
than 10K cycles (that is 1.1%); 41 schemas contain more than 1K cycles (5.2%);
73 contain more than 100 cycles (9.3%); 152 contain at least 10 cycles (19.3%),
and 543 contain more than one cycle (68.9%). Hence, 31.1% contain exactly one
cycle only.
Moreover, the average length of all cycles within each schema ranges from 2.0
to 20.5, but there is no correlation between this average length and the number
of cycles. Similarly, we do not find a correlation between the number of cycles
and the number of vertices or edges.
4 Concluding Remarks
This preliminary report describes our approach to collect and analyze thousands
of GraphQL schemas from open project repositories. Initial descriptive and struc-
tural properties of the collected schemas were presented. The collection has also
enabled additional analysis (not included in this contribution) such as temporal
characteristics of repository commits and co-committer relationships.
Footnote 3: https://github.com/LiUGraphQL/graphql-schema-cycles
Acknowledgements. The authors thank Jonas Lind and Kieron Soames who,
as part of their thesis project at Linköping University, have developed the cycle
enumeration program and applied it to our collection of schemas. Olaf Hartig’s
work on this paper has been funded by the CENIIT program at Linköping
University (project no. 17.05).
References
1. Decan, A., Mens, T., Constantinou, E.: On the impact of security vulnerabilities in the npm package dependency network. In: Proceedings of the 15th International Conference on Mining Software Repositories. pp. 181–191. MSR ’18 (2018), http://doi.acm.org/10.1145/3196398.3196401
2. Hartig, O., Pérez, J.: Semantics and Complexity of GraphQL. In: Proceedings of the 2018 World Wide Web Conference. pp. 1155–1164. WWW ’18, Republic and Canton of Geneva, Switzerland (2018), https://doi.org/10.1145/3178876.3186014
3. Johnson, D.B.: Finding all the elementary circuits of a directed graph. SIAM J. Comput. 4(1), 77–84 (1975), https://doi.org/10.1137/0204007
4. Tarjan, R.E.: Depth-first search and linear graph algorithms. SIAM J. Comput. 1(2), 146–160 (1972), https://doi.org/10.1137/0201010