<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Empirical Analysis of GraphQL API Schemas in Open Code Repositories and Package Registries</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yun Wan Kim</string-name>
          <email>timyun.kim@mail.utoronto.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mariano P. Consens</string-name>
          <email>consens@mie.utoronto.ca</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olaf Hartig</string-name>
          <email>olaf.hartig@liu.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Linkoping University</institution>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Toronto</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>GraphQL is a query language for APIs that has been increasingly adopted by Web developers since its speci cation was open sourced in 2015. The GraphQL framework lets API clients tailor data requests by using queries that return JSON objects described using GraphQL Schema. We present initial results of an exploratory empirical study with the goal of characterizing GraphQL Schemas in open code repositories and package registries. Our rst approach identi es over 20 thousand GraphQL-related projects in publicly accessible repositories hosted by GitHub. Our second, and complementary, approach uses package registries to nd over 37 thousand dependent packages and repositories. In addition, over 2 thousand schema les were loaded into the GraphQL-JS reference implementation to conduct a detailed analysis of the schema information. Our study provides insights into the usage of di erent schema constructs, the number of distinct types and the most popular types in schemas, as well as the presence of cycles in schemas.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The schema of a GraphQL API describes the data and the types of queries
supported by the API. An empirical study of the GraphQL schemas used by open
source projects, therefore, provides useful information about the characteristics
of data interfaces. Currently, there is no comprehensive collection of such schemas
or a tool that helps gather schemas from GraphQL APIs. The goal of the work
presented in this paper is i) to establish a method to extract schemas into a single
collection for analyses and ii) to conduct an empirical analysis of the schemas.
1.1</p>
      <p>Data Collection Method
APIs-guru has the most comprehensive list of public GraphQL APIs with links
to endpoints and their documentation. By using APIs-guru, combined with
manual e ort through keyword searching, we collected 67 schemas of distinct APIs.
Authentication requirements for most publicly available APIs hindered the e
ciency and possible automation of schema extraction. Hence, we decided to take
a di erent approach by extracting schemas from open source repositories from
GitHub and used three sources to identify GraphQL repositories.</p>
      <p>GitHub API As of June, 2018, there were more than 20,000 repositories on
GitHub matching the keyword \graphql" and 2,000 repositories matching the
keywords \graphql api".</p>
      <p>
        Libraries.io API Decan et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] explored security vulnerabilities of NPM
packages that were dependent on vulnerable packages. Following a similar method,
we identi ed over 37,000 repositories dependent on GraphQL reference
implementations.
      </p>
      <p>GHTorrent Archived data of GHTorrent is hosted on Google's Big Query
platform. We identi ed over 5,000 repositories matching the keyword \graphql".</p>
      <p>By using string search for schema for every repository le's full le-path, it
was possible to identify exact path of potential schema les and their repository
data. Our assumption is that this method returns a considerable portion of
actual schemas available such that this portion is representative for the entire
population of GraphQL schemas publicly availables. We found that schema les
are most often named schema.json, schema.js, and schema.graphql for single- le
schemas. For modular schemas, the les are most often separated by types,
queries, mutations, and subscriptions but are contained in directories with the
name schema or schemas.</p>
      <p>After downloading all potential schema les, we tried to load each of them
via GraphQL-JS. A successful attempt indicated a valid schema and a failure
indicated an invalid schema or an irrelevant le. We identi ed duplicates through
several methods including Levenshtein distance and cosine similarity.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Analysis Results</title>
      <p>We identi ed a total of 2,777 valid but non-distinct schemas using the proposed
method. 1,880 les were unique JSON-formatted schemas. We also conducted
an exhaustive search excluding the \schema" keyword on all GraphQL-related
repositories to collect a larger list of 3,949 schemas. The union of the two methods
resulted in 4,095 schemas and, by using cosine similarity to lter duplicates, 2,081
schemas were unique. Figure 1 illustrates the number of schemas per source and
the overlap of sources. This illustration shows that the di erent approaches to
collect GraphQL schemas are non-redundant.</p>
      <p>Object types dictate what information is exchanged between the users and
the servers. We nd that even after excluding scalar types and type de nitions
such as Query and Mutation, the most common types are generic types a liated
with reference implementations as shown in Table 4. Node is a reserved interface
type for reference implementations such as Apollo and Relay with an identi er
eld and is the most common.</p>
      <p>We traversed each schema in its JSON format recursively to identify their
levels of nesting. We nd that the median number of levels is 9 and the median
number of levels only considering object types is 6. Excluding introspection and
scalar type de nitions, most schemas have only one level of nesting.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Cycles in GraphQL Schemas</title>
      <p>
        Another interesting question is whether the relationships between the types in
the schemas form directed cycles, because only if such cycles exist, the data
exposed via a GraphQL API may contain directed cycles and these, in turn,
may cause an undesired overhead during query processing [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Hence, we analyze GraphQL schemas as directed graphs. The vertices in such
a graph for a given schema correspond to the object types, the interface types,
and the union types in the schema. For every eld de nition whose value type is
based on one such type, the graph contains an edge from the vertex that
represents the type in which the eld de nition appears to the vertex that represents
the value type of the eld de nition. Additionally, there are edges from interface
types to their implementing object types and, similarly, from union types to
their participating object types. In this paper we focus only on simple cycles ;
that is directed cycles in which repetition of vertices is not allowed.</p>
      <p>
        For the analysis we use a program3 that loads a schema, generates the
corresponding graph representation of this schema, and then enumerates the
simple cycles in the generated graph. For the latter step, the program applies a
combination of Johnson's algorithm [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to enumerate the cycles and Tarjan's
algorithm [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to rst divide the graph into its strongly connected components,
which is a prerequisite of Johnson's algorithm. To run the program for each of
the 2,094 schemas we use an ordinary desktop computer with 8 GB of RAM.
      </p>
      <p>We nd that 832 of the 2,094 schemas (39.7%) contain at least one simple
cycle. For a more detailed analysis of these cycles we can, unfortunately, focus
only on 788 of the 832 schemas; the other 44 schemas contain so many simple
cycles (at least 10M in each of them) that enumerating these cycles causes the
program to crash with an out-of-memory exception.</p>
      <p>The distribution of the number of cycles in the remaining 788 schemas is
illustrated in Figure 2. As can be observed, the distribution resembles a power
law. In more detail, 2 schemas contain more than 100K cycles (that is 0.3% of
the 788 schemas), where the maximum is 256,348 cycles; 9 schemas contain more
than 10K cycles (that is 1.1%); 41 schemas contain more than 1K cycles (5.2%);
73 contain more than 100 cycles (9.3%); 152 contain at least 10 cycles (19.3%),
and 543 contain more than one cycle (68.9%). Hence, 31.1% contain exactly one
cycle only.</p>
      <p>Moreover, the average length of all cycles within each schema ranges from 2.0
to 20.5, but there is no correlation between this average length and the number
of cycles. Similarly, we do not nd a correlation between the number of cycles
and the number of vertices or edges.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Concluding Remarks</title>
      <p>This preliminary report describes our approach to collect and analyze thousands
of GraphQL schemas from open project repositories. Initial descriptive and
structural properties of the collected schemas were presented. The collection has also
enabled additional analysis (not included in this contribution) such as temporal
characteristics of repository commits and co-committer relationships.
3 https://github.com/LiUGraphQL/graphql-schema-cycles
Acknowledgements. The authors thank Jonas Lind and Kieron Soames who,
as part of their thesis project at Linkoping University, have developed the cycle
enumeration program and applied it to our collection of schemas. Olaf Hartig's
work on this paper has been funded by the CENIIT program at Linkoping
University (project no. 17.05).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Decan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mens</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantinou</surname>
          </string-name>
          , E.:
          <article-title>On the impact of security vulnerabilities in the npm package dependency network</article-title>
          .
          <source>In: Proceedings of the 15th International Conference on Mining Software Repositories</source>
          . pp.
          <volume>181</volume>
          {
          <fpage>191</fpage>
          . MSR '
          <volume>18</volume>
          (
          <year>2018</year>
          ), http://doi.acm.
          <source>org/10</source>
          .1145/3196398.3196401
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hartig</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perez</surname>
          </string-name>
          , J.:
          <article-title>Semantics and Complexity of GraphQL</article-title>
          .
          <source>In: Proceedings of the 2018 World Wide Web Conference</source>
          . pp.
          <volume>1155</volume>
          {
          <fpage>1164</fpage>
          . WWW '
          <volume>18</volume>
          , Republic and Canton of Geneva, Switzerland (
          <year>2018</year>
          ), https://doi.org/10.1145/3178876.3186014
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Johnson</surname>
          </string-name>
          , D.B.:
          <article-title>Finding all the elementary circuits of a directed graph</article-title>
          .
          <source>SIAM J. Comput</source>
          .
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <volume>77</volume>
          {
          <fpage>84</fpage>
          (
          <year>1975</year>
          ). https://doi.org/10.1137/0204007, https://doi.org/10.1137/0204007
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tarjan</surname>
          </string-name>
          , R.E.:
          <article-title>Depth- rst search and linear graph algorithms</article-title>
          .
          <source>SIAM J. Comput</source>
          .
          <volume>1</volume>
          (
          <issue>2</issue>
          ),
          <volume>146</volume>
          {
          <fpage>160</fpage>
          (
          <year>1972</year>
          ), https://doi.org/10.1137/0201010
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>