<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Synthetic Graph Generation from Finely-Tuned Temporal Constraints</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Karim Alami</string-name>
          <email>karim.alami@etudiant.univ-bpclermont.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radu Ciucanu</string-name>
          <email>ciucanu@isima.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Engelbert Mephu Nguifo</string-name>
          <email>mephu@isima.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universite Clermont Auvergne &amp; CNRS LIMOS</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large-scale graphs are at the core of a plethora of modern applications such as social networks, transportation networks, or the Semantic Web. Such graphs are naturally evolving over time, which makes particularly challenging graph processing tasks e.g., graph mining. To be able to realize rigorous empirical evaluations of research ideas, the graph processing community needs nely-tuned generators of synthetic time-evolving graphs, which are particularly useful whenever real-world graphs are unavailable for public use. The goal of this paper is to report on an ongoing project that aims at generating synthetic time-evolving graphs satisfying nely-tuned temporal constraints speci ed by the user.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Large-scale graphs are used to model a variety of real-world domains. In practice,
both nodes and edges of such graphs have properties that are naturally evolving
over time. For example, in a geographical database storing information about
cities and transportation facilities, the nodes of type city have evolving
properties e.g., weather and air quality, whereas the edges that encode transportation
facilities between cities have evolving properties e.g., price and duration.</p>
      <p>
        The graph evolution over time makes particularly challenging the graph
processing tasks e.g., graph querying and mining. This motivated the community
to propose frameworks for building applications on time-dependent graphs e.g.,
Graphast [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], or historical graph management systems e.g., DeltaGraph [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>To be able to realize rigorous empirical evaluations of research ideas, the
graph processing community needs tunable evolving graph generators, which are
particularly useful whenever real-world graphs are unavailable for public use.</p>
      <p>
        In this paper, we report on our ongoing research on EGG (Evolving Graph
Generator), an open-source1 framework for generating evolving graphs based on
nely-tuned temporal constraints given by the user. We depict the architecture
of EGG in Fig. 1. We built EGG on top of gMark [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a state-of-the-art static graph
generator. EGG takes as input (i) an initial graph generated by gMark, and (ii)
an evolving graph con guration that encodes how the evolving properties of
the nodes and edges of the graph from point (i) should evolve over time. The
output of EGG is a sequence of graph snapshots, each annotated with an interval
      </p>
      <sec id="sec-1-1">
        <title>1 https://github.com/karimalami7/EGG</title>
        <p>Evolving graph con guration
# of snapshots
Evolving properties (nodes and edges)
Evolution constraints</p>
        <p>Static graph con guration</p>
        <p>Size
Node and edge types
Occurrence constraints
Degree distributions</p>
        <p>EGG
Evolving graph generator</p>
        <p>gMark
Static graph generator</p>
        <p>RDF annotated
with temporal
information
of integers. The semantics of time depends on the use case. In the running
example of this paper, a snapshot corresponds to a day. In other use cases that
we developed, we used snapshots representing e.g., years, semesters, and weeks.</p>
        <p>Similarly to gMark, EGG is schema-driven and domain-independent.
Throughout the paper, we use as running example a geographical database, but we have
been additionally able to easily encode di erent application domains such as a
social network or a DBLP-like bibliographical network.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2 Overview of EGG</title>
      <p>We present gMark static graph con gurations (Section 2.1), EGG evolving graph
con gurations (Section 2.2), EGG implementation challenges (Section 2.3), and
preliminary EGG experimental results (Section 2.4).
2.1</p>
      <sec id="sec-2-1">
        <title>Static Graph Con gurations</title>
        <p>We illustrate via an example the static graph con guration that the user can
specify as gMark input. Assume that the user wants to generate graphs simulating
a geographical database storing information about cities, and di erent facilities
such as transportation and hotels. The user can specify as gMark input the
following types of constraints: (i) graph size, given as # of nodes; (ii) node types
e.g., city and hotel, and edge types e.g., train and locatedIn; (iii) occurrence
constraints e.g., 10% of the graph nodes should be of type city, whereas 90%
of the graph nodes should be of type hotel; (iv) degree distributions e.g.,
source type predicate target type</p>
        <p>In-distribution</p>
        <p>
          Out-distribution
hotel locate!dIn city Zip an Uniform [
          <xref ref-type="bibr" rid="ref1 ref1">1,1</xref>
          ]
        </p>
        <p>
          !
meaning that we can have an edge of type locatedIn from a node of type hotel
to a node of type city, with a Zip an in-distribution (since it is realistic to
assume that the number of hotels in a city follows such a power-law distribution)
and a uniform [
          <xref ref-type="bibr" rid="ref1 ref1">1,1</xref>
          ] out-distribution (since a hotel is located in precisely one city).
        </p>
        <p>We call such gMark graph con gurations as being static since the nodes
of type e.g., city and hotel are rarely created or deleted. Nonetheless, such
nodes (as well as the di erent edges connecting them) possess properties that
naturally evolve over time, in an interdependent manner. The user can specify
such evolving properties as input of EGG, that we detail in the next section.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Evolving Graph Con gurations</title>
        <p>We build on the example from Section 2.1 to introduce examples of evolving
properties that can be encoded as EGG evolving graph con gurations. Assume
that the user generates with gMark a graph having nodes of type city and hotel,
and edges of type train (connecting two cities) and locatedIn (connecting a
hotel to its city). Next, the user wants to add properties that evolve over time for
the aforementioned types of nodes and edges, assuming that a graph snapshot
corresponds to a day. We give below examples of such properties, together with
their nely-tuned constraints to evolve among consecutive graph snapshots.</p>
        <p>A node of type city has the following evolving properties:
{ weather (unordered qualitative), which can have three uniformly-distributed
values (sunny, rainy, cloudy). There is a probability of 50% that weather
changes from a snapshot to the next one. Each of the three values can evolve
between two consecutive snapshots to any of the other ones, with the exception
of sunny, that cannot be followed by rainy.</p>
        <p>{ qAir i.e., \quality of air" (ordered qualitative), which can have ten
possible values, following a binomial distribution with p=0.6. There is a probability
of 20% that qAir changes from a snapshot to the next one, and it can only
increment or decrement by 1 between two consecutive snapshots.</p>
        <p>Moreover, a node of type hotel has the following evolving properties:
{ star (ordered qualitative), which can have ve possible values, following a
geometric distribution with p=0.65. It can only change every thirty snapshots,
with a probability of 10%, and it can only increment or decrement by 1.</p>
        <p>
          { availRoom (quantitative discrete), which can have as values integers in
the interval [
          <xref ref-type="bibr" rid="ref1">1,100</xref>
          ], following a binomial distribution with p=0.5. There is a
probability of 80% that it changes from a snapshot to the next one, and it can
increment or decrement by an integer up to 5 between two consecutive snapshots.
        </p>
        <p>{ hotelPrice (quantitative continuous), whose values follow a normal
distribution in an interval that is dynamically constructed depending on the value
of the property star. Moreover, hotelPrice is anti-correlated with availRoom
i.e., if availRoom decreases, then hotelPrice increases, and vice-versa.</p>
        <p>The user can similarly specify evolving properties and constraints for edges
e.g., property trainPrice of edge type train.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 Implementation Challenges</title>
        <p>Building a system like EGG is an ambitious goal because, as outlined in the
previous section, we aim at allowing the user to specify very expressive constraints.
This leads to some interesting challenges, that we brie y discuss in this section.</p>
        <p>Computational complexity. As illustrated in Section 2.2, we allow the user to
specify evolution constraints where the value of a property among consecutive
snapshots depends on another property. We model the inter-dependencies
between such evolving properties with a dependency graph. It is easy to see that
if the aforementioned dependency graph is cyclic, the generation algorithm may
not halt. Consequently, in our implementation we require that the dependency
graph is acyclic and we sort it topologically to decide in which order we should
apply the evolution constraints. Even for acyclic dependency graphs, we suspect
that it is NP-complete to decide whether there exists a sequence of graph
snapshots satisfying the input constraints. The exact complexity is an open question
that we plan to attack in the near future.</p>
        <p>Storage redundancy. A naive solution to store the generated graph snapshots
would be to entirely store each of them after it is generated. For the parts of the
graph that do not change every snapshot, this yields redundant storage. This is
why we rely on a storage format that uses named graphs to express temporal
information in RDF. Our output format (that we serialize using the TriG syntax)
allows us to decouple the storage of the static parts of the graph (i.e., structural
information satis ed in all snapshots) and the evolving parts of the graph (i.e.,
the property values that change from a snapshot to the next one).
2.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Experimental Evaluation of EGG</title>
        <p>We implemented in Python the rst prototype of
EGG. The user can easily encode in JSON
constraints as those presented in Section 2.2.</p>
        <p>We observe the accuracy of the evolving graphs
generated with EGG in Fig. 2, where we depict the
evolving properties of one node of type hotel as
shown by the EGG visualization module. In partic- Fig. 2. Evolving properties
ular, we observe that the anti-correlation between for a node of type hotel.
availRooms and hotelPrice follows the constraints
shown in Section 2.2.</p>
        <p>As for the scalability evaluation, we have run EGG with several use cases
and graphs of increasing sizes. In all cases, we observed that EGG has a linear
time behavior with respect to the studied parameters. More details about our
experimental observations are available on the EGG wiki2.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Work</title>
      <p>We presented our ongoing research on EGG, a system for generating evolving
graphs based on nely-tuned constraints speci ed by the user. EGG is
opensource and can be already used by the graph processing community.</p>
      <p>We next plan to thoroughly study the impact of EGG in improving the
empirical evaluations of existing temporal graph querying and mining systems.</p>
      <sec id="sec-3-1">
        <title>2 https://github.com/karimalami7/EGG/wiki</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G.</given-names>
            <surname>Bagan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bonifati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ciucanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H. L.</given-names>
            <surname>Fletcher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lemay</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Advokaat</surname>
          </string-name>
          . gMark:
          <article-title>Schema-driven generation of graphs and queries</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <volume>29</volume>
          (
          <issue>4</issue>
          ):
          <volume>856</volume>
          {
          <fpage>869</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>U.</given-names>
            <surname>Khurana</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Deshpande</surname>
          </string-name>
          .
          <article-title>E cient snapshot retrieval over historical graph data</article-title>
          .
          <source>In ICDE</source>
          , pages
          <volume>997</volume>
          {
          <fpage>1008</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. R. P. Magalh~aes, G. Coutinho,
          <string-name>
            <surname>J. A. F.</surname>
          </string-name>
          de Mac^edo,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Cruz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Nascimento</surname>
          </string-name>
          .
          <article-title>Graphast: an extensible framework for building applications on time-dependent networks</article-title>
          .
          <source>In SIGSPATIAL, pages 93:1{93:4</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>