<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Are Graph Query Languages Applicable for Requirements Traceability Analysis?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Michael Rath</string-name>
          <email>michael.rath@tu-ilmenau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Akehurst</string-name>
          <email>david.akehurst@itemis.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Borowski</string-name>
          <email>christoph.borowski@itemis.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Mader</string-name>
          <email>patrick.maeder@tu-ilmenau.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technische Universitat Ilmenau</institution>
          ,
          <addr-line>Ilmenau</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>itemis AG</institution>
          ,
          <addr-line>Lunen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>[Context &amp; motivation] Maintaining traceability in development projects is a costly and resource expensive task. Once established, it can be used to answer questions about the product and its development process. [Question/problem] Typically, generic query languages for relational databases are used to formulate these questions. Additionally, speci c traceability query languages have been proposed, but they are not widely adopted and show other weaknesses. [Principal ideas/results] Relationships among development artifacts span a rich traceability graph. Directly operating on this data structure rather than on a relational representation by using a graph query language may overcome the limitations of the current work ow. [Contribution] In this paper, typical traceability questions from a user survey were extracted. Using these, the resulting query expression for four di erent languages are brie y compared.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Requirements traceability plays an important role in software development
processes. Traceability analysis aids stakeholders in satisfying information needs
such as requirements validation, coverage analysis, impact analysis, and
compliance veri cation. Considerable e ort needs to be invested to establish
traceability [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, the captured information is only of limited use without being
able to e ectively query and analyze it. Today's developments tools, like IBM
Rational DOORSTM, Enterprise Architect, HP Quality CenterTM, and
Atlassian Jira3 typically allow users to write queries in SQL-like languages. These
languages require in depth knowledge of the underlying relational data schema
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Though, the nature of data, i.e., the artifacts and their relations,
inherently span a graph data structure. This concept is not only di cult to model in
traditional table oriented databases, but its complexity also propagates to the
formulation of queries (e.g., multiple joins). NoSQL approaches gained a lot of
attraction in the database world. One approach, graph databases [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], e ciently
manages information in nodes and edges between them. Graph query languages
were designed to traverse such graphs and to gather requested information.
      </p>
      <p>In this paper, we initially investigate whether graph query languages are
applicable for traceability analysis. We conducted a user survey with stakeholders
of software projects and collected queries common to their daily work. These
queries were systematically analyzed to extract basic features to be provided
by a suitable query language. We selected two graph query languages and
formulated concrete queries based on participants' replies and compare them with
those written in SQL and VTML, a visual language speci cally designed for
traceability related information retrieval.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Query languages selection</title>
      <p>The Structured Query Language (SQL) was introduced in the 1970s for
managing data in relational database management systems (RDBMS). SQL
matured over the last decades, has a large base of users familiar with its syntax, and
is widely used in industry today. We selected SQL for our study as an industrial
baseline for analyzing traceability information.</p>
      <p>
        The visual query language (VTML) [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] was explicitly designed to
intuitively query traceability information. Users model queries without explicit
knowledge of the data's schema and its distribution. The visual representation
of a query closely resembles UML class diagrams enhanced by speci c
annotations for selecting and ltering data. VTML behaves like a meta-language,
unbound to a speci c underlying query language. Their inventors demonstrated
a transformation from VTML to SQL. We selected VTML for our study as a
scienti c baseline for how traceability information can be analyzed today.
      </p>
      <p>Cypher4 is a declarative graph query language accompanying the Neo4j
graph database. Cypher's syntax is inspired by SQL. Most keywords, their
semantics, and the resulting query structure are borrowed making it intuitive to
SQL users and easing the transition. We selected Cypher for that reason.</p>
      <p>
        Gremlin refers to a graph traversal machine and language designed and
developed by the Apache TinkerPop project [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The graph traversal machine
can run on di erent graph databases, including Neo4j. As a query language,
gremlin is a domain-speci c language (DSL) built on Groovy targeting the Java
Virtual Machine. In Gremlin, queries can be expressed either in an imperative
or declarative way. We selected Gremlin because of its turing completeness.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>User Survey</title>
      <p>In September and October 2016 we conducted a user survey with engineers
working in di erent industrial domains. Participants covered the areas
automotive electronics and mechanics; embedded and distributed systems; logistics; and
telecommunications. In particular, we interviewed nine participants employed as
4 https://neo4j.com/docs/developer-manual/current/cypher/</p>
      <p>Feature
id
name</p>
      <p>Use Case
id
name
Component Method
nidame LidoC
Legend</p>
      <p>Traceable
taPKrreeatrycimfeapictrtoyttepptydeeprety ridesTuelstt</p>
      <p>(a) TIM (b) Example graph according to TIM
Fig. 1: TIM extracted from a participant's traceability queries (left) and example
traceability graph complying to the TIM (right).
software developers (4), requirements engineers (3), and software architects (2).
Out of the 35 asked questions about sociodemographic, development process,
and reporting, we were interested in information like
What is the role of requirements traceability during your daily work?
What relevant information do you use for analyses, including data and metrics?
Given answers were systematically analyzed. At rst, we derived an example
set of queries to answer typical traceability questions, e.g. existing link patterns
among artifacts and traceability metrics. Using these, a traceability information
model (TIM) per participant was extracted. Additionally, we determined basis
operations (e.g. artifact selection, ltering) that are to be supported by a query
language expressing the queries.
3.1</p>
      <sec id="sec-3-1">
        <title>Traceability Information Model (TIM)</title>
        <p>
          A traceability information model is a meta-model de ning prescribed traceability
for a project and providing the context in which queries can be speci ed and
executed [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. We applied a two step approach to extract a TIM per participant of
our study. In Step 1, we asked users to formulate trace queries, which would help
them in their daily business. In Step 2, we annotated development artifacts and
prescribed relationships involved in these queries. One example of an extracted
TIM is shown in Figure 1a. The TIM consists of ve development artifacts and
ve traceability relations. Features belong to architectural components, can be
re ned into more ne-grained features, and can eventually be decomposed into
use cases. A method implements one or more use cases and is tested by one or
more tests. Figure 1b exempli es a possible traceability graph complying to this
TIM. The nodes represent artifacts with their type coded in color. The edges
refer to trace links. To illustrate queries, which include ones checking consistency,
this graph is on purpose not complete with respect to the TIM.
3.2
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Basic query language operations</title>
        <p>We identi ed ve operations that a query language must at least support to be
applicable for the gathered queries: Ê selecting artifacts with a speci c type,
SELECT `UseCase`.name FROM `UseCase` AS uc
INONNERucJ.OidIN= `lL1i.nUkseCUasseeCIDase Method` AS l1
INNER JOIN `Method` ON `Method`.id = l1.MethodID
INNER JOIN `Link Method Test` AS l2
ON `Method`.id = l2.MethodID
INNER JOIN `Test` ON `Test`.id = l2.TestID
WHERE `Method`.LoC &gt; 50 AND `Test`.result = "failed"
(a) SQL</p>
        <p>Use Case
name</p>
        <p>Test
(b) VTML
MATC(Hm)(&lt;uc:Us[e:TCEasSeT)&lt;S] ([t:I:MTePstL)EMENTS] (m:Method),
WHERE m.LoC &gt; 50 AND t.result = "failed"
RETURN uc.name
g.V().hasLabel('UseCase').as('uc')
.in('IMPLEMENTS').as('m').in('TESTS').as('t')
. select ('t' , 'm','uc' ).by(' result ' ).by('LoC').by('name')
. lter fit.get ()[ 't' ] == 'failed' &amp;&amp; it.get()[ 'm'] &gt; 50g.select('uc')
(c) Cypher (d) Gremlin
Fig. 2: Example query: "Find all uses cases implemented by a method with &gt; 50
lines of code that have failed test cases" using the four selected query languages.
e. g., use cases or tests; Ë retrieving relations between artifacts, e.g., methods
implementing use cases; Ì selecting artifact and trace properties to be part of
a query's result, e.g., the name of a component; Í ltering artifacts based on
their properties satisfying a predicate, e. g., methods with more than 50 lines of
code; and Î applying aggregation functions, e. g., counting artifacts.
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Example trace queries</title>
        <p>We populated sqlite5 as well as matching Neo4j databases with example artifacts
and trace links according to the TIMs extracted per participant. We used these
to manually execute all queries formulated by our participants, the written SQL
statements and the ones generated from the VTML representation on the sqlite
database; and the Cypher and Gremlin statements on the Neo4j database.
Below, we discuss two trace queries selected from the set gained in the user survey.
Query 1: Which use cases implemented by method(s) with more than 50 lines of
source code have failed test cases.</p>
        <p>
          Finding use cases with failed tests is an important operation. Additionally, a
lter is applied selecting only methods that are considered complex based on the
lines of code as basic measure. The resulting queries for all selected languages
are shown in Figure 2. The query involves multiple artifacts (use case, method,
and test) as well as traces. In fact, traceability queries deal to a large extent with
the existence of traces between artifacts [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. In SQL, querying relations requires
multiple join statements and in depth knowledge about the data's schema. Since
the syntax of SQL aims to resemble natural English language, the remaining
part of query is obvious. VTML's graphical notation is comprehensible to
everybody with a basic knowledge of UML. Artifacts and traces are easy to spot,
because of its declarative, query by example approach. Cypher is very similar to
SQL, because of the borrowed keywords and syntax. The major di erence being
the powerful MATCH clause, used to describe paths in the graph, in our context
5 https://sqlite.org/
WITH RECURSIVE child(ID, depth) AS (
SELECT ID, 0 AS depth FROM `Feature`
UNION
SELECT link.SourceID, depth + 1
FROM child, `Link Feature Feature` AS link, `Feature`
WHERE child.ID = link.TargetID
AND `Feature`.ID = link.SourceID
)
SELECT `Feature`.Name, MAX(depth) AS max depth
FROM `Feature`
INNER JOIN child ON child.ID = `Feature`.ID
GROUP BY `Feature`.ID HAVING max depth &gt; 1
MATCH p = (f:Feature) [:REFINE ] &gt;(:Feature)
WITH f, length(p) AS depth WHERE depth &gt; 1
RETURN f.name, depth
        </p>
        <p>
          (b) Cypher
g.V().hasLabel('feature ')
.repeat(out('REFINE')).emit().path().as('s','p')
.mapfit.get(). size()g.as('depth'). lter fit.get() &gt; 2g
. select('s','p','depth').byfit.get(0)g.by().by()
(a) SQL (c) Gremlin
Fig. 3: Example query "Which features have a re nement depth larger than one
and what is their depth?" using three out of the four query languages. Currently
it is not possible to formulate this query with VTML.
traces between artifacts. It substitutes the complicated INNER JOIN clauses used
in SQL. Contrasting all other languages, the Gremlin query uses an imperative
style, i. e., navigating the underlying graph data structure step by step6. It starts
at graph nodes representing use cases (g.V().hasLabel()) and explicitly follows
speci c edges (e. g.,in('IMPLEMENTS')). However, a Gremlin query is actually a
set of Groovy statements, which might be an obstacle for users.
Query 2: Which features have an re nement depth &gt; 1 and what is their depth?
The second query implements the traceability metric depth counting layers that
traceability extends up- and downwards from a given layer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. For our example,
a feature can be re ned multiple times (see self-relation in TIM), which requires
recursive querying to follow a relation arbitrarily often (see Figure 3). With
common table expressions (CTE), a rather recent concept of SQL [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the query is
possible but quite complex with SQL. Currently, it is impossible to formulate
Query 2 with VTML due to its missing notion of recursion. Advanced pattern
matching and variable length relations make it easy to formulate this query in
Cypher. Its core being expressed as (:Feature)-[:REFINE*]-&gt;(:Feature). The
partial expression repeat(out('REFINE')).emit().path() serves the same
purpose in Gremlin.
4
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Conclusions</title>
      <p>
        Overall we tried to formulate 45 trace queries from the user survey in all four
languages. With this knowledge, we conducted an early evaluation using three
evaluation criteria for query languages as de ned by Jarke et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] (see Table 1).
      </p>
      <p>E ort measures the amount of syntactic elements (tokens) the user needs to
enter in order to pose the query. VTML and Cypher perform best in this
category, because of their versatile syntax elements and terse notation. Readability
describes the di culty to understand a query. Being a visual language, VTML is
the clear winner. Gremlin, with its programing language syntax, is hard to read.
6 Gremlin also supports declarative programming using the match() step.</p>
      <p>SQL
VTML
Cypher
Gremlin</p>
      <p>E ort
medium
low
low
medium</p>
      <sec id="sec-4-1">
        <title>Readability</title>
        <p>medium
high
medium
low</p>
      </sec>
      <sec id="sec-4-2">
        <title>Expressiveness</title>
        <p>medium
low
high
medium</p>
        <p>We de ne expressiveness as the ability to successfully state a query as well to
express complex computations in intuitive ways. Cypher with its few but
powerful clauses is the best and VTML the worst, because currently some essential
features are missing and therefore some queries cannot be executed.</p>
        <p>In this paper, we studied the applicability of graph query languages for
requirements traceability analysis. Appropriate query languages are rare and users
are often forced to use generic ones like SQL. Based on a relational table model,
this is not a natural t when queried data are graphs of artifacts connected
by trace links. Therefore we selected two graph query languages (Cypher and
Gremlin) and a visual language (VTML), designed for traceability analysis, and
compared them to SQL. A preliminary user survey collected queries from
engineers working in the eld of requirements analysis. We found that graph query
languages can be successfully applied for traceability analysis. Future work will
provide a more detailed discussion and will include an in depth language
comparison and performance analysis.</p>
        <p>Acknowledgment We are funded by the BMBF grants: 01IS14026A, 01IS16003B,
DFG grant: MA 5030/3-1 and by the EU EFRE/TAB grant: 2015FE9033.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cleland-Huang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gotel</surname>
            ,
            <given-names>O.C.Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hu man Hayes</surname>
            , J., Mader,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Software traceability: Trends and future directions</article-title>
          . pp.
          <volume>55</volume>
          {
          <fpage>69</fpage>
          . ACM Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hull</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jackson</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dick</surname>
          </string-name>
          , J.: Requirements Engineering. Springer-Verlag New York, Inc., New York, NY, USA, 3rd edn. (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Jarke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vassiliou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>A framework for choosing a database query language</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <volume>313</volume>
          {340 (Sep
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Mader,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Cleland-Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.:</surname>
          </string-name>
          <article-title>A Visual Traceability Modeling Language</article-title>
          .
          <source>In: Proc. 13th International Conference on Model Driven Engineering Languages and Systems</source>
          . Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Mader,
          <string-name>
            <given-names>P.</given-names>
            ,
            <surname>Cleland-Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.:</surname>
          </string-name>
          <article-title>A visual language for modeling and executing traceability queries</article-title>
          .
          <source>Software &amp; Systems Modeling</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ),
          <volume>537</volume>
          {553 (Jul
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Melton</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simon</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SQL:1999: Understanding Relational Language Components</article-title>
          . Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ramesh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jarke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Toward reference models for requirements traceability</article-title>
          .
          <source>IEEE Trans. Softw. Eng</source>
          .
          <volume>27</volume>
          (
          <issue>1</issue>
          ),
          <volume>58</volume>
          {93 (Jan
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Rodriguez</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          :
          <article-title>The Gremlin graph traversal machine and language (invited talk)</article-title>
          . pp.
          <volume>1</volume>
          {
          <fpage>10</fpage>
          . ACM Press (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wood</surname>
          </string-name>
          , P.T.:
          <article-title>Query languages for graph databases</article-title>
          .
          <source>SIGMOD Rec</source>
          .
          <volume>41</volume>
          (
          <issue>1</issue>
          ),
          <volume>50</volume>
          {60 (Apr
          <year>2012</year>
          ), http://doi.acm.
          <source>org/10</source>
          .1145/2206869.2206879
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>