<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Natural Language Programming (NLPRO) Project: Turning Text into Executable Code</article-title>
        <subtitle>NLPRO Project Statement: A Tale of Two Disciplines</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Reut Tsarfaty</string-name>
          <xref ref-type="aff" rid="aff1" />
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>The ONLP Research Lab, The Open University of Israel</institution>
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we present the natural language programming (NLPRO) project (funded by ERC-StG-2015 grant 677352), where we strive to automatically translate requirements documents directly into the executable code of the systems they describe. To achieve this, we embrace the ambiguity of NL requirements and define a three-fold research agenda wherein we (i) formalize text-to-code translation as a structure prediction task, (ii) propose a formal semantic representation in terms of Live Sequence Charts (LSCs), and (iii) develop and comparatively evaluate novel sentence-based vs. discourse-based models for semantic parsing of requirements documents, and test their accuracy on various case studies. The empirical results of our first research cycle show that the discourse-based models consistently outperform the sentence-based models in constructing a system that reflects the requirements in the document. We conjecture that the formal representation of LSCs, the joint sentence-discourse modeling strategy, and the statistical learning component are key ingredients for effectively tackling the long-standing NLPRO challenge.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Background: Requirements Engineering using Controlled Languages</title>
      <p>Requirements elicitation is a process whereby a system analyst gathers information from a stakeholder about a desired
system to be implemented. The knowledge collected by the analyst may be static, referring to the conceptual model (the
entities, properties, and possible values), or dynamic, referring to the behavior that the system should follow (who does what to
whom, when, how, etc.) [NE00]. A stakeholder interested in the system typically has a specific static and dynamic domain
in mind, but cannot necessarily prescribe any formal models or code artifacts. The requirements I refer to
here are pieces of discourse in natural language, by means of which a stakeholder communicates their desiderata
to the analyst. The role of the system analyst is to understand the different requirements and transform them into formal
constructs, diagrams, or executables. Moreover, the analyst needs to consolidate the different pieces of information to
uncover a single shared domain.</p>
      <p>To streamline this process, studies in software engineering aim to develop intuitive symbolic systems, often termed
controlled natural languages (CNL), with which human agents can encode requirements which would then be unambiguously
translated into formal or executable artifacts [FS95, BL02]. Gordon and Harel [GH09] for instance define a CNL that can
be used for specifying requirements which can be effectively translated into live sequence charts (LSC) [DH01, HM03], a
formal language for specifying the dynamic behavior of reactive systems. However, the grammar that underlies this
language fragment is highly ambiguous, and all disambiguation has to be carried out manually, which
in turn makes the process slow, unintuitive, and time-consuming. This particular aspect of their work reflects a rather
general phenomenon: the more natural a symbolic system, or a CNL, is, the harder it is to develop an unambiguous translation
engine for it [Kuh14]. As a result, much work on CNL parsing for requirements documents either requires a human in the loop
or drastically narrows down the space of allowed utterances in order to avoid any ambiguity.</p>
    </sec>
    <sec id="sec-2">
      <title>The Proposal: Semantic Parsing using Content and Context</title>
      <p>In this project we accept the ambiguity of requirements descriptions as a premise, and aim to directly address the challenge
of automatically recovering a single unambiguous formal representation of the complete system by parsing the requirements
document: one that best reflects the human-perceived interpretation of the document.</p>
      <p>Recent advances in natural language processing (NLP) define semantic parsing as the task of automatically assigning
informal natural language utterances a formal, unambiguous representation of meaning. The formal output representation
is often task-specific: for example, Zettlemoyer and Collins [ZC05], Liang et al. [LJK11], Artzi and
Zettlemoyer [AZ13], and Liang and Potts [LP14] use different formalisms and various kinds of statistical learning signals to
support the automatic assignment of meaning in different tasks. In particular, the model of Lei et al. [LLBR13] induces
input parsers from format descriptions, and Kushman and Barzilay [KB13] automatically convert free textual descriptions
into regular expressions. However, these models interpret only short and local instructions, and they rarely take into
account the entire document in order to deliver a formal description of a complete system.</p>
      <p>Here we cast the interpretation of requirements documents as a structure prediction task, where we accept a piece
of discourse as input and aim to automatically predict a formal model of the static and dynamic domain as output. We
currently assume that the input requirements document is given in the simple, yet highly ambiguous, fragment of English
of Gordon and Harel [GH09].1 The output, in contrast, is a sequence of unambiguous and well-formed formal constructs
that represent the dynamic behavior of the system, called live sequence charts (LSC) [DH01, HM03], tied to a single shared
code-base called a system model (SM).</p>
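      <p>To make the target representation concrete, the following is a minimal sketch, not the project's actual code, of how an LSC can be modeled as a prechart (the monitored trigger scenario) paired with a main chart (the mandated reaction); the class names, field names, and the sample requirement are illustrative assumptions.</p>

```python
from dataclasses import dataclass

@dataclass
class Event:
    """A message passed between two objects in the system model."""
    sender: str
    receiver: str
    message: str

@dataclass
class LSC:
    """A live sequence chart: whenever the prechart scenario is
    observed, the main-chart scenario must follow."""
    prechart: list[Event]   # monitored trigger
    mainchart: list[Event]  # mandated reaction

# Hypothetical requirement: "When the user clicks the button,
# the display shows the time."
chart = LSC(
    prechart=[Event("user", "button", "click")],
    mainchart=[Event("button", "display", "show the time")],
)
```

      <p>A semantic parser for this task must produce one such well-formed chart per requirement, with all sender and receiver names grounded in the single shared system model.</p>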
      <p>The key idea we promote in this work is that discourse context provides substantial disambiguating information for
the semantic interpretation of individual requirements in the document. We present a novel system for automatically
translating the requirements into executable artifacts based on a joint sentence-level and discourse-level probabilistic
generative model. The solution we present takes the form of a hidden Markov model (HMM) in which emission probabilities
are calculated in CKY charts, via a probabilistic grammar, to reflect the grammaticality and interpretability of each individual textual
requirement, and transition probabilities model the overlap between SM snapshots of a single, shared domain.
Using efficient Viterbi decoding, we search for the sequence of domain snapshots most likely to have generated the
entire document. We empirically show that such a joint model consistently outperforms a sentence-based model learned
from the same set of data.2</p>
      <p>1This version assumes a fragment of English grammar with a closed set of function words and an unlimited lexicon of open-class categories (verbs,
nouns, and adjectives).</p>
      <p>2For further information on the model as well as thorough experimental results, we kindly refer the reader to Tsarfaty et al. [TPW+14] and references
therein.</p>
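      <p>As an illustration of the decoding step described above, the following is a minimal sketch of Viterbi decoding over a sequence of sentences. It assumes the emission and transition scores, which in our model come from CKY parse charts and SM-snapshot overlap respectively, are already given as plain probability tables; the function, the state names, and the toy numbers are illustrative assumptions, not the project's actual implementation.</p>

```python
import math

def viterbi(sentences, states, prior, transition, emission):
    """Return the state sequence most likely to have generated the sentences.

    states:     candidate system-model (SM) snapshots
    prior:      prior[s] = P(first snapshot is s)
    transition: transition[p][s] = P(snapshot s follows snapshot p)
    emission:   emission[s][sent] = P(sentence sent | snapshot s)
    """
    # V[s] = log-probability of the best path ending in state s.
    V = {s: math.log(prior[s] * emission[s][sentences[0]]) for s in states}
    backptr = []
    for sent in sentences[1:]:
        prev_V, V, back = V, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev_V[p] + math.log(transition[p][s]))
            V[s] = prev_V[best] + math.log(transition[best][s] * emission[s][sent])
            back[s] = best
        backptr.append(back)
    # Backtrack from the best final state.
    path = [max(states, key=V.get)]
    for back in reversed(backptr):
        path.append(back[path[-1]])
    return path[::-1]

# Toy example: two candidate snapshots, two requirement sentences.
states = ["snap_A", "snap_B"]
prior = {"snap_A": 0.6, "snap_B": 0.4}
transition = {"snap_A": {"snap_A": 0.7, "snap_B": 0.3},
              "snap_B": {"snap_A": 0.4, "snap_B": 0.6}}
emission = {"snap_A": {"s1": 0.9, "s2": 0.1},
            "snap_B": {"s1": 0.2, "s2": 0.8}}
print(viterbi(["s1", "s2"], states, prior, transition, emission))
# prints ['snap_A', 'snap_B']
```

      <p>In the model described above, the role of the emission table is played by per-sentence CKY chart scores under the probabilistic grammar, and the role of the transition table by the overlap between consecutive SM snapshots.</p>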
    </sec>
    <sec id="sec-3">
      <title>Natural Language Programming: Conclusions and Perspectives</title>
      <p>The contribution of the first development cycle of the NLPRO project is three-fold: (i) we formalize text-to-code
translation as a structure prediction task, (ii) we propose a formal semantic representation with well-defined grounding for
RE, and (iii) we empirically evaluate sentence-based and discourse-based models for semantic parsing of requirements. We
show a consistent improvement of discourse-based over sentence-based models in all case studies. In the future, we intend
to extend this model to interpret requirements in unrestricted English, endowed with a more sophisticated discourse
interpretation function.</p>
      <p>All in all, the automatic interpretation of requirements documents presents an exciting challenge for both natural
language processing (NLP) and software engineering (SE). In NLP, effectively addressing text-to-code translation requires
fundamentally rethinking the nature of semantic representation, the scope of interpretation, and how the automatic
discovery of the entities, actions, conditions, temporal markers, constraints, execution modalities, etc., can lead to a single coherent
storyline/system description. In SE, the availability of tools for automatic analysis and interpretation of requirements has
the potential to introduce a huge leap in the speed and accuracy of system development. It is hoped that this prospect
will lead to novel methodologies of requirements engineering, in which humans and machines offer
complementary contributions towards maximally effective and efficient development.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work is kindly supported by the European Research Council, ERC-StG-2015 grant 677352, for which we are grateful.</p>
    </sec>
    <sec id="sec-5">
      <title>References</title>
      <p>[BL02] B. Bryant and B. S. Lee. Two-level grammar as an object-oriented requirements specification language. In
Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS02). IEEE Computer
Society, 2002.</p>
      <p>[DH01] W. Damm and D. Harel. LSCs: Breathing life into message sequence charts. Formal Methods in System Design,
19(1):45–80, July 2001.</p>
      <p>[Dij79] Edsger W. Dijkstra. On the foolishness of "natural language programming". In Friedrich L. Bauer, Manfred
Broy, E. W. Dijkstra, S. L. Gerhart, D. Gries, M. Griffiths, J. V. Guttag, J. J. Horning, S. S. Owicki, C. Pair,
H. Partsch, P. Pepper, M. Wirsing, and H. Wössner, editors, Program Construction, volume 69 of Lecture
Notes in Computer Science, pages 51–53. Springer Berlin Heidelberg, 1979.</p>
      <p>[FS95] N. E. Fuchs and R. Schwitter. Attempto: Controlled natural language for requirements specifications. In
Markus P. J. Fromherz, Marc Kirschenbaum, and Anthony J. Kusalik, editors, LPE, 1995.</p>
      <p>[GH09] M. Gordon and D. Harel. Generating executable scenarios from natural language. In Proceedings of the 10th
International Conference on Computational Linguistics and Intelligent Text Processing, CICLing '09, pages
456–467, Berlin, Heidelberg, 2009. Springer-Verlag.</p>
      <p>[HM03] D. Harel and R. Marelly. Come, Let's Play: Scenario-Based Programming Using LSCs and the Play-Engine.
Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2003.</p>
      <p>[KB13] N. Kushman and R. Barzilay. Using semantic unification to generate regular expressions from natural
language. In Proceedings of NAACL-HLT, pages 826–836, 2013.</p>
      <p>[Kuh14] T. Kuhn. A survey and classification of controlled natural languages. Computational Linguistics, 40(1):121–
170, 2014.</p>
      <p>[LJK11] P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Proceedings of
ACL, pages 590–599, 2011.</p>
      <p>[LLBR13] T. Lei, F. Long, R. Barzilay, and M. C. Rinard. From natural language specifications to program input parsers.
In Proceedings of ACL, pages 1294–1303, 2013.</p>
      <p>[LP14] P. Liang and C. Potts. Bringing machine learning and compositional semantics together. Annual Review of
Linguistics (submitted), 2014.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [NE00]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nuseibeh</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Easterbrook</surname>
          </string-name>
          .
          <article-title>Requirements engineering: A roadmap</article-title>
          .
          <source>In Proceedings of ICSE</source>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [TPW+14]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tsarfaty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pogrebezky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Natan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Szekely</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Harel</surname>
          </string-name>
          .
          <article-title>Semantic parsing using content and context: A case study from requirements elicitation</article-title>
          .
          <source>In Proceedings of EMNLP</source>
          , pages
          <fpage>1296</fpage>
          -
          <lpage>1307</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [ZC05]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          .
          <article-title>Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars</article-title>
          .
          <source>In UAI</source>
          , pages
          <fpage>658</fpage>
          -
          <lpage>666</lpage>
          . AUAI Press,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>