Taxonomy of Flexible Linguistic Commitments

Vadim Zaytsev
Universiteit van Amsterdam, The Netherlands, vadim@grammarware.net

Abstract. Besides strict linguistic commitments of models to metamodels, documents to schemata, programs to grammars and activities to protocols, we often require or crave more flexible commitments. Some such types of flexible commitment are well understood and even formalised; others are considered harmful. In this paper, we make an attempt to classify these types from the language theory point of view, provide concrete examples from software language engineering for each of them, and estimate their usefulness and possible harm.

1 Introduction

In software language engineering people often speak of linguistic commitment, structural commitment or commitment to grammatical structure [4]. Examples include:

◇ well-formed schema-less XML documents that commit to containing tags with allowed names and in proper combinations;
◇ XML documents that commit to the structure explicitly specified in the DTD or XSD;
◇ models that conform to their metamodel;
◇ programs in high-level programming languages that commit to the structure specified by a grammar of such a language and by other constraints of the designated compiler;
◇ programs that rely on some library and have to commit to calling its functions in proper order and with properly typed and ordered arguments;
◇ communicating entities that commit to sending and receiving messages according to the chosen protocol and must respect it in order to achieve compatibility and interoperability.

Any form of flexibility in such commitments is usually either absent or left unconsidered. In this paper we propose to model it explicitly and to classify its forms into several categories, varying in usefulness and in their ability to lead to robust systems or to catastrophes.

The paper is organised as follows: §2 states the problem more clearly and formally. Based on that, §3 introduces three kinds of flexible language variations. Then, §4 classifies and explains all flexibly committing mappings within the framework. §5 proposes five kinds of streamlining helper mappings around the same language. §6 gives a preliminary outlook on higher order manipulations such as composition of mappings, their extension/restriction and the construction of inverse functions. §7 concludes the paper.

2 Problem Statement

Consider a mapping f ∶ L1 → L2 whose domain is a software language L1 and whose codomain is a software language L2 (they can be the same in many cases, but let us look at the general case). Following the principle of least surprise, we could assume that f is surjective and total (i.e., its image fully coincides with its codomain and its preimage with its domain), so that it maps every element of L1 to some element of L2 and every element of L2 is reachable. By flexible linguistic commitment we will understand a situation when this expectation is violated.
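To make this expectation concrete, consider a minimal sketch (not from the paper; all names and the encoding are our own assumptions) that models software languages extensionally as membership predicates and checks, over finite samples, whether a mapping looks total on L1 and surjective onto L2:

```haskell
import Data.Maybe (isJust, mapMaybe)

-- A software language, modelled extensionally as a membership predicate.
type Language a = a -> Bool

-- A mapping that may be undefined on some of its inputs.
type Mapping a b = a -> Maybe b

-- Over finite samples we can at least falsify the least-surprise expectation:
-- f should be defined on every sampled element of L1 (totality)...
looksTotal :: Language a -> Mapping a b -> [a] -> Bool
looksTotal l1 f = all (isJust . f) . filter l1

-- ...and every sampled element of L2 should be reachable (surjectivity).
looksSurjective :: Eq b => Language b -> Mapping a b -> [a] -> [b] -> Bool
looksSurjective l2 f sample = all (`elem` mapMaybe f sample) . filter l2

main :: IO ()
main = do
  let evens = even :: Language Int       -- L1: the even numbers
      small = (< 11) :: Language Int     -- L2: the numbers below 11
      f x = if even x then Just (x `div` 2) else Nothing
  print (looksTotal evens f [0 .. 20])                 -- True
  print (looksSurjective small f [0 .. 20] [0 .. 20])  -- True: 0..10 reached
```

Such sampling can of course only falsify, never prove, totality and surjectivity; it merely operationalises the expectation that the rest of the paper relaxes.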
3 Variations in Software Languages

Consider three variants based on a language L:

◇ By Ľ we will denote a language that is strictly smaller than L: it has more constraints and fewer elements. Committing to a subset of a language is in general not that harmful and means limited applicability of the tool or mapping.
◇ By L̂ we will denote a similarly strict superset of L, which then has more elements and fewer constraints. Committing to a bigger language than intended is considered good practice in some areas that value robustness highly [7]. However, here we consider only “accidental” violations that extend the intended language in a way that preserves the semantics of the language instances up to heuristics that make sense for the problem domain.
◇ By L̃ we will denote another superset of L that involves semantic changes: instances from L̃ are not just outside L, they also allow interpretations that are incorrect according to the semantics of L. The existence of L̃ is the main cause for critique of robustness guidelines [1].

In order to stay universally applicable, we consider all definitions to be specific to the problem domain of the software language (without loss of generality, so-called general purpose programming and modelling languages are assumed to belong to the problem domains of Turing-complete algorithms and the everything-is-a-model paradigm, respectively).

For example, unclosed tags are indicators of ĤTML, since the HTML language is meant to be processed in a soft and extremely flexible way, yet they are indicators of X̃ML in most cases except the absolutely trivial ones. Say, one could argue that an XML document that is well-formed except for one unclosed empty element is in X̂ML, since that element can be treated as self-closing. However, a document with an unclosed element around character data c is already in X̃ML, since it has two intuitively good resolutions: one closing the element after c and one closing it before c. Yet, in a subdomain of XML that deals exclusively with empty or complex elements, but never with CDATA or mixed content, the ambiguity disappears, and such a document would also belong to X̂ML′.
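The three variants can be illustrated on a deliberately tiny language (our own toy example, not from the paper): L is the language of unsigned decimal literals; Ľ forbids leading zeros; L̂ anticipates surrounding whitespace, which is semantics-preserving to trim; L̃ admits the letter 'O', which may or may not stand for the digit zero:

```haskell
import Data.Char (isDigit, isSpace)
import Data.List (dropWhileEnd)

-- L: non-empty strings of decimal digits.
inL :: String -> Bool
inL s = not (null s) && all isDigit s

-- Ľ: a strict subset of L, canonical literals without leading zeros.
inLsub :: String -> Bool
inLsub s = inL s && (s == "0" || head s /= '0')

-- L̂: an anticipated superset; surrounding whitespace is tolerated,
-- since trimming it preserves the meaning of the literal.
inLhat :: String -> Bool
inLhat = inL . dropWhile isSpace . dropWhileEnd isSpace

-- L̃: an unanticipated superset; the letter 'O' may stand for the digit
-- zero or for a typo, so instances admit semantically different readings.
inLtilde :: String -> Bool
inLtilde s = not (null s) && all (\c -> isDigit c || c == 'O') s

main :: IO ()
main = mapM_ print
  [ inLsub   "42"    -- True:  canonical
  , inL      "042"   -- True:  in L, but not in Ľ
  , inLhat   " 42 "  -- True:  anticipated deviation, meaning preserved
  , inLtilde "4O2"   -- True:  unanticipated: a zero? a typo? ambiguous
  , inLhat   "4O2"   -- False: not covered by the anticipated superset
  ]
```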
| mapping to \ mapping from | Ľ1 → …                          | L1 → …                              | L̂1 → …          | L̃1 → …                      |
| … → Ľ2                   | correct program, wrong language | non-surjectivity, conservativeness | robustness      | overrobustness              |
| … → L2                   | partial applicability           | assumed / perfect                  | fault recovery  | overrecovery                |
| … → L̂2                   | antirobustness                  | liberality                         | fault tolerance | overtolerance, semirecovery |
| … → L̃2                   | fault introduction              | fault introduction                 | shotgun effect  | linguistic ignorance        |

Table 1. Forms of linguistic commitments of mappings with a domain L1 and codomain L2, classified by their preimages Ľ1 ⊂ L1 ⊂ L̂1 and images Ľ2 ⊂ L2 ⊂ L̂2. Additionally, L̃ is a special case where L̃i ⊃ Li, but the difference L̃1 ∖ L1 is unanticipated and L̃2 ∖ L2 is unintended.

4 Variations in First Order Mappings

Table 1 presents all possible preimage/image combinations. Let us consider each of them in turn more closely.

◇ f ∶ Ľ1 → Ľ2 (correct program, wrong language)
Some mappings that look like they have flexible linguistic commitments are in fact (possibly) correct mappings working on a different language than expected. This kind of “flexibility” is easily fixable by correcting the specification.

◇ f ∶ Ľ1 → L2 (partial mapping, limited applicability)
Theoretically this scenario corresponds to the notion of a partial function. In practice it can be well disguised in a wrapper that extends f to behave like an identity function on L1 ∖ Ľ1. If this is not done, this kind of flexibility is not flexible at all: it actually means that in order to use f, one needs to normalise its inputs in a stricter way than documented. In certain cases it manifests itself as a works-on-my-machine antipattern, when such undocumented constraints concern configuration management and deployment details.

◇ f ∶ Ľ1 → L̂2 (antirobustness, ungratefulness)
If robustness (see below) is to expect less and provide more, then antirobustness is its exact opposite: the demands on the input in this case are higher than officially specified, and even when they are met, the output is ungratefully relaxed.

◇ f ∶ Ľ1 → L̃2 (fault introduction)
If antirobust mappings can damage syntactic conformity, this variant can also do semantic damage. Such a tool is flexible beyond reason and inherently faulty: while holding surprisingly high expectations of its inputs, it generates outputs that are outright wrong.

◇ f ∶ L1 → Ľ2 (non-surjective mapping, conservativeness)
Conservative mappings, which transform a language instance into an instance with an even stronger linguistic commitment, are not surjective. For cases of L1 = L2 they are called language reductions and represent a form of traditional normalisers without extra tolerance for input violations. Most code generators are conservative: they cover the entire input language, but there exist possible target language programs that will never be generated. Classic canonical normal forms in grammarware and databases are also conservative mappings of this kind: they are proven to exist for all inputs and infer standardised, provably equivalent counterparts of them. Formally, this is one of the best kinds of flexibility.

◇ f ∶ L1 → L2 (assumed / perfect)
We list this form of commitment for the sake of completeness, but in fact it represents no flexibility: this is the standard commitment to grammatical structure [4], more or less precisely defined and precisely obeyed.

◇ f ∶ L1 → L̂2 (liberality)
In mathematics, a function is under no circumstances allowed to return a value outside its codomain, but from our point of view this variant is a conceptual sibling of the conservativeness discussed above, which corresponds to the lack of surjectivity. For the case of L1 = L2 this is called language extension, and it may be intentional: consider a coupled transformation sequence representing language evolution (if the language is backwards compatible, it will only contain language-preserving and language-extending operators), or some kind of refinement/enrichment plan that recognises patterns of language use and transforms them into constructs of some more sophisticated language.

◇ f ∶ L1 → L̃2 (fault introduction)
What classifies mappings of this category is the lack of anticipation: such a mapping was probably meant to be a Ľ1 → Ľ2 mapping, and as it turns out, it does not work correctly on unanticipated instances from L ∖ Ľ. One typical example is refactoring engines that claim to cover some programming language but in fact work correctly only on its subset, yielding irregular results whenever certain concurrency or subtyping patterns are involved [3].

◇ f ∶ L̂1 → Ľ2 (robustness)
This is precisely the model of “be conservative in what you do, be liberal in what you expect from others”, also known as Postel’s Robustness Principle [7], or at least its harmless implementations. The input language is extended to a safe superset of L1, but the output language is limited to a subset of L2. Many normalisers work this way, accepting mostly valid entities of a language and mapping them onto a strict subset of the same language. For example, BibSLEIGH [15], a project involving a large JSON database, has a normaliser that traverses all JSON files, including those that have trailing commas in lists or dictionaries, as well as those with naïve quoting, and transforms each of them into an LRJ file (Lexically Reliable Json), which is basically valid JSON with extra guarantees of one key-value pair per line and lines being sorted lexicographically. (A sketch of such a normaliser appears after this list.)

◇ f ∶ L̂1 → L2 (recovery)
A mapping that accepts slightly more than obligatory yet remains true to its output language represents error recovery: in the simplest case of L1 = L2 this may be the only purpose of the mapping, but it does not have to be. For example, consider notation-parametric grammar recovery, a technique that takes a human-written, error-containing language document and a definition of the format it is supposed to have, and yields a well-formed grammar extracted from it [12]. Most of its heuristics are such mappings. Approximate pattern matching [6] and semiparsing [13] also belong to the same group, as do screen-scraping libraries like Beautiful Soup [8].

◇ f ∶ L̂1 → L̂2 (tolerance)
In the terminology of negotiated mappings, the previous kind of flexible commitment represented adaptation through adjustment; this one is adaptation through tolerance [14]. Such a mapping still needs to cope with unexpected input, but has the option of propagating the unexpected parts further down the pipeline instead of fixing them.

◇ f ∶ L̂1 → L̃2 (shotgun effect)
This kind of mapping has been identified in the technological space of internet security and named “shotgun parsing” there [2]. A shotgun parser is a program that is meant to check its input for linguistic commitment and sanitise it, yet in the interest of performance optimisation does not perform a thorough check and limits itself to fundamentally flawed approaches, such as regular means of treating context-free languages or plain string substitution for escaping; as a result, the system becomes vulnerable to malevolent input (comment bombs, SQL injections, etc.). Every shotgun parser in the pipeline increases the span of possible treatments of data, hence the shotgun metaphor.

◇ f ∶ L̃1 → Ľ2 (overrobustness)
A few paragraphs above we reintroduced robustness as input type contravariance and output type covariance. Overrobustness does the same but crosses the syntax-semantics border and hence becomes dangerously ambiguous, nondeterministic and in general error-prone. Many examples of overrobustness leading to security bugs were given by Meredith Patterson et al. at http://langsec.org [2,9].

◇ f ∶ L̃1 → L2 (overrecovery)
Overrecovery is a process of applying heuristic-based fixes to semantically incorrect language instances with the goal of returning to the intended output language. Some aggressive heuristics, like parenthesis matching from notation-parametric grammar recovery [12], fall into this category, as do desperate matches in semiparsing [13].

◇ f ∶ L̃1 → L̂2 (overtolerance, semirecovery)
Overtolerance is a form of harmful error handling where semantic errors in the input are presumably fixed, but the output is still not always guaranteed to be syntactically correct.

◇ f ∶ L̃1 → L̃2 (ignorance)
The ultimate form of flexible linguistic commitment is complete linguistic ignorance: our inputs can be broken beyond any hope of unambiguous repair, and the outputs are not much better in that respect. All lexical analysis methods belong to this category and are still often used due to their extreme speed, ease of development and virtually unlimited flexibility, despite their limited expressiveness and abundant false positives and negatives. Some migration and analysis projects of considerable size have been reported as completed with language-ignorant lexical methods [5,10].
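As a small illustration of the robustness cell, here is a toy Postel-style normaliser loosely in the spirit of the BibSLEIGH example above; the "key = value" language and all function names are invented for this sketch and are not the actual BibSLEIGH implementation:

```haskell
import Data.Char (isSpace)
import Data.List (sort, dropWhileEnd)

-- A toy Postel-style normaliser, f ∶ L̂1 → Ľ2: liberal in what it accepts
-- (extra whitespace, trailing commas, blank lines, unsorted keys) and
-- conservative in what it emits (one canonical pair per line, sorted).
normalise :: String -> String
normalise = unlines . sort . map cleanLine . filter (not . blank) . lines
  where
    blank = all isSpace
    trim  = dropWhile isSpace . dropWhileEnd isSpace
    -- tolerate an anticipated trailing comma on the value...
    dropComma = dropWhileEnd (== ',') . trim
    -- ...but emit only the canonical "key = value" shape.
    cleanLine l = case break (== '=') l of
      (k, '=' : v) -> trim k ++ " = " ++ trim (dropComma v)
      (k, _)       -> trim k  -- no '=': keep the trimmed line as-is

main :: IO ()
main = putStr (normalise "  year = 2015 ,\n\ntitle=BibSLEIGH,\n")
-- prints:
-- title = BibSLEIGH
-- year = 2015
```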
5 Streamlining Mappings

Before we try to compose flexibly committed mappings and consider higher order combinators, we need to introduce a special subcategory of streamlining mappings that help us “get back” to L or even Ľ whenever we deviate. In particular, we need the following five (summarised in Table 2):

|        | L → …           | L̂ → …             | L̃ → …            |
| … → Ľ | canoniser (can) | normaliser (norm) | regulator (reg)  |
| … → L | id              | codifier (cod)    | calibrator (cal) |

Table 2. Symbols and names for streamlining mappings.

◇ A canoniser (can ∶ L → Ľ) has a normal (precise) commitment to L and produces a strict subset of it. Typically this is some kind of canonical normal form that makes sense for the chosen technological space; if it is strictly canonical, there is usually a proof of its uniqueness up to L.
◇ A codifier (cod ∶ L̂ → L) is flexible with its input because it applies certain recovery rules until the output conforms to all expectations of L.
◇ A normaliser (norm ∶ L̂ → Ľ) implements a classic Postel-style normalisation scheme that we have briefly discussed before: it is liberal with respect to its input and conservative with respect to its output. In our formalisation it also means that norm ≡ can ○ cod (a normaliser is equivalent to the superposition of a codifier and a canoniser; see the sketch after this list), and indeed, any normaliser can be conceptually studied as a composition of two parts implementing the liberality and the conservativeness accordingly. We will consider superposition of flexibly committing functions in the next section in more detail.
◇ Calibrators (cal ∶ L̃ → L) and regulators (reg ∶ L̃ → Ľ) can be decomposed similarly in various ways, but the key when considering them is to remember that the main distinction between L̂ and L̃ is that L̂ ∖ L is anticipated and L̃ ∖ L is not. Hence, when something is not anticipated, it cannot be casually accounted for by an automated streamliner. In practice, developing calibrators and regulators requires the same steps as developing codifiers and normalisers, with additional effort dedicated to testing and documenting recovery heuristics, which are usually unreliable.
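A minimal sketch of the streamliners over the toy literal language from §3 (the names can, cod and norm are the ones introduced above; the concrete recovery heuristics are our own assumptions), demonstrating norm ≡ can ○ cod:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- codifier, cod ∶ L̂ → L: undo the anticipated whitespace deviation.
cod :: String -> String
cod = dropWhile isSpace . dropWhileEnd isSpace

-- canoniser, can ∶ L → Ľ: strip leading zeros to reach the canonical form.
can :: String -> String
can s = case dropWhile (== '0') s of
  ""  -> "0"  -- all zeros: keep a single canonical zero
  s'  -> s'

-- normaliser, norm ∶ L̂ → Ľ: by construction, norm ≡ can ○ cod.
norm :: String -> String
norm = can . cod

main :: IO ()
main = print (norm "  0042 ")  -- "42"
```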
6 Manipulating Flexible Mappings

6.1 Extension and Restriction

For every possible combination of input and output types (strictly speaking, for any input type, since the output type is irrelevant here), a restriction to L is defined straightforwardly as a function that returns whatever the original function would have returned, for all inputs from the restricted set, and is undefined otherwise. An extension of a function deliberately committing to a subset of the intended language, on the other hand, cannot be defined in a general fashion. Hence, for any f whose preimage is L, L̂ or L̃, we get a restriction f∣L for free, but an extension to L of some f ∶ Ľ → L′ is, generally speaking, unknown. A more interesting observation is that restriction can affect the flexibility of the linguistic commitment to the output language as well: for example, if f ∶ L̂1 → L2, then f∣L1 ∶ L1 → L2 or possibly f∣L1 ∶ L1 → Ľ2. This has bad consequences for calculating inverses of restrictions.

6.2 Inverse Functions

Due to the nature of our formalisation, which considers preimages and images instead of domains and codomains, all functions that map between L1, Ľ1, L̂1 and L2, Ľ2, L̂2 have straightforward inverse mappings. Their (non-unique) existence is guaranteed: the proof is a direct consequence of the definitions of a preimage and an image. For example, for any f ∶ Ľ1 → L̂2 there is at least one f⁻¹ ∶ L̂2 → Ľ1. Inverses of restrictions can be more desirable, but less reliable, since there is no guarantee that, for an arbitrary f ∶ L̂1 → L̂2, the image of the restriction intersects L2, i.e., that f∣L1(L1) ∩ L2 ≠ ∅.
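Restriction, and a (non-unique) inverse over a finite sample, can be sketched directly; this is our own minimal encoding, and a real inverse would of course not enumerate a sample:

```haskell
type Language a = a -> Bool

-- Restriction as in §6.1: behave exactly like f on the restricted
-- language, and be explicitly undefined (Nothing) everywhere else.
restrict :: Language a -> (a -> b) -> (a -> Maybe b)
restrict l f x = if l x then Just (f x) else Nothing

-- A (non-unique) inverse as in §6.2, over a finite sample of the
-- preimage: return any witness that f maps to the given element.
invert :: Eq b => [a] -> (a -> b) -> (b -> Maybe a)
invert sample f y = case [x | x <- sample, f x == y] of
  (x : _) -> Just x
  []      -> Nothing

main :: IO ()
main = do
  print (restrict even (`div` 2) 8)  -- Just 4
  print (restrict even (`div` 2) 7)  -- Nothing: outside the restriction
  print (invert [0 .. 20] (* 2) 14)  -- Just 7: one witness from the preimage
```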
6.3 Composition

Table 3 summarises the resulting superpositions of two flexible mappings with respect to all possible kinds of mappings.

| f ⊚ g (f applied second) | g ∶ … → Ľ                               | g ∶ … → L   | g ∶ … → L̂                               | g ∶ … → L̃   |
| f ∶ Ľ′ → …              | f ○ g, if Ľ ⊆ Ľ′; f ○ can ○ g otherwise | f ○ can ○ g | f ○ norm ○ g                            | f ○ reg ○ g |
| f ∶ L → …               | f ○ g                                   | f ○ g       | f ○ cod ○ g                             | f ○ cal ○ g |
| f ∶ L̂′ → …              | f ○ g                                   | f ○ g       | f ○ g, if L̂ ⊆ L̂′; f ○ cod ○ g otherwise | f ○ cal ○ g |
| f ∶ L̃′ → …              | f ○ g                                   | f ○ g       | f ○ cod ○ g                             | f ○ cal ○ g |

Table 3. Superposition of two mappings with flexible commitments (f ⊚ g), expressed as a normal superposition of f, g and possibly some streamliners. Primes (′) simply denote non-uniqueness: for example, L̂ may be a different superset of L than L̂′.

The most obvious case is the following:

⊚ ∶ (L2 → L3) → (L1 → L2) → (L1 → L3)

However, the following compositions are also universally safe (variations of L2 correspond to variations of L in the table):

⊚ ∶ (L2 → L3) → (L1 → Ľ2) → (L1 → L3)
⊚ ∶ (L̂2 → L3) → (L1 → Ľ2) → (L1 → L3)
⊚ ∶ (L̃2 → L3) → (L1 → Ľ2) → (L1 → L3)
⊚ ∶ (L̂2 → L3) → (L1 → L2) → (L1 → L3)
⊚ ∶ (L̃2 → L3) → (L1 → L2) → (L1 → L3)

The result type of the compositions given above is a possible overapproximation, because technically we combine g with f∣L2 or f∣Ľ2, not with f itself, so it is possible for f ⊚ g with the codomain L3 to have the image Ľ3. The rest often require streamliners, with two special cases stemming from the fact that in our notation L̂ expresses any superset of L and not a particular one. Thus, it can be the case that Ľ2 ⊆ Ľ′2, and then

⊚ ∶ (Ľ′2 → L3) → (L1 → Ľ2) → (L1 → L3)

is expressible with a simple superposition, f ⊚ g = f ○ g, but otherwise we need an Ľ′2-canoniser to glue them together: f ⊚ g = f ○ can ○ g. Similarly to the safe case discussed above, the image of such a composition can be Ľ3 if Ľ2 ≠ Ľ′2.
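One cell of Table 3 rendered as code (our own toy instantiation, with invented mappings f and g): g overshoots into a superset L̂2, f commits precisely to L2, and a codifier glues them into f ⊚ g = f ○ cod ○ g:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- g lands in a superset L̂2 of the intended language: its output
-- may carry anticipated whitespace padding.
g :: Int -> String
g n = "  " ++ show n ++ " "

-- cod ∶ L̂2 → L2: the streamliner that removes the padding.
cod :: String -> String
cod = dropWhile isSpace . dropWhileEnd isSpace

-- f commits precisely to L2 (unpadded digit strings).
f :: String -> Int
f = length

-- The Table 3 cell for (f ∶ L → …) ⊚ (g ∶ … → L̂): glue with a codifier.
composeWithCod :: (String -> Int) -> (Int -> String) -> (Int -> Int)
composeWithCod f' g' = f' . cod . g'

main :: IO ()
main = print (composeWithCod f g 2015)  -- 4: padding removed before f runs
```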
7 Conclusion

This paper is an attempt to investigate Postel’s Robustness Principle (“be conservative in what you do, be liberal in what you expect from others”) [7,1], in particular to follow the Postel’s Principle Patch (“be definite about what you accept”) [9] and the formal language theoretical approach to computer (in)security, in the general software language engineering view (not limited to internet protocols and even data languages). The result is a taxonomy of several families of language mappings, depending on the inclusion relations between their domains and preimages, as well as their codomains and images. The taxonomy (Table 1) contains two forms of precise commitment and several forms of harmful and profitable flexible commitments. Streamlining mappings (Table 2), mapping superpositions (Table 3) and other details of manipulating mappings with such flexible commitments have also been considered. This classification is a refinement with respect to previously existing approaches, and provides the means to identify safe combinations of mappings and streamliners. Formal treatment of tolerance [11] remains future work, and in general a categorical approach should provide an even more solid foundation than a set-theoretical one, which can also be refined further to yield interesting and useful properties.

References

1. E. Allman. The Robustness Principle Reconsidered. Communications of the ACM, 54(8):40–45, Aug. 2011.
2. S. Bratus and M. L. Patterson. Shotgun Parsers in the Cross-hairs. In BruCON, 2012. http://langsec.org.
3. M. Gouseti. A General Framework for Concurrency Aware Refactorings. Master’s thesis, UvA, Mar. 2015. http://dare.uva.nl/en/scriptie/502196.
4. P. Klint, R. Lämmel, and C. Verhoef. Toward an Engineering Discipline for Grammarware. ACM ToSEM, 14(3):331–380, 2005.
5. S. Klusener, R. Lämmel, and C. Verhoef. Architectural Modifications to Deployed Software. Science of Computer Programming, 54:143–211, 2005.
6. V. Laurikari. TRE. http://github.com/laurikari/tre, 2005.
7. J. Postel. DoD Standard Internet Protocol. RFC 0760, 1980.
8. L. Richardson. Beautiful Soup. http://www.crummy.com/software/BeautifulSoup, 2012.
9. L. Sassaman, M. L. Patterson, and S. Bratus. A Patch for Postel’s Robustness Principle. IEEE Security and Privacy, 10(2):87–91, Mar. 2012.
10. D. Spinellis. Differential Debugging. IEEE Software, 30(5):19–21, 2013.
11. P. Stevens. Bidirectionally Tolerating Inconsistency: Partial Transformations. In FASE, volume 8411 of LNCS, pages 32–46. Springer, 2014.
12. V. Zaytsev. Notation-Parametric Grammar Recovery. In A. Sloane and S. Andova, editors, LDTA. ACM DL, June 2012.
13. V. Zaytsev. Formal Foundations for Semi-parsing. In S. Demeyer, D. Binkley, and F. Ricca, editors, CSMR-WCRE ERA, pages 313–317. IEEE, Feb. 2014.
14. V. Zaytsev. Negotiated Grammar Evolution. JOT, 13(3):1:1–22, July 2014.
15. V. Zaytsev. BibSLEIGH: Bibliography of Software Language Engineering in Generated Hypertext. In A. H. Bagge, editor, SATToSE, pages 59–62, July 2015.