Taxonomy of Flexible Linguistic Commitments

Vadim Zaytsev
Universiteit van Amsterdam, The Netherlands, vadim@grammarware.net

Abstract. Besides strict linguistic commitments of models to metamodels, documents to schemata, programs to grammars and activities to protocols, we often require or crave more flexible commitments. Some such types of flexible commitment are well understood and even formalised; others are considered harmful. In this paper, we make an attempt to classify these types from the language theory point of view, provide concrete examples from software language engineering for each of them, and estimate their usefulness and possible harm.

1 Introduction

In software language engineering people often speak of linguistic commitment, structural commitment or commitment to grammatical structure [4]. Examples include:

◇ well-formed schema-less XML documents that commit to containing tags with allowed names and in proper combinations;
◇ XML documents that commit to the structure explicitly specified in the DTD or XSD;
◇ models that conform to their metamodel;
◇ programs in high-level programming languages that commit to the structure specified by a grammar of such a language and by other constraints of the designated compiler;
◇ programs that rely on some library and have to commit to calling its functions in proper order and with properly typed and ordered arguments;
◇ communicating entities that commit to sending and receiving messages according to the chosen protocol and must respect it in order to achieve compatibility and interoperability.

Any form of flexibility in such commitments is usually either absent or left unconsidered. In this paper we propose to model it explicitly and to classify its forms into several categories, varying in usefulness and in their ability to lead to robust systems or to catastrophes.

The paper is organised as follows: §2 states the problem more clearly and formally. Based on that, §3 introduces three kinds of flexible language variations. Then, §4 classifies and explains all flexibly committing mappings within the framework. §5 proposes five kinds of streamlining helper mappings around the same language. §6 gives a preliminary outlook on higher order manipulations such as composition of mappings, their extension/restriction and the construction of inverse functions. §7 concludes the paper.

2 Problem Statement

Consider a mapping f ∶ L1 → L2 whose domain is a software language L1 and whose codomain is a software language L2 (they can be the same in many cases, but let us look at the general case). Following the principle of least surprise, we could assume that f is surjective and total (i.e., its image fully coincides with its codomain and its preimage with its domain), so that it maps every element of L1 to some element of L2 and every element of L2 is reachable. By flexible linguistic commitment we will understand a situation when this expectation is violated.
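To make this expectation concrete, consider a minimal sketch (not from the paper; all names and the encoding are our own assumptions) that models software languages extensionally as membership predicates and checks, over finite samples, whether a mapping looks total on L1 and surjective onto L2:

```haskell
import Data.Maybe (isJust, mapMaybe)

-- A software language, modelled extensionally as a membership predicate.
type Language a = a -> Bool

-- A mapping that may be undefined on some of its inputs.
type Mapping a b = a -> Maybe b

-- Over finite samples we can at least falsify the least-surprise expectation:
-- f should be defined on every sampled element of L1 (totality)...
looksTotal :: Language a -> Mapping a b -> [a] -> Bool
looksTotal l1 f = all (isJust . f) . filter l1

-- ...and every sampled element of L2 should be reachable (surjectivity).
looksSurjective :: Eq b => Language b -> Mapping a b -> [a] -> [b] -> Bool
looksSurjective l2 f sample = all (`elem` mapMaybe f sample) . filter l2

main :: IO ()
main = do
  let evens = even :: Language Int       -- L1: the even numbers
      small = (< 11) :: Language Int     -- L2: the numbers below 11
      f x = if even x then Just (x `div` 2) else Nothing
  print (looksTotal evens f [0 .. 20])                 -- True
  print (looksSurjective small f [0 .. 20] [0 .. 20])  -- True: 0..10 reached
```

Such sampling can of course only falsify, never prove, totality and surjectivity; it merely operationalises the expectation that the rest of the paper relaxes.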
3 Variations in Software Languages

Consider three variants based on a language L:

◇ By Ľ we will denote a language that is strictly smaller than L: it has more constraints and fewer elements. Committing to a subset of a language is in general not that harmful and means limited applicability of the tool or mapping.
◇ By L̂ we will denote a similarly strict superset of L, which then has more elements and fewer constraints. Committing to a bigger language than intended is considered good practice in some areas that value robustness highly [7]. However, here we consider only “accidental” violations that extend the intended language in a way that preserves the semantics of the language instances up to heuristics that make sense for the problem domain.
◇ By L̃ we will denote another superset of L that involves semantic changes: instances from L̃ are not just outside L, they also allow interpretations that are incorrect according to the semantics of L. The existence of L̃ is the main cause for critique of robustness guidelines [1].

In order to stay universally applicable, we consider all definitions to be specific to the problem domain of the software language (without loss of generality, so-called general purpose programming and modelling languages are assumed to belong to the problem domains of Turing-complete algorithms and the everything-is-a-model paradigm, respectively).

For example, unclosed tags are indicators of ĤTML, since the HTML language is meant to be processed in a soft and extremely flexible way, yet they are indicators of X̃ML in most cases except the absolutely trivial ones. Say, one could argue that an XML document that is well-formed except for one unclosed empty element is in X̂ML, since that element can be treated as self-closing. However, a document with an unclosed element around character data c is already in X̃ML, since it has two intuitively good resolutions: one closing the element after c and one closing it before c. Yet, in a subdomain of XML that deals exclusively with empty or complex elements, but never with CDATA or mixed content, the ambiguity disappears, and such a document would also belong to X̂ML′.
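The three variants can be illustrated on a deliberately tiny language (our own toy example, not from the paper): L is the language of unsigned decimal literals; Ľ forbids leading zeros; L̂ anticipates surrounding whitespace, which is semantics-preserving to trim; L̃ admits the letter 'O', which may or may not stand for the digit zero:

```haskell
import Data.Char (isDigit, isSpace)
import Data.List (dropWhileEnd)

-- L: non-empty strings of decimal digits.
inL :: String -> Bool
inL s = not (null s) && all isDigit s

-- Ľ: a strict subset of L, canonical literals without leading zeros.
inLsub :: String -> Bool
inLsub s = inL s && (s == "0" || head s /= '0')

-- L̂: an anticipated superset; surrounding whitespace is tolerated,
-- since trimming it preserves the meaning of the literal.
inLhat :: String -> Bool
inLhat = inL . dropWhile isSpace . dropWhileEnd isSpace

-- L̃: an unanticipated superset; the letter 'O' may stand for the digit
-- zero or for a typo, so instances admit semantically different readings.
inLtilde :: String -> Bool
inLtilde s = not (null s) && all (\c -> isDigit c || c == 'O') s

main :: IO ()
main = mapM_ print
  [ inLsub   "42"    -- True:  canonical
  , inL      "042"   -- True:  in L, but not in Ľ
  , inLhat   " 42 "  -- True:  anticipated deviation, meaning preserved
  , inLtilde "4O2"   -- True:  unanticipated: a zero? a typo? ambiguous
  , inLhat   "4O2"   -- False: not covered by the anticipated superset
  ]
```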
| mapping to \ mapping from | Ľ1 → …                          | L1 → …                              | L̂1 → …          | L̃1 → …                      |
| … → Ľ2                   | correct program, wrong language | non-surjectivity, conservativeness | robustness      | overrobustness              |
| … → L2                   | partial applicability           | assumed / perfect                  | fault recovery  | overrecovery                |
| … → L̂2                   | antirobustness                  | liberality                         | fault tolerance | overtolerance, semirecovery |
| … → L̃2                   | fault introduction              | fault introduction                 | shotgun effect  | linguistic ignorance        |

Table 1. Forms of linguistic commitments of mappings with a domain L1 and codomain L2, classified by their preimages Ľ1 ⊂ L1 ⊂ L̂1 and images Ľ2 ⊂ L2 ⊂ L̂2. Additionally, L̃ is a special case where L̃i ⊃ Li, but the difference L̃1 ∖ L1 is unanticipated and L̃2 ∖ L2 is unintended.

4 Variations in First Order Mappings

Table 1 presents all possible preimage/image combinations. Let us consider each of them in turn more closely.

◇ f ∶ Ľ1 → Ľ2 (correct program, wrong language)
Some mappings that look like they have flexible linguistic commitments are in fact (possibly) correct mappings working on a different language than expected. This kind of “flexibility” is easily fixable by correcting the specification.

◇ f ∶ Ľ1 → L2 (partial mapping, limited applicability)
Theoretically this scenario corresponds to the notion of a partial function. In practice it can be well disguised in a wrapper that extends f to behave like an identity function on L1 ∖ Ľ1. If this is not done, this kind of flexibility is not flexible at all: it actually means that in order to use f, one needs to normalise its inputs in a stricter way than documented. In certain cases it manifests itself as a works-on-my-machine antipattern, when such undocumented constraints concern configuration management and deployment details.

◇ f ∶ Ľ1 → L̂2 (antirobustness, ungratefulness)
If robustness (see below) is to expect less and provide more, then antirobustness is its exact opposite: the demands on the input in this case are higher than officially specified, and even when they are met, the output is ungratefully relaxed.

◇ f ∶ Ľ1 → L̃2 (fault introduction)
If antirobust mappings can damage syntactic conformity, this variant can also do semantic damage. Such a tool is flexible beyond reason and inherently faulty: while holding surprisingly high expectations of its inputs, it generates outputs that are outright wrong.

◇ f ∶ L1 → Ľ2 (non-surjective mapping, conservativeness)
Conservative mappings, which transform a language instance into an instance with an even stronger linguistic commitment, are not surjective. For cases of L1 = L2 they are called language reductions and represent a form of traditional normalisers without extra tolerance for input violations. Most code generators are conservative: they cover the entire input language, but there exist possible target language programs that will never be generated. Classic canonical normal forms in grammarware and databases are also conservative mappings of this kind: they are proven to exist for all inputs and infer standardised, provably equivalent counterparts of them. Formally, this is one of the best kinds of flexibility.

◇ f ∶ L1 → L2 (assumed / perfect)
We list this form of commitment for the sake of completeness, but in fact it represents no flexibility: this is the standard commitment to grammatical structure [4], more or less precisely defined and precisely obeyed.

◇ f ∶ L1 → L̂2 (liberality)
In mathematics, a function is under no circumstances allowed to return a value outside its codomain, but from our point of view this variant is a conceptual sibling of the conservativeness discussed above, which corresponds to the lack of surjectivity. For the case of L1 = L2 this is called language extension, and it may be intentional: consider a coupled transformation sequence representing language evolution (if the language is backwards compatible, it will only contain language-preserving and language-extending operators), or some kind of refinement/enrichment plan that recognises patterns of language use and transforms them into constructs of some more sophisticated language.

◇ f ∶ L1 → L̃2 (fault introduction)
What classifies mappings of this category is the lack of anticipation: such a mapping was probably meant to be a Ľ1 → Ľ2 mapping, and as it turns out, it does not work correctly on unanticipated instances from L ∖ Ľ. One typical example is refactoring engines that claim to cover some programming language but in fact work correctly only on its subset, yielding irregular results whenever certain concurrency or subtyping patterns are involved [3].

◇ f ∶ L̂1 → Ľ2 (robustness)
This is precisely the model of “be conservative in what you do, be liberal in what you expect from others”, also known as Postel’s Robustness Principle [7], or at least its harmless implementations. The input language is extended to a safe superset of L1, but the output language is limited to a subset of L2. Many normalisers work this way, accepting mostly valid entities of a language and mapping them onto a strict subset of the same language. For example, BibSLEIGH [15], a project involving a large JSON database, has a normaliser that traverses all JSON files, including those that have trailing commas in lists or dictionaries, as well as those with naïve quoting, and transforms each of them into an LRJ file (Lexically Reliable Json), which is basically valid JSON with extra guarantees of one key-value pair per line and lines being sorted lexicographically. (A sketch of such a normaliser appears after this list.)

◇ f ∶ L̂1 → L2 (recovery)
A mapping that accepts slightly more than obligatory yet remains true to its output language represents error recovery: in the simplest case of L1 = L2 this may be the only purpose of the mapping, but it does not have to be. For example, consider notation-parametric grammar recovery, a technique that takes a human-written, error-containing language document and a definition of the format it is supposed to have, and yields a well-formed grammar extracted from it [12]. Most of its heuristics are such mappings. Approximate pattern matching [6] and semiparsing [13] also belong to the same group, as do screen-scraping libraries like Beautiful Soup [8].

◇ f ∶ L̂1 → L̂2 (tolerance)
In the terminology of negotiated mappings, the previous kind of flexible commitment represented adaptation through adjustment; this one is adaptation through tolerance [14]. Such a mapping still needs to cope with unexpected input, but has the option of propagating the unexpected parts further down the pipeline instead of fixing them.

◇ f ∶ L̂1 → L̃2 (shotgun effect)
This kind of mapping has been identified in the technological space of internet security and named “shotgun parsing” there [2]. A shotgun parser is a program that is meant to check its input for linguistic commitment and sanitise it, yet in the interest of performance optimisation does not perform a thorough check and limits itself to fundamentally flawed approaches, such as regular means of treating context-free languages or plain string substitution for escaping; as a result, the system becomes vulnerable to malevolent input (comment bombs, SQL injections, etc.). Every shotgun parser in the pipeline increases the span of possible treatments of data, hence the shotgun metaphor.

◇ f ∶ L̃1 → Ľ2 (overrobustness)
A few paragraphs above we reintroduced robustness as input type contravariance and output type covariance. Overrobustness does the same but crosses the syntax-semantics border and hence becomes dangerously ambiguous, nondeterministic and in general error-prone. Many examples of overrobustness leading to security bugs were given by Meredith Patterson et al. at http://langsec.org [2,9].

◇ f ∶ L̃1 → L2 (overrecovery)
Overrecovery is a process of applying heuristic-based fixes to semantically incorrect language instances with the goal of returning to the intended output language. Some aggressive heuristics, like parenthesis matching from notation-parametric grammar recovery [12], fall into this category, as do desperate matches in semiparsing [13].

◇ f ∶ L̃1 → L̂2 (overtolerance, semirecovery)
Overtolerance is a form of harmful error handling where semantic errors in the input are presumably fixed, but the output is still not always guaranteed to be syntactically correct.

◇ f ∶ L̃1 → L̃2 (ignorance)
The ultimate form of flexible linguistic commitment is complete linguistic ignorance: our inputs can be broken beyond any hope of unambiguous repair, and the outputs are not much better in that respect. All lexical analysis methods belong to this category and are still often used due to their extreme speed, ease of development and virtually unlimited flexibility, despite their limited expressiveness and abundant false positives and negatives. Some migration and analysis projects of considerable size have been reported as completed with language-ignorant lexical methods [5,10].
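As a small illustration of the robustness cell, here is a toy Postel-style normaliser loosely in the spirit of the BibSLEIGH example above; the "key = value" language and all function names are invented for this sketch and are not the actual BibSLEIGH implementation:

```haskell
import Data.Char (isSpace)
import Data.List (sort, dropWhileEnd)

-- A toy Postel-style normaliser, f ∶ L̂1 → Ľ2: liberal in what it accepts
-- (extra whitespace, trailing commas, blank lines, unsorted keys) and
-- conservative in what it emits (one canonical pair per line, sorted).
normalise :: String -> String
normalise = unlines . sort . map cleanLine . filter (not . blank) . lines
  where
    blank = all isSpace
    trim  = dropWhile isSpace . dropWhileEnd isSpace
    -- tolerate an anticipated trailing comma on the value...
    dropComma = dropWhileEnd (== ',') . trim
    -- ...but emit only the canonical "key = value" shape.
    cleanLine l = case break (== '=') l of
      (k, '=' : v) -> trim k ++ " = " ++ trim (dropComma v)
      (k, _)       -> trim k  -- no '=': keep the trimmed line as-is

main :: IO ()
main = putStr (normalise "  year = 2015 ,\n\ntitle=BibSLEIGH,\n")
-- prints:
-- title = BibSLEIGH
-- year = 2015
```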
5 Streamlining Mappings

Before we try to compose flexibly committed mappings and consider higher order combinators, we need to introduce a special subcategory of streamlining mappings that help us “get back” to L or even Ľ whenever we deviate. In particular, we need the following five (summarised in Table 2):

|        | L → …           | L̂ → …             | L̃ → …            |
| … → Ľ | canoniser (can) | normaliser (norm) | regulator (reg)  |
| … → L | id              | codifier (cod)    | calibrator (cal) |

Table 2. Symbols and names for streamlining mappings.

◇ A canoniser (can ∶ L → Ľ) has a normal (precise) commitment to L and produces a strict subset of it. Typically this is some kind of canonical normal form that makes sense for the chosen technological space; if it is strictly canonical, there is usually a proof of its uniqueness up to L.
◇ A codifier (cod ∶ L̂ → L) is flexible with its input because it applies certain recovery rules until the output conforms to all expectations of L.
◇ A normaliser (norm ∶ L̂ → Ľ) implements a classic Postel-style normalisation scheme that we have briefly discussed before: it is liberal with respect to its input and conservative with respect to its output. In our formalisation it also means that norm ≡ can ○ cod (a normaliser is equivalent to the superposition of a codifier and a canoniser; see the sketch after this list), and indeed, any normaliser can be conceptually studied as a composition of two parts implementing the liberality and the conservativeness accordingly. We will consider superposition of flexibly committing functions in the next section in more detail.
◇ Calibrators (cal ∶ L̃ → L) and regulators (reg ∶ L̃ → Ľ) can be decomposed similarly in various ways, but the key when considering them is to remember that the main distinction between L̂ and L̃ is that L̂ ∖ L is anticipated and L̃ ∖ L is not. Hence, when something is not anticipated, it cannot be casually accounted for by an automated streamliner. In practice, developing calibrators and regulators requires the same steps as developing codifiers and normalisers, with additional effort dedicated to testing and documenting recovery heuristics, which are usually unreliable.
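A minimal sketch of the streamliners over the toy literal language from §3 (the names can, cod and norm are the ones introduced above; the concrete recovery heuristics are our own assumptions), demonstrating norm ≡ can ○ cod:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- codifier, cod ∶ L̂ → L: undo the anticipated whitespace deviation.
cod :: String -> String
cod = dropWhile isSpace . dropWhileEnd isSpace

-- canoniser, can ∶ L → Ľ: strip leading zeros to reach the canonical form.
can :: String -> String
can s = case dropWhile (== '0') s of
  ""  -> "0"  -- all zeros: keep a single canonical zero
  s'  -> s'

-- normaliser, norm ∶ L̂ → Ľ: by construction, norm ≡ can ○ cod.
norm :: String -> String
norm = can . cod

main :: IO ()
main = print (norm "  0042 ")  -- "42"
```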
6 Manipulating Flexible Mappings

6.1 Extension and Restriction

For every possible combination of input and output types (strictly speaking, for any input type, since the output type is irrelevant here), a restriction to L is defined straightforwardly as a function that returns whatever the original function would have returned, for all inputs from the restricted set, and is undefined otherwise. An extension of a function deliberately committing to a subset of the intended language, on the other hand, cannot be defined in a general fashion. Hence, for any f whose preimage is L, L̂ or L̃, we get a restriction f∣L for free, but an extension to L of some f ∶ Ľ → L′ is, generally speaking, unknown. A more interesting observation is that restriction can affect the flexibility of the linguistic commitment to the output language as well: for example, if f ∶ L̂1 → L2, then f∣L1 ∶ L1 → L2 or possibly f∣L1 ∶ L1 → Ľ2. This has bad consequences for calculating inverses of restrictions.

6.2 Inverse Functions

Due to the nature of our formalisation, which considers preimages and images instead of domains and codomains, all functions that map between L1, Ľ1, L̂1 and L2, Ľ2, L̂2 have straightforward inverse mappings. Their (non-unique) existence is guaranteed: the proof is a direct consequence of the definitions of a preimage and an image. For example, for any f ∶ Ľ1 → L̂2 there is at least one f⁻¹ ∶ L̂2 → Ľ1. Inverses of restrictions can be more desirable, but less reliable, since there is no guarantee that, for an arbitrary f ∶ L̂1 → L̂2, the image of the restriction intersects L2, i.e., that f∣L1(L1) ∩ L2 ≠ ∅.
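Restriction, and a (non-unique) inverse over a finite sample, can be sketched directly; this is our own minimal encoding, and a real inverse would of course not enumerate a sample:

```haskell
type Language a = a -> Bool

-- Restriction as in §6.1: behave exactly like f on the restricted
-- language, and be explicitly undefined (Nothing) everywhere else.
restrict :: Language a -> (a -> b) -> (a -> Maybe b)
restrict l f x = if l x then Just (f x) else Nothing

-- A (non-unique) inverse as in §6.2, over a finite sample of the
-- preimage: return any witness that f maps to the given element.
invert :: Eq b => [a] -> (a -> b) -> (b -> Maybe a)
invert sample f y = case [x | x <- sample, f x == y] of
  (x : _) -> Just x
  []      -> Nothing

main :: IO ()
main = do
  print (restrict even (`div` 2) 8)  -- Just 4
  print (restrict even (`div` 2) 7)  -- Nothing: outside the restriction
  print (invert [0 .. 20] (* 2) 14)  -- Just 7: one witness from the preimage
```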
6.3 Composition

Table 3 summarises the resulting superpositions of two flexible mappings with respect to all possible kinds of mappings.

| f ⊚ g (f applied second) | g ∶ … → Ľ                               | g ∶ … → L   | g ∶ … → L̂                               | g ∶ … → L̃   |
| f ∶ Ľ′ → …              | f ○ g, if Ľ ⊆ Ľ′; f ○ can ○ g otherwise | f ○ can ○ g | f ○ norm ○ g                            | f ○ reg ○ g |
| f ∶ L → …               | f ○ g                                   | f ○ g       | f ○ cod ○ g                             | f ○ cal ○ g |
| f ∶ L̂′ → …              | f ○ g                                   | f ○ g       | f ○ g, if L̂ ⊆ L̂′; f ○ cod ○ g otherwise | f ○ cal ○ g |
| f ∶ L̃′ → …              | f ○ g                                   | f ○ g       | f ○ cod ○ g                             | f ○ cal ○ g |

Table 3. Superposition of two mappings with flexible commitments (f ⊚ g), expressed as a normal superposition of f, g and possibly some streamliners. Primes (′) simply denote non-uniqueness: for example, L̂ may be a different superset of L than L̂′.

The most obvious case is the following:

⊚ ∶ (L2 → L3) → (L1 → L2) → (L1 → L3)

However, the following compositions are also universally safe (variations of L2 correspond to variations of L in the table):

⊚ ∶ (L2 → L3) → (L1 → Ľ2) → (L1 → L3)
⊚ ∶ (L̂2 → L3) → (L1 → Ľ2) → (L1 → L3)
⊚ ∶ (L̃2 → L3) → (L1 → Ľ2) → (L1 → L3)
⊚ ∶ (L̂2 → L3) → (L1 → L2) → (L1 → L3)
⊚ ∶ (L̃2 → L3) → (L1 → L2) → (L1 → L3)

The result type of the compositions given above is a possible overapproximation, because technically we combine g with f∣L2 or f∣Ľ2, not with f itself, so it is possible for f ⊚ g with the codomain L3 to have the image Ľ3. The rest often require streamliners, with two special cases stemming from the fact that in our notation L̂ expresses any superset of L and not a particular one. Thus, it can be the case that Ľ2 ⊆ Ľ′2, and then

⊚ ∶ (Ľ′2 → L3) → (L1 → Ľ2) → (L1 → L3)

is expressible with a simple superposition, f ⊚ g = f ○ g, but otherwise we need an Ľ′2-canoniser to glue them together: f ⊚ g = f ○ can ○ g. Similarly to the safe case discussed above, the image of such a composition can be Ľ3 if Ľ2 ≠ Ľ′2.
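One cell of Table 3 rendered as code (our own toy instantiation, with invented mappings f and g): g overshoots into a superset L̂2, f commits precisely to L2, and a codifier glues them into f ⊚ g = f ○ cod ○ g:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- g lands in a superset L̂2 of the intended language: its output
-- may carry anticipated whitespace padding.
g :: Int -> String
g n = "  " ++ show n ++ " "

-- cod ∶ L̂2 → L2: the streamliner that removes the padding.
cod :: String -> String
cod = dropWhile isSpace . dropWhileEnd isSpace

-- f commits precisely to L2 (unpadded digit strings).
f :: String -> Int
f = length

-- The Table 3 cell for (f ∶ L → …) ⊚ (g ∶ … → L̂): glue with a codifier.
composeWithCod :: (String -> Int) -> (Int -> String) -> (Int -> Int)
composeWithCod f' g' = f' . cod . g'

main :: IO ()
main = print (composeWithCod f g 2015)  -- 4: padding removed before f runs
```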
7 Conclusion

This paper is an attempt to investigate Postel’s Robustness Principle (“be conservative in what you do, be liberal in what you expect from others”) [7,1], in particular to follow the Postel’s Principle Patch (“be definite about what you accept”) [9] and the formal language theoretical approach to computer (in)security, in the general software language engineering view (not limited to internet protocols and even data languages). The result is a taxonomy of several families of language mappings, depending on the inclusion relations between their domains and preimages, as well as their codomains and images. The taxonomy (Table 1) contains two forms of precise commitment and several forms of harmful and profitable flexible commitments. Streamlining mappings (Table 2), mapping superpositions (Table 3) and other details of manipulating mappings with such flexible commitments have also been considered. This classification is a refinement with respect to previously existing approaches, and provides the means to identify safe combinations of mappings and streamliners. Formal treatment of tolerance [11] remains future work, and in general a categorical approach should provide an even more solid foundation than a set-theoretical one, which can also be refined further to yield interesting and useful properties.

References

1. E. Allman. The Robustness Principle Reconsidered. Communications of the ACM, 54(8):40–45, Aug. 2011.
2. S. Bratus and M. L. Patterson. Shotgun Parsers in the Cross-hairs. In BruCON, 2012. http://langsec.org.
3. M. Gouseti. A General Framework for Concurrency Aware Refactorings. Master’s thesis, UvA, Mar. 2015. http://dare.uva.nl/en/scriptie/502196.
4. P. Klint, R. Lämmel, and C. Verhoef. Toward an Engineering Discipline for Grammarware. ACM ToSEM, 14(3):331–380, 2005.
5. S. Klusener, R. Lämmel, and C. Verhoef. Architectural Modifications to Deployed Software. Science of Computer Programming, 54:143–211, 2005.
6. V. Laurikari. TRE. http://github.com/laurikari/tre, 2005.
7. J. Postel. DoD Standard Internet Protocol. RFC 0760, 1980.
8. L. Richardson. Beautiful Soup. http://www.crummy.com/software/BeautifulSoup, 2012.
9. L. Sassaman, M. L. Patterson, and S. Bratus. A Patch for Postel’s Robustness Principle. IEEE Security and Privacy, 10(2):87–91, Mar. 2012.
10. D. Spinellis. Differential Debugging. IEEE Software, 30(5):19–21, 2013.
11. P. Stevens. Bidirectionally Tolerating Inconsistency: Partial Transformations. In FASE, volume 8411 of LNCS, pages 32–46. Springer, 2014.
12. V. Zaytsev. Notation-Parametric Grammar Recovery. In A. Sloane and S. Andova, editors, LDTA. ACM DL, June 2012.
13. V. Zaytsev. Formal Foundations for Semi-parsing. In S. Demeyer, D. Binkley, and F. Ricca, editors, CSMR-WCRE ERA, pages 313–317. IEEE, Feb. 2014.
14. V. Zaytsev. Negotiated Grammar Evolution. JOT, 13(3):1:1–22, July 2014.
15. V. Zaytsev. BibSLEIGH: Bibliography of Software Language Engineering in Generated Hypertext. In A. H. Bagge, editor, SATToSE, pages 59–62, July 2015.