Taxonomy of Flexible Linguistic Commitments
Vadim Zaytsev
Universiteit van Amsterdam, The Netherlands, vadim@grammarware.net
Abstract. Besides strict linguistic commitments of models to metamodels, documents to schemata, programs to grammars and activities to protocols, we often require or crave more flexible commitments. Some such types of flexible commitment are well understood and even formalised; others are considered harmful. In this paper, we attempt to classify these types from the language theory point of view, provide concrete examples from software language engineering for each of them, and estimate their usefulness and possible harm.
1 Introduction
In software language engineering people often speak of linguistic commitment,
structural commitment or commitment to grammatical structure [4]. Examples
include:
◇ well-formed schema-less XML documents that commit to containing tags
with allowed names and in proper combinations;
◇ XML documents that commit to the structure explicitly specified in the
DTD or XSD;
◇ models that conform to their metamodel;
◇ programs in high-level programming languages that commit to the structure specified by a grammar of the language and by other constraints of the designated compiler;
◇ programs that rely on some library and have to commit to calling its func-
tions in proper order and with properly typed and ordered arguments;
◇ communicating entities that commit to sending and receiving messages ac-
cording to the chosen protocol and must respect it in order to achieve com-
patibility and interoperability.
Any form of flexibility in such commitments is usually either absent or not considered. In this paper we propose to model such flexibility explicitly and classify its forms into several categories, varying in usefulness and in the ability to lead to robust systems or to catastrophes.
The paper is organised as follows: §2 states the problem more clearly and formally. Based on that, §3 introduces three kinds of flexible language variations. Then, §4 classifies and explains all flexibly committing mappings within the framework. §5 proposes five kinds of streamlining helper mappings around the same language. §6 gives a preliminary outlook on higher order manipulations such as composition of mappings, their extension/restriction, and the construction of inverse functions. §7 concludes the paper.
2 Problem Statement
Consider a mapping f ∶ L1 → L2 whose domain is a software language L1 and whose codomain is a software language L2 (they can be the same in many cases, but let us look at the general case). Following the principle of least surprise, we could assume that f is surjective and total (i.e., its image fully coincides with its codomain and its preimage with its domain), so that it maps every element of L1 to some element of L2 and every element of L2 is reachable. By flexible linguistic commitment we will understand a situation when this expectation is violated.
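For intuition, when languages are modelled as finite sets of strings, the expectation above can be checked directly. The following sketch (with toy names of our own choosing, not part of the formalisation) tests exactly the two properties:

```python
# A "language" as a finite set of strings, a mapping as a dict.
# f has a precise commitment to (L1, L2) iff it is total on L1
# (preimage = domain) and surjective onto L2 (image = codomain).

def is_precise(f: dict, L1: set, L2: set) -> bool:
    total = set(f) == L1                 # every element of L1 is mapped
    surjective = set(f.values()) == L2   # every element of L2 is reached
    return total and surjective

L1 = {"a", "b", "c"}
L2 = {"x", "y"}
f = {"a": "x", "b": "y", "c": "x"}   # total and surjective: precise
g = {"a": "x", "b": "x"}             # partial, non-surjective: flexible

print(is_precise(f, L1, L2), is_precise(g, L1, L2))   # True False
```

Any mapping for which `is_precise` fails exhibits some form of flexible linguistic commitment in the sense above.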
3 Variations in Software Languages
Consider three variants based on a language L:
◇ By Ľ we will denote a language that is strictly smaller than L: it has more constraints and fewer elements. Committing to a subset of a language is in general not that harmful and means limited applicability of the tool or mapping.
◇ By L̂ we will denote a similarly strict superset of L, which then has more elements and fewer constraints. Committing to a bigger language than intended is considered good practice in some areas that value robustness highly [7]. However, here we consider only “accidental” violations that extend the intended language in a way that preserves the semantics of the language instances up to heuristics that make sense for the problem domain.
◇ By L̃ we will denote another superset of L that refers to semantic changes: instances from L̃ are not just outside L, they also allow interpretations that are incorrect according to the semantics of L. The existence of L̃ is the main cause of critique of robustness guidelines [1].
In order to stay universally applicable, we consider all definitions to be specific to the problem domain of the software language (without loss of generality, so-called general purpose programming and modelling languages are assumed to belong to the problem domains of Turing-complete algorithms and the everything-is-a-model paradigm, respectively). For example, unclosed tags are indicators of ĤTML, since the HTML language is meant to be processed in a soft and extremely flexible way, yet they are indicators of X̃ML in most cases except for the absolutely trivial ones: say, one could argue that an XML document which is well-formed except for one construction with a redundant closing tag is in X̂ML, since that construction can be treated as a properly closed element. However, a construction in which character data c appears next to an improperly closed tag is already in X̃ML, since it has two intuitively good resolutions: one with c inside the element and one with c outside it. Yet, in a subdomain of XML that deals exclusively with empty or complex elements but never with cdata or mixed content, the resolution with c as element content is excluded, so the same construction would belong to X̂ML′.
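The three variants can also be made concrete on a toy language of bracket strings (an illustrative sketch of our own, not tied to any particular technological space; all names below are ours):

```python
# Toy language L: balanced bracket strings such as "()(())".
# L̂-style extension: closing brackets may be missing at the end; the
#   intended string is recoverable by appending ")" (anticipated).
# L̃-style extension: a spurious ")" makes several repairs plausible,
#   so instances admit interpretations outside the semantics of L.

def missing_closers(s):
    """Number of ')' needed at the end, or None if s has a spurious
    ')' (depth goes negative), i.e. no unique anticipated repair."""
    depth = 0
    for c in s:
        depth += 1 if c == "(" else -1
        if depth < 0:
            return None
    return depth

def in_L(s):        # L: balanced
    return missing_closers(s) == 0

def in_L_sub(s):    # Ľ: balanced and of nesting depth at most 1
    return in_L(s) and "((" not in s

def in_L_hat(s):    # L̂: balanced after the unique anticipated repair
    return missing_closers(s) is not None

print(in_L("(())"), in_L_hat("(()"), in_L_hat("())"))   # True True False
```

A string rejected by `in_L_hat` (one where the depth goes negative) is in L̃ territory: it can still be repaired, but not in a single anticipated way.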
                                     mapping from
                  Ľ1 → …               L1 → …               L̂1 → …            L̃1 → …
mapping to
… → Ľ2       correct program,     non-surjectivity,    robustness        overrobustness
             wrong language       conservativeness
… → L2       partial              assumed /            fault recovery    overrecovery
             applicability        perfect
… → L̂2       antirobustness       liberality           fault tolerance   overtolerance,
                                                                         semirecovery
… → L̃2       fault introduction   fault introduction   shotgun effect    linguistic
                                                                         ignorance
Table 1. Forms of linguistic commitments of mappings with a domain L1 and codomain L2, classified by their preimages Ľ1 ⊂ L1 ⊂ L̂1 and images Ľ2 ⊂ L2 ⊂ L̂2. Additionally, L̃ is a special case where L̃i ⊃ Li, but the difference L̃1 ∖ L1 is unanticipated and L̃2 ∖ L2 is unintended.
4 Variations in First Order Mappings
Table 1 presents all possible preimage/image combinations. Let us consider each
of them in turn more closely.
◇ f ∶ Ľ1 → Ľ2 (correct program, wrong language)
Some mappings that look like they have flexible linguistic commitments are in fact (possibly) correct mappings working on a different language than expected. This kind of “flexibility” is easily fixed by correcting the specification.
◇ f ∶ Ľ1 → L2 (partial mapping, limited applicability)
Theoretically this scenario corresponds to the notion of a partial function. In practice it can be well disguised by a wrapper that extends f to behave like the identity function on L1 ∖ Ľ1. If this is not done, this kind of flexibility is not flexible at all: it actually means that, in order to use f, one needs to normalise its inputs more strictly than documented. In certain cases it manifests itself as the works-on-my-machine antipattern, when such undocumented constraints concern configuration management and deployment details.
◇ f ∶ Ľ1 → L̂2 (antirobustness, ungratefulness)
If robustness (see below) is to expect less and provide more, then antiro-
bustness is its exact opposite: the demands on the input in this case are
higher than officially specified, and even when they are met, the output is
ungratefully relaxed.
◇ f ∶ Ľ1 → L̃2 (fault introduction)
If antirobust mappings can damage syntactic conformity, this variant can
also do semantic damage. Such a tool is flexible beyond reasonable, it is
inherently faulty: while holding surprisingly high expectations on its inputs,
it generates outputs that are outright wrong.
◇ f ∶ L1 → Ľ2 (non-surjective mapping, conservativeness)
Conservative mappings that transform a language instance into an instance with an even stronger linguistic commitment are not surjective. For cases of L1 = L2 they are called language reductions and represent a form of traditional normalisers without extra tolerance for input violations. Most code generators are conservative: they cover the entire input language, but there exist possible target language programs that will never be generated. Classic canonical normal forms in grammarware and databases are also this kind of conservative mapping: they are proven to exist for all inputs and yield standardised, provably equivalent counterparts of them. Formally, this is one of the best kinds of flexibility.
◇ f ∶ L1 → L2 (assumed / perfect)
We list this form of commitment for the sake of completeness, but in fact
it represents no flexibility: this is the standard commitment to grammatical
structure [4], more or less precisely defined and precisely obeyed.
◇ f ∶ L1 → L̂2 (liberality)
In mathematics, a function is under no circumstances allowed to return a
value outside its codomain, but from our point of view this variant is a con-
ceptual sibling of the conservativeness discussed above which corresponds
to the lack of surjectivity. For the case of L1 = L2 this is called language extension, and it may be intentional: consider a coupled transformation sequence representing language evolution (if the language is backwards compatible, it will only contain language-preserving and language-extending operators), or some kind of refinement/enrichment plan that recognises patterns of language use and transforms them into constructs of some more sophisticated language.
◇ f ∶ L1 → L̃2 (fault introduction)
What classifies mappings of this category is the lack of anticipation: it was
probably meant to be a Ľ1 → Ľ2 mapping, and as it turns out, it does not
work correctly on unanticipated instances from L ∖ Ľ. One of the typical
examples is refactoring engines that claim to cover some programming lan-
guage but in fact work correctly only on its subset while yielding irregular
results whenever certain concurrency or subtyping patterns are involved [3].
◇ f ∶ L̂1 → Ľ2 (robustness)
This is precisely the model of the “be conservative in what you do, be liberal
in what you expect from others”, also known as Postel’s Robustness Prin-
ciple [7], or at least its harmless implementations. The input language is
extended to a safe superset of L1 , but the output language is limited to a
subset of L2 . Many normalisers work this way, accepting mostly valid entities
of a language and mapping them onto a strict subset of the same language.
For example, BibSLEIGH [15], a project involving a large JSON database,
has a normaliser that traverses all JSON files, including those that have trail-
ing commas in lists or dictionaries, as well as those with naïve quoting, and
transforms each of them into an LRJ file (Lexically Reliable Json) which
is basically valid JSON with extra guarantees of one key-value pair per line
and lines being sorted lexicographically.
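The shape of such a normaliser can be sketched as follows. This is a minimal illustration of the L̂1 → Ľ2 pattern, not the actual BibSLEIGH code; the trailing-comma repair and the output format are our simplifications, and the naïve-quoting heuristic is omitted:

```python
import json
import re

def normalise(text: str) -> str:
    """Accept a superset of JSON objects (trailing commas allowed),
    emit a strict subset: one sorted key-value pair per line."""
    # Heuristic repair: drop commas that directly precede a closer.
    # (Unsafe if a string literal contains ",}" — a simplification.)
    repaired = re.sub(r",\s*([}\]])", r"\1", text)
    data = json.loads(repaired)          # must be valid JSON by now
    items = sorted(data.items())
    lines = ["{"]
    for i, (key, value) in enumerate(items):
        comma = "," if i < len(items) - 1 else ""
        lines.append(f"  {json.dumps(key)}: {json.dumps(value)}{comma}")
    lines.append("}")
    return "\n".join(lines)

print(normalise('{"b": 1, "a": [2, 3,],}'))
```

The input side is liberal (trailing commas are tolerated), while the output side lands in a strict subset of JSON with sorted, line-per-pair layout.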
◇ f ∶ L̂1 → L2 (recovery)
A mapping that accepts slightly more than obligatory yet remains true to its
output language, represents error recovery: in the simplest case of L1 = L2
this may be the only purpose of the mapping, but it does not have to be. For
example, consider notation-parametric grammar recovery, a technique that
takes a human-written error-containing language document and a definition
of the format it is supposed to have, and yields a well-formed grammar
extracted from it [12]. Most of its heuristics are such mappings. Approximate
pattern matching [6] and semiparsing [13] also belong to the same group, as
do screen-scraping libraries like Beautiful Soup [8].
◇ f ∶ L̂1 → L̂2 (tolerance)
In the terminology of negotiated mappings, the previous kind of flexible commitment represented adaptation through adjustment; this one is adaptation through tolerance [14]. Such a mapping still needs to cope with unexpected inputs, but has the option of propagating the unexpected parts further down the pipeline instead of fixing them.
◇ f ∶ L̂1 → L̃2 (shotgun effect)
This kind of mapping has been identified in the technological space of internet security and named “shotgun parsing” there [2]. A shotgun parser is a program that is meant to check its input for linguistic commitment and sanitise it, yet in the interest of performance does not perform a thorough check and limits itself to fundamentally flawed approaches, such as treating context-free languages by regular means or escaping with plain string substitution; as a result, the system becomes vulnerable to malevolent input (comment bombs, SQL injections, etc.). Every shotgun parser in the pipeline increases the span of possible treatments of data, hence the shotgun metaphor.
◇ f ∶ L̃1 → Ľ2 (overrobustness)
A few paragraphs above we reintroduced robustness as input type contravari-
ance and output type covariance. Overrobustness does the same but crosses
the syntax-semantics border and hence becomes dangerously ambiguous,
nondeterministic and in general error-prone. Many examples of overrobust-
ness leading to security bugs were given by Meredith Patterson et al. at
http://langsec.org [2,9].
◇ f ∶ L̃1 → L2 (overrecovery)
Overrecovery is a process of applying heuristic-based fixes to semantically
incorrect language instances with the goal to return to the intended out-
put language. Some aggressive heuristics like parenthesis matching from
notation-parametric grammar recovery [12] fall into this category, as well
as desperate matches in semiparsing [13].
◇ f ∶ L̃1 → L̂2 (overtolerance, semirecovery)
Overtolerance is a form of harmful error handling when semantic errors in
the input are presumably fixed, but the output is still not always guaranteed
to be syntactically correct.
◇ f ∶ L̃1 → L̃2 (ignorance)
The ultimate form of flexible linguistic commitment is complete linguistic
ignorance: our inputs can be broken beyond any hope of unambiguous re-
pair, and the outputs are not much better in that respect. All lexical analysis
             L → …           L̂ → …            L̃ → …
… → Ľ     4  canoniser    1  normaliser    ∗∼  regulator
… → L     id              3  codifier      ≃  calibrator
Table 2. Symbols and names for streamlining mappings.
methods belong to this category and are still often used due to their extreme speed, ease of development and virtually unlimited flexibility, despite limited expressiveness and abundant false positives and negatives. Some migration and analysis projects of considerable size have been reported as completed with language-ignorant lexical methods [5,10].
5 Streamlining Mappings
Before we try to compose flexibly committed mappings and consider higher order combinators, we need to introduce a special subcategory of streamlining mappings that help us “get back” to L or even Ľ whenever we deviate. In particular, we need the following five (summarised in Table 2):
◇ Canoniser (4 ∶ L → Ľ) has a normal (precise) commitment to L and produces
a strict subset of it. Typically this is some kind of canonical normal form
that makes sense for the chosen technological space; if it is strictly canonical,
there is usually a proof of its uniqueness up to L.
◇ Codifier (3 ∶ L̂ → L) is flexible with its input: it applies certain recovery rules until the output conforms to all expectations of L.
◇ Normaliser (1 ∶ L̂ → Ľ) implements a classic Postel-style normalisation
scheme that we have briefly discussed before: it is liberal with respect to
its input and conservative with respect to its output. In our formalisation
it also means that 1 ≡ 4 ○ 3 (a normaliser is equivalent to the superposition of a codifier and a canoniser), and indeed, any normaliser can be conceptually studied as a composition of two parts implementing the liberality and the conservativeness, respectively. We will consider superposition of flexibly committing functions in more detail in the next section.
◇ Calibrators (≃ ∶ L̃ → L) and regulators (* ∼ ∶ L̃ → Ľ) can be decomposed
similarly in various ways, but the key when considering them is remembering
that the main distinction between L̂ and L̃ is that L̂ ∖ L is anticipated
and L̃ ∖ L is not. Hence, when something is not anticipated, it cannot be
casually accounted for by an automated streamliner. In practice developing
calibrators and regulators requires the same steps as developing codifiers
and normalisers, with additional effort dedicated to testing and documenting
recovery heuristics, which are usually unreliable.
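On a toy language of bracket strings, the decomposition 1 ≡ 4 ○ 3 can be sketched directly (an illustration under our own assumptions: the codifier appends missing closing brackets, the canoniser strips whitespace):

```python
def codifier(s: str) -> str:
    """3 : L̂ → L — append the anticipated missing ')' characters."""
    return s + ")" * (s.count("(") - s.count(")"))

def canoniser(s: str) -> str:
    """4 : L → Ľ — canonical form: strip all whitespace."""
    return "".join(c for c in s if not c.isspace())

def normaliser(s: str) -> str:
    """1 : L̂ → Ľ — liberal input, conservative output; 1 ≡ 4 ○ 3."""
    return canoniser(codifier(s))

print(normaliser("( ( )"), canoniser(codifier("( ( )")))   # (()) (())
```

The two implementations agree on every input of this toy L̂, which is exactly what the equivalence 1 ≡ 4 ○ 3 claims.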
f ⊚ g                                  g, applied first
f, applied       … → Ľ                … → L       … → L̂                … → L̃
second
Ľ′ → …      f ○ g, if Ľ ⊆ Ľ′;      f ○ 4 ○ g    f ○ 1 ○ g            f ○ ∗∼ ○ g
            f ○ 4 ○ g otherwise
L → …       f ○ g                  f ○ g        f ○ 3 ○ g            f ○ ≃ ○ g
L̂′ → …      f ○ g                  f ○ g        f ○ g, if L̂ ⊆ L̂′;    f ○ ≃ ○ g
                                                f ○ 3 ○ g otherwise
L̃′ → …      f ○ g                  f ○ g        f ○ 3 ○ g            f ○ ≃ ○ g
Table 3. Superposition of two mappings with flexible commitments (f ⊚ g), expressed as normal superposition of f, g and possibly some streamliners. Primes denote non-uniqueness: for example, L̂ may be a different superset of L than L̂′.
6 Manipulating Flexible Mappings
6.1 Extension and Restriction
For every possible combination of input and output types (strictly speaking, for any input type, since the output type is irrelevant), a restriction on L is defined straightforwardly as a function which returns whatever the original function would have returned, for all inputs from the restricted set, and is undefined otherwise. An extension of a function deliberately committing to a subset of the intended language, on the other hand, cannot be defined in a general fashion. Hence, for any f whose preimage is L, L̂ or L̃, we get f ∣L for free, but the extension to L of an f ∶ Ľ → L′ is, generally speaking, unknown.
A more interesting observation is that restriction can affect the flexibility of the linguistic commitment to the output language as well: for example, if f ∶ L̂1 → L2, then f ∣L1 ∶ L1 → L2 or possibly f ∣L1 ∶ L1 → Ľ2. This has bad consequences for calculating inverses of restrictions.
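On finite mappings, restriction and its effect on the image can be observed directly (a sketch with toy sets of our own):

```python
def restrict(f: dict, L: set) -> dict:
    """f|L: returns what f would, for inputs from L; undefined elsewhere."""
    return {x: y for x, y in f.items() if x in L}

L1 = {"a", "b"}
f = {"a": "x", "b": "y", "c": "z"}   # f : L̂1 → L2 with L̂1 ⊃ L1
L2 = {"x", "y", "z"}

g = restrict(f, L1)                  # f|L1 : L1 → ?

# The restriction is total on L1, but its image {x, y} is a proper
# subset of L2: the commitment to the output language tightened.
print(set(g.values()) < L2)   # True
```

This is exactly the case where f ∣L1 turns out to be of type L1 → Ľ2 rather than L1 → L2.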
6.2 Inverse Functions
Due to the nature of our formalisation, which considers preimages and images instead of domains and codomains, all functions that map between L1, Ľ1, L̂1 and L2, Ľ2, L̂2 have straightforward inverse mappings. Their (non-unique) existence is guaranteed (the proof is a direct consequence of the definitions of a preimage and an image). For example, for any f ∶ Ľ1 → L̂2 there is at least one f −1 ∶ L̂2 → Ľ1. Inverses of restrictions can be more desirable, but less reliable, since there is no guarantee that, for an arbitrary f ∶ L̂1 → L̂2, the image f ∣L1 (L1) ∩ L2 ≠ ∅.
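A sketch of such a non-unique inverse on a finite mapping (names are ours):

```python
def some_inverse(f: dict) -> dict:
    """One inverse of f as a mapping image(f) → preimage(f); other
    inverses arise by picking different representatives where f is
    not injective, hence the guaranteed but non-unique existence."""
    inv = {}
    for x, y in f.items():
        inv.setdefault(y, x)   # keep the first representative found
    return inv

f = {"a": "x", "b": "x", "c": "y"}   # not injective: x has two sources
inv = some_inverse(f)
print(all(f[inv[y]] == y for y in set(f.values())))   # True
```

Choosing `"b"` instead of `"a"` as the representative of `"x"` yields a different, equally valid inverse.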
6.3 Composition
Table 3 summarises the resulting superpositions of two flexible mappings with
respect to all possible kinds of mappings. The most obvious case is the following:
⊚ ∶ (L2 → L3 ) → (L1 → L2 ) → (L1 → L3 )
However, the following compositions are also universally safe (variations of
L2 correspond to variations of L in the table):
⊚ ∶ (L2 → L3 ) → (L1 → Ľ2 ) → (L1 → L3 )
⊚ ∶ (L̂2 → L3 ) → (L1 → Ľ2 ) → (L1 → L3 )
⊚ ∶ (L̃2 → L3 ) → (L1 → Ľ2 ) → (L1 → L3 )
⊚ ∶ (L̂2 → L3 ) → (L1 → L2 ) → (L1 → L3 )
⊚ ∶ (L̃2 → L3 ) → (L1 → L2 ) → (L1 → L3 )
The result type of the compositions given above is a possible overapproximation, because technically we combine g with f ∣L2 or f ∣Ľ2, not with f itself, so it is possible for f ⊚ g with the codomain L3 to have the image Ľ3.
The rest often require streamliners, with two special cases stemming from the
fact that in our notation L̂ expresses any superset of L and not a particular one.
Thus, it can be the case that Ľ2 ⊆ Ľ′2 and then ⊚ ∶ (Ľ′2 → L3 ) → (L1 → Ľ2 ) →
(L1 → L3 ) is expressible with a simple superposition: f ⊚ g = f ○ g, but otherwise
we need an Ľ′2 -canoniser to glue them together: f ⊚ g = f ○ 4 ○ g. Similarly to the
safe case discussed above, the image of such composition can be Ľ3 if Ľ2 ≠ Ľ′2 .
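The table-driven gluing can be sketched as a composition operator that inspects the commitments of both mappings (a schematic illustration; the variant encoding and the reduced streamliner table are our own simplifications of Tables 2 and 3):

```python
# Each mapping is a triple (function, input variant, output variant),
# where a variant is "sub" (Ľ), "exact" (L) or "super" (L̂).

def compose(f, g, streamliners):
    """f ⊚ g: plain f ∘ g when g's output already fits f's expected
    input; otherwise glue with a streamliner looked up by the pair
    (g's output variant, f's input variant)."""
    f_fn, f_in, f_out = f
    g_fn, g_in, g_out = g
    order = {"sub": 0, "exact": 1, "super": 2}
    if order[g_out] <= order[f_in]:
        glue = lambda x: x                     # safe: f ∘ g directly
    else:
        glue = streamliners[(g_out, f_in)]     # e.g. a codifier
    return (lambda x: f_fn(glue(g_fn(x))), g_in, f_out)

# Toy instance on bracket strings: g extends its input (output in L̂),
# f expects a proper element of L, so a codifier is inserted.
codifier = lambda s: s + ")" * (s.count("(") - s.count(")"))
streamliners = {("super", "exact"): codifier}

g = (lambda s: "(" + s, "exact", "super")     # liberality: L → L̂
f = (lambda s: s.upper(), "exact", "exact")
h, h_in, h_out = compose(f, g, streamliners)
print(h("(a)"))   # ((A))
```

A fuller implementation would populate `streamliners` with all five entries of Table 2 and track the primed variants separately.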
7 Conclusion
This paper is an attempt to investigate Postel’s Robustness Principle (“be con-
servative in what you do, be liberal in what you expect from others”) [7,1], in
particular to follow the Postel’s Principle Patch (“be definite about what you ac-
cept”) [9] and the formal language theoretical approach to computer (in)security,
in the general software language engineering view (not limited to internet pro-
tocols and even data languages). The result was a taxonomy of several fami-
lies of language mappings depending on the inclusion relations between their
domains and preimages, as well as codomains and images. The taxonomy (Ta-
ble 1) contains two forms of precise commitment and several forms of harmful
and profitable flexible commitments. Streamlining mappings (Table 2), mapping
superpositions (Table 3) and other details of manipulating mappings with such
flexible commitments, have also been considered. This classification is a refinement of previously existing approaches and provides the means to identify safe combinations of mappings and streamliners. Formal treatment of tolerance [11] remains future work; in general, a categorical approach should provide an even more solid foundation than the set-theoretical one, which can also be refined further to yield interesting and useful properties.
References
1. E. Allman. The Robustness Principle Reconsidered. Communications of the ACM,
54(8):40–45, Aug. 2011.
2. S. Bratus and M. L. Patterson. Shotgun Parsers in the Cross-hairs. In Brucon,
2012. http://langsec.org.
3. M. Gouseti. A General Framework for Concurrency Aware Refactorings. Master’s
thesis, UvA, Mar. 2015. http://dare.uva.nl/en/scriptie/502196.
4. P. Klint, R. Lämmel, and C. Verhoef. Toward an Engineering Discipline for Gram-
marware. ACM ToSEM, 14(3):331–380, 2005.
5. S. Klusener, R. Lämmel, and C. Verhoef. Architectural Modifications to Deployed
Software. Science of Computer Programming, 54:143–211, 2005.
6. V. Laurikari. TRE. http://github.com/laurikari/tre, 2005.
7. J. Postel. DoD Standard Internet Protocol. RFC 0760, 1980.
8. L. Richardson. Beautiful Soup. http://www.crummy.com/software/BeautifulSoup, 2012.
9. L. Sassaman, M. L. Patterson, and S. Bratus. A Patch for Postel’s Robustness
Principle. IEEE Security and Privacy, 10(2):87–91, Mar. 2012.
10. D. Spinellis. Differential Debugging. IEEE Software, 30(5):19–21, 2013.
11. P. Stevens. Bidirectionally Tolerating Inconsistency: Partial Transformations. In
FASE, volume 8411 of LNCS, pages 32–46. Springer, 2014.
12. V. Zaytsev. Notation-Parametric Grammar Recovery. In A. Sloane and S. Andova,
editors, LDTA. ACM DL, June 2012.
13. V. Zaytsev. Formal Foundations for Semi-parsing. In S. Demeyer, D. Binkley, and
F. Ricca, editors, CSMR-WCRE ERA, pages 313–317. IEEE, Feb. 2014.
14. V. Zaytsev. Negotiated Grammar Evolution. JOT, 13(3):1:1–22, July 2014.
15. V. Zaytsev. BibSLEIGH: Bibliography of Software Language Engineering in Gen-
erated Hypertext. In A. H. Bagge, editor, SATToSE, pages 59–62, July 2015.