=Paper=
{{Paper
|id=Vol-1453/20_WalchWalterKuechlin_FormalAnalysisOfTheLinuxKernel_Confws-15_p131
|storemode=property
|title=Formal analysis of the Linux kernel configuration with SAT solving
|pdfUrl=https://ceur-ws.org/Vol-1453/20_WalchWalterKuechlin_FormalAnalysisOfTheLinuxKernel_Confws-15_p131.pdf
|volume=Vol-1453
|dblpUrl=https://dblp.org/rec/conf/confws/WalchWK15
}}
==Formal analysis of the Linux kernel configuration with SAT solving==
Formal Analysis of the Linux Kernel Configuration with SAT Solving Martin Walch1 and Rouven Walter1 and Wolfgang Küchlin1 Abstract. The Linux kernel is a highly configurable software sys- the project, it is a first prototype with only a simplified view on the tem. The aim of this paper is to develop a formal method for the Kconfig files. As a consequence, the results were of the proof-of- analysis of the configuration space. We first develop a Linux prod- concept type and of very limited use for verification purposes. uct overview formula (L-POF), which is a Boolean formula repre- The comprehensive VAMOS project3 at Friedrich-Alexander- senting the high-level configuration constraints of the kernel. Using Universität Erlangen-Nuremberg has lead to numerous results and SAT solving on this L-POF, we can then answer many questions, publications, including the uncovering of hundreds of real world such as which options are possible, mandatory, or impossible for any bugs. It also analyses the Linux kernel configuration with the means of the processor architectures for which the kernel may be config- of Propositional Logic, but goes much further by considering the ac- ured. Other potential applications include building a configurator or tual effects on the kernel code and applying the tools to other projects counting the number of kernel configurations. Our approach is anal- that use LKC as well. The PhD Thesis of Reinhard Tartler [9] from ogous to the methods we use for automobile configuration. However, 2013 gives a detailed overview over the model, the implemented in the Linux case the configuration options (e.g. the individual de- tools, and most of the applications and results. vice drivers) are represented by symbols in Tristate Logic, a special- The PhD Thesis of Sarah Nadi [5] from 2014 picks up the VAMOS ized three-valued logic system with several different data types, and project and extends it to not only consider the Kconfig files and the the configuration constraints are encoded in a somewhat arcane lan- kernel code, but to additionally take the build system into account. guage. We take great care to compile the L-POF directly from the Even more, it extracts configuration constraints from the implemen- files that hold the configuration constraints in order to achieve max- tation. imum flexibility and to be able to trace results directly back to the In this work, we focus solely on the Kconfig files. Although the source. VAMOS model proves to yield good results, it does not aim at being exact and relies on parts of LKC. We present a fairly precise model and translation process into a product overview formula in Proposi- 1 Introduction tional Logic with the goal to account for all details that are relevant in Linux is a kernel for a broad range of platforms with highly versa- real-world use cases. Our implementation works independently from tile configurations of peripheral components. Static configuration at LKC and we show the results from running it against Linux 4.0. compile time helps to adapt to the different requirements. The central This work is loosely based on the paper by Zengler and Küchlin tool for configuring this is LinuxKernelConf, abbreviated LKC. The from 2010 [11] in that it picks up and uses central ideas, but elabo- input to LKC uses a domain specific language to describe configu- rates on many more details. ration constraints. A common alternative name for LKC is Kconfig, There is no precise specification of Kconfig. So we consider what and they are often used interchangeably. In this document, we refer information is available in the documentation, the implementation, to the configuration system as LKC and we denote the language as and the way the input language is used. Section 2 gives a rough Kconfig. The input is large and stored in files which we call Kcon- overview over this input language. fig files. They continuously change as kernel development goes on. Our translation uses several intermediate stages. First we create Automatic checks for semantic consistency on Kconfig files are de- what we call the Zengler Model in Section 3. In this step we abstract sirable, but LKC has no such checks implemented. from technical details of reading the configuration input and isolate At the Workshop on Configuration 2010 in Lisbon, Zengler and the data that is relevant for our purposes. Küchlin presented an approach [11] to encode the whole Linux ker- The Zengler Model retains some parts of the input structure that nel configuration in Propositional Logic. Conceptually this work have an impact on the meaning. In Section 4, we transform it into the parses the Kconfig files and stores the relevant information in a Attributes Model, resolving these parts and switching from a repre- database. Subsequently, this database is translated into a product sentation that focuses on input structure to one that revolves around overview formula2 , abbreviated POF, in Propositional Logic. While the constraints. it demonstrates central ideas and shows the technical feasability of LKC uses a three-valued logic, called Tristate Logic, that is un- common in academic discourse. We take a look at this logic system 1 Symbolic Computation Group, WSI Informatics, Universität Tübingen, in Section 5 and introduce some extensions to it. We then proceed by Germany, www-sr.informatik.uni-tuebingen.de 2 A POF is a single Boolean formula which captures all configuration con- generating a product overview formula in this extended logic from straints of a high-level configuration model (cf. [3]). Historically it has the Attributes Model, encoding the set of all valid Linux kernel con- been introduced in [4] to capture the high-level configuration model of Mercedes-Benz, which is called “Product overview” (German: Produkt- 3 VAMOS: Variability Management in Operating Systems, übersicht). www4.cs.fau.de/Research/VAMOS/ 131 Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria figurations. LKC reads these files with a scanner and a parser that are gener- Eventually, we translate the product overview formula from the ated using flex4 and GNU Bison.5 Exploring the Bison grammar is three-valued logic into Propositional Logic in Section 6. With com- clearly beyond the scope of this document as it consists of more than mon transformation into CNF and SAT solving, we can reason about 100 production rules. Suffice it to say that the grammar we use very the set of valid configurations, yielding first results in Section 7. closely ressembles the grammar of LKC. To give an overview over the relevant parts of the configuration language, we confine ourselves to explaining an artificial example. A 2 The Linux Kernel Configuration System concise formal description of the relevant parts was done by She and Kconfig is a specialized input language for LKC that describes Berger [7]. available features in terms of symbols and offers several different Listing 1 contains most of the relevant language features. The means to encode constraints and interdependencies. Kernel devel- central element everything else is built around is the configuration opers maintain this information in hierarchically organized Kconfig block. Its purpose is to declare and to describe symbols. The keyword files. config starts a configuration block and is followed by the name of the symbol it refers to. A configuration block contains one or more Listing 1. Example Kconfig file lines that further specify what we call properties of that symbol. Each config T0 symbol must have at least one configuration block. In practice, for the tristate majority of symbols there is exactly one configuration block for each prompt " f e a t u r e T0 " symbol, but there may be more. In Linux 4.0, there are up to six per select T1 symbol. depends on T2 && !C0 Each symbol has one of the five types tristate, bool, config T1 string, int, and hex. In our example, the symbols T0, T1, and tristate T2 have the type tristate, the symbol B0 has the type bool, and prompt " f e a t u r e T1 " if C0 the symbol S0 has the type string. We call a symbol declared if it has at least one configuration block and one of the five types. config T2 On actual configuration, each symbol is assigned an unambigu- tristate ous value from their respective domains. The domain of tristate default y symbols is {0, 1, 2} and the domain of bool symbols is {0, 2}. The purpose of the three-valued tristate type is to encode the three choice tristate possible activation states that apply to many features in the kernel: prompt "CHOICE 0 " Turn it off (0), compile it as a runtime-loadable module (1) or com- pile it into the kernel (2). The bool type is the same except that it is config C0 missing the value 1 for the module state. prompt " f e a t u r e C0 " To identify the three different states in syntax, there are the three constants n, m, and y, which evaluate to 0, 1, and 2, respectively. config C1 The domain of string symbols is the set of all valid strings. prompt " f e a t u r e C1 " Similar to the domain of bool being a subset of the domain of tristate, the domains of int and hex symbols both are subsets endchoice of the domain of string symbols, in that their elements are strings menu " submenu " that read as valid integers or hexadecimal numbers, respectively. visible if T1 There are different types of dependencies that affect properties, thus most properties can be active or inactive. The symbol type is the config S0 only unconditional property. It is always active and cannot dynami- string cally change during configuration. It is also the only property that is default " d e f a u l t s t r i n g " mandatory for every symbol. However, to allow the user to directly assign a value, the symbol needs a prompt property. In our exam- config B0 ple, there are several such prompt properties, one of them for the bool prompt " c u s t o m S0 " symbol T0. The prompt keyword is followed by a string parame- ter "feature T0". This parameter is mandatory and shows up as short if B0 description in the configuration interface. The symbol T0 also has a select property with the argument config S0 T1. This property is only valid with bool and tristate symbols. prompt " S0 " The select property sets up a direct relation between the values of two symbols. The value of the selected symbol is always at least as endif high as the value of the symbol that the select property belongs to. So, in our example the value of T1 is always at least as high as endmenu the value of T0. The documentation calls this a reverse dependency. The last line of the configuration block of T0 starts with depends on and is followed by an expression in Tristate Logic T2 && !C0. 4 flex: The Fast Lexical Analyzer, flex.sourceforge.net 5 GNU Bison, www.gnu.org/software/bison/ Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors 132 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria This is a dependency that applies to the whole configuration block, possibly gaining forward compatibility. It starts with the option i.e. in our case it is a dependency to the prompt and to the select keyword, followed by one of a few option names, and depending on property. The actual dependency is encoded in the expression. It has that, the meaning varies and an argument may be appended. As of no direct influence on the value of the symbol, but rather on the prop- Linux 4.0, two different options are defined that may be relevant in erties which in turn set up constraints to the value.We take a closer our context: modules and env. The modules option makes only look at Tristate Logic and the mechanisms of dependencies in sec- sense with a bool symbol, and it may be only used once in the tion 5. whole configuration. It associates the corresponding symbol with the It is possible to add a dependency to a single property by append- availability of support for loading modules, i.e. as a consequence, ing an expression with if to the respective line, like for the configu- deactivating that symbol basically prohibits assigning the value 1 ration prompt of T1. from the tristate domain. The env option is usually only used A default property is used to non-bindingly suggest a value, for string symbols that have no prompt property as it imports but it is also used to automatically set values of symbols that have a value from the runtime environment of the system the configura- no active prompt. In our example, the symbol T2 uses the trivial tion system is running on. Only very few symbols use this option at expression y as default value, but more complicated expressions all and even fewer have an effect on the configuration that is worth are possible. mentioning. Still, as we want to be precise, we account for them, too. Our example contains a choice block. This language con- From our example it is also not apparent how Kconfig files are hi- struct encloses a series of configuration blocks with choice and erarchically organized. It works similarly to #include directives endchoice, constituting a set of symbols that logically exclude of the C preprocessor: A file inclusion is done with the keyword each other. This is useful for features that serve the same purpose string and a string argument. The string argument is treated as and which therefore naturally cannot be simultaneously active, like path and the content of the file at that path is pasted at the position of e.g. the compression algorithm of the kernel image or the preemption the string line. model. Like a symbol, a choice block has a type, but only the two Kconfig still offers more constructs that we have not mentioned, data types bool and tristate are valid. A choice block of type but they are irrelevant at this point. bool enforces that exactly one of the enclosed symbols is set to 2. The tristate type relaxes this strict constraint and allows to not 3 Zengler Model assign 2 to any of the symbols, while setting an arbitrary number of symbols to 1. There is a property optional that even allows setting Linux 4.0 supports 30 different main architectures. Each architecture all symbols to 0. In contrast to a symbol, a choice block must have a has its own tree of Kconfig files, but there are big overlaps across all prompt property. There may also be default properties, suggest- architectures by using common files. Accordingly we want to be able ing one of the enclosed symbols, and dependencies using depends to reason about the configurations across all architectures. However, on, which analogously to configuration blocks apply to the whole capturing several architectures in one data structure without losing choice block. precision is very complicated as this requires keeping track of which A hierarchical menu can be constructed using the menu blocks. files are included for each architecture and in which order they are Its primary purpose is organizing the features to ease navigation. Just read. Therefore, we create a separate model for each architecture and like the prompt property, the menu keyword is followed by a string consider them together if needed. that serves as user visible label for the menu. For our point of view it By considering one fixed architecture, we can deduce the full in- is important that there may be dependencies added using depends clusion hierarchy of Kconfig files for that specific architecture. Ac- on that apply to the whole menu, and a second type of dependency cordingly, our parser moves through the file hierarchy in the same that is set up with visible if. That second kind of dependency way as LKC does, without the need to store which files we in- toggles only the parts that concern direct configurability by the user, clude. However, it produces different data structures. We create two but leaves all other properties unaffected. databases, a symbol database DS , and a choice database DC . They Finally it is also possible to add dependencies to an arbitrary se- are heavily extended versions of the databases created by Zengler and quence of configuration blocks, choice blocks, and menus as long as Küchlin [11], hence we call these two databases the Zengler Model. it does not violate the logical hierarchy. This is done with the enclos- The symbol database contains symbol descriptors. Each configu- ing keywords if and endif with the dependency expression right ration block corresponds to a symbol descriptor. They are grouped by behind if. the symbol they belong to. So, our symbol database is a set of pairs −−−→ −−−→ Our example covers those language constructs that we regard as DS = (s, desc) where s is a symbol and desc is a list of symbol most important for variability, but there is more: For symbols with descriptors desc with the int or hex type, there is a range property that fixes a lower − → −→ −−→ −−−→ −−− and an upper bound for the value. Both bounds may depend on freely desc = (t, pmpt, sel, def , − range, dep, opt, kdesc ) changeable values of other symbols. In practice nearly all bounds pmpt = (ep , kp ) have constant values and those few bounds that are variable can be sel = (ss , es , ks ) manually checked for inconsistencies. From our perspective the im- pact of range properties on the variability of the overall configura- def = (ev , ed , kd ) tion is mostly negligible in current versions of Linux. Therefore, we range = (esl , esh , er , kr ) incorporate range properties in our model, but ignore them when dep = (− e→, − e−→, − if dep e−→)vis creating a product overview formula as we do not expect any observ- opt = (benv , bmod ). able consequences. Another language feature that does not occur in our example is The type t is one of the five types or it is a special type “unknown” the generic option line in configuration blocks, which allows ex- if the configuration block has no type property. It suffices if only one tending the language with only minimal changes to the grammar and descriptor holds a regular type. 133 Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria The properties prompt, select, default, and range appear We write E to denote sets of expressions. The other attributes are in the descriptor as the tuples pmpt, sel, def , and range, respec- only affected by the other dependencies, hence we define them as tively. As the input order plays a role, each of them and the descriptor follows: contain unique indices kdesc , kp , ks , kd , and kr that are ascending in input order. aS = (Es , s, ks ) = (− e→ −−→ if ∪ edep ∪ es , s, ks ) The string argument of a prompt property is unimportant for the aD = (Ed , ev , kd ) = (− e→ ∪ − if e−→ ∪ e , e , k ) dep d v d configuration logic, so we only store the expression of the optional if dependency ep . If there is no such dependency, we store a spe- aR = (Er , esl , esh , kr ) = (− e→ −−→ s s if ∪ edep ∪ er , el , eh , kr ) cial symbol as placeholder for an empty expression. Analogously, With these definitions we collect for each symbol s the prompt at- the optional if dependencies for select, default, and range tributes aP in a set P (s), the default attributes aD in a set D(s), properties are stored as es , ed , and er . and the range attributes aR in a set R(s). We also create a set S(s) The target of a select property is a single symbol es . A for each symbol, but note that the symbol in the definition of aS is default value may be the result of a non-trivial expression ev . The the symbol of the descriptor that contains the property and not the lower bound of a range property esl may be either another symbol symbol ss that is the select target. Accordingly we add each aS or a constant value. The same goes for the upper bound esh . to the set of the target symbol. If there is no descriptor of the target We do not store any menus or enclosing ifs explicitely. Instead we symbol we discard the attribute. propagate their dependencies to the contained descriptors and store To determine the symbol type that results from the types in the the dependency expressions in lists that distinguish between the dif- corresponding descriptors, we take the type of the symbol descrip- ferent types. The dependencies of surrounding ifs are stored in − e→ if . −−→ tor with the lowest kdesc that is not “unknown”. If there is no such Those of depends on dependencies add up to edep , including any descriptor we also set the type of the symbol to “unkown”. such dependencies contained in the configuration block under con- Finally we concatenate the opt bitvectors component-wise with a sideration. The third list − e− → vis contains the visible if dependen- logical or and store it with the symbol. cies from menus. We do not need dep and kdesc anymore and do not transfer them The pair of bits opt indicates if there is a module or env option. into the Attributes Model. Even though env carries an argument in the configuration block, we Next we transform the choice descriptors from the Zengler Model only care whether it is present. into what we call choice groups. Like for symbols, we determine the Our choice database DC is a set of tuples which we call choice type of the choice group. If t in the choice descriptor is bool or descriptors: tristate, then this is also the type of the choice group. However, − −−− → −−− → it is a frequently used feature to skip the explicit type declaration (t, bopt , pmpt, dep, kdesc , kc ) in choice blocks, resulting in choice descriptors with the special type Each choice block translates into a choice descriptor. The type t “unknown”. In that case the type is taken from the first symbol that is is one of bool, tristate or “unknown”. The bit bopt indicates enclosed in the choice descriptor and has a regular type. If any sym- whether the choice block carries the optional flag. Just like in a sym- bol in the choice descriptor is lacking a regular type then it inherits bol descriptor, there is a list of pmpt tuples, dependencies lists dep, the type of the choice group. and an index kc . Rather than storing the symbol names of the con- Now that we have completed the type resolution for choice groups, −−−→ we determine which symbols are part of the choice group. This is not tained symbol descriptors, we create a list of their indices kdesc . trivial as not all symbols that correspond to symbol descriptors in the In contrast to the symbol descriptors, the default properties of choice descriptor are necessarily transferred. Symbols can be moved choice blocks have no influence on variability and we ignore them. into their own submenu by depending on a symbol that is immedi- ately above, excluding them from the choice group. We actually see 4 Attributes Model this constellation intentionally used in Linux 4.0 and have to process it adequately. LKC involves an extensive logic to determine whether Although the Zengler Model is an abstraction from the underlying to move a symbol into a submenu, but a simple heuristic suffices to Kconfig files, storing the relevant information in a normalized way, correctly capture any real-world case. its focus lies on the input structure: The symbol database holds in- As LKC ignores the attributes S(s) and D(s) of all symbols of dividual symbol descriptors desc and lists of dependencies dep re- the choice group, we clear them if they have any content. flecting artifacts from the input structure. Furthermore, we have yet To complete the choice group, we generate a prompt attribute to determine the final types of the symbols and choices and which aP the same way we do from a symbol descriptor and we keep the symbols a choice includes as this is not trivial in all cases. optional bit bopt . In the next step, we resolve all this and create a model that fo- Our new databases DS0 and DC 0 are now cuses on the effective impact of the descriptor properties on the sym- −−−−−−−−−−−−−−−−−−−−−−−−→ bols. For a better distinction from the properties of the descriptors in DS0 = (s, t, P (s), S(s), D(s), R(s), opt) the Zengler Model, we refer to their resulting equivalents in the new −−−−−−−−−−→ model as attributes. Therefore we call the new model the Attributes 0 DC = (t, bopt , aP , − → s) Model. First we associate the attributes directly with the dependencies from dep that control them. As already mentioned in Section 2, the 5 Tristate* Logic and POF dependencies in − e−→ vis do not affect all types of attributes. In fact, they Our next step is to translate the Attributes Model into a coherent control only our prompt attributes aP . We define them as product overview formula. While our goal is to arrive at a POF in Propositional Logic, we prefer to split this translation into two steps. aP = (Ep , kp ) = (− e→ −−→ −−→ if ∪ edep ∪ evis ∪ ep , kp ) First we define Tristate* Logic, an extension to Tristate Logic that Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors 134 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria we specifically design to create an initial POF. Then we translate the operators and symbols in expressions is actually a feature, leading to POF from Tristate* Logic into Propositional Logic in Section 6. frequent conversions between the two domains. The following grammar describes the general syntax of expres- This may become quite complex. Consider the short expression sions in Tristate Logic: A=B. If both, A and B, are declared string symbols, their values are compared, yielding 2 if they are exactly identical and 0 otherwise. hexpressioni → hexpression symboli However, if A is a tristate symbol and B is undeclared, then the | hexpression symboli‘=’hexpression symboli values 0, 1, 2 of A are interpreted as the strings “n”, “m”, and “y” and | hexpression symboli‘!=’hexpression symboli B is interpreted as the string “B”, i.e. in this case a string com- | ‘(’ hexpressioni‘)’ parison is done on these letters, always yielding 0. We take all these | ‘!’hexpressioni details into account when producing our POF in Tristate* Logic. | hexpressioni‘&&’hexpressioni Encoding the constraints of bool and tristate symbols that | hexpressioni‘||’hexpressioni originate from their respective attributes, works in an indirect way: For each of these symbols, we add two auxiliary variables which hexpression symboli → symbol name represent a lower and an upper limit to the value of the symbol in | constant value consideration, and, using the ⇒ operator, we append the constraint There are five operators and there are parentheses to override op- that the value of the symbol must not exceed the values set up by the erator precedences. We distinguish two groups of operators: The auxiliary variables. tristate operators !, &&, and ||, and the string operators = This is in contrast to encoding the choice groups: We encode the and !=. Both groups operate on their respective domains. exclusiveness of symbols with expressions that directly relate to the The tristate operators and their domain form a three-valued symbols instead of their associated auxiliary variables. logic like those of Łukasiewicz and Kleene. A comprehensive Finally, we also take account of the mode without module support overview of these logics has been done by Gabbay and Woods [2]. by adding two different subformulae for tristate symbols and In fact, the three base operators !, &&, and || correspond to the tristate choice groups which depend on the symbol that has the operators ¬, ∧ and ∨ in K3 and Ł3 from Kleene and Łukasiewicz. bmod bit set. This observation has already been made in 2010 by Berger et al. [1]. However, Tristate Logic itself is not expressive enough for a full POF. 6 POF in Propositional Logic We need to extend it for our purposes. The nature of our dependencies is Boolean and not three-valued, Translating the POF from Tristate* Logic into Propositional Logic because either they are met or they are not. There is no third value is mostly straightforward. We encode each symbol with the type like the module state in Tristate Logic. Hence, when extending Tri- tristate or bool and all auxiliary variables using two variables state Logic to allow encoding all constraints in a POF, we look for in Propositional Logic. The three values of the tristate domain operators that operate on tristate values, but only yield two dif- correspond to three states that the two Propositional Variables can ferent values. Such operators are not typically part of K3 or Ł3 and encode. We explicitely prohibit the fourth possible state. hence we exclude these logics from further consideration. Translating tristate constants, symbols, and operators works Instead we introduce our own new operators ⇒ and ⇔ and define mostly the same way as in the paper by Zengler and Küchlin [11] their semantics as shown in Table 1 and Table 2. with the two projections π0 and π1 as listed in table 3. The three constants y, m, and n map to corresponding combinations of > and Table 1. Truth table of tristate* operator ⇔ ⊥, and similarly each tristate symbol AT maps to two proposi- tional variables p0 (AT ) and p1 (AT ). Mapping the unary operator ! is simple, but the entries for the binary operators operators && and ⇔ 0 1 2 || stick out by being generalized to n-ary operators. This is an opti- 0 2 0 0 mization to keep the size of the formulae smaller for long chains of 1 0 2 0 the same operands. 2 0 0 2 Of course, the Tristate* operators ⇔ and ⇒ also have to be translated into Propositional Logic. However, due to our definition of these operands this is trivial for π1 as they yield only 0 and 2, and π0 is intuitive. We use these two translations to also cover the usage Table 2. Truth table of tristate* operator ⇒ of the string operator = with tristate operands. Finding a Propositional encoding for string symbols requires more work, because Propositional Logic is not well suited for encod- ⇒ 0 1 2 ing arbitrary strings. For each string symbol, we iterate over our 0 2 2 2 POF in Tristate* Logic and collect all occurrences in expressions. 1 0 2 2 In our method we consider only cases of equal strings and resulting 2 0 0 2 other equalities and inequalities and define new propositional vari- ables PX S ←s to encode that the string variable X S has the value They express equality and “less than or equal to” inequality on s and accordingly (X S = Y S ) to encode if X S and Y S have the tristate values. We call this extended version of Tristate Logic same value. This suffices, because for the configuration space the ac- the Tristate* Logic. tual value of a string is not important. The same goes for int and Note that Tristate* Logic also contains the string operators = hex symbols. Finally we use the mappings for = and ! to define the and !=. They compare values from the string domain and also mappings for != . yield one of the two values 0 and 2. Mixing tristate and string With these mappings we create our POF in Propositional Logic. 135 Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria As it encodes the configuration space of the Linux kernel, we call it The fact that there are more regular variables in the POF in Tristate* L-POF. Logic than there are declared symbols on the respective architecture To use modern SAT solvers we do one final step: Our L-POF is comes from the fact that it still contains expressions with undeclared not in CNF. So we use the Warthog Logic Framework6 to generate an symbols. They are cleaned in the translation process into Proposi- equisatisfiable formula using the Plaisted-Greenbaum algorithm [6]. tional Logic. Each of our formulae in CNF contains roughly one million Propositional variables. For some very basics analyses with 7 Results Table 6. Redundant or necessary symbols in Linux 4.0 Our implementation consists of more than 7,000 lines of Java Code, using the features of Java SE 6. We measured execution times on a computer with an Intel Core2Quad Q6600 using the Java implemen- arch redundant (⇔ n) necessary (not ⇔ n) tation of the IcedTea project in version 6.1.13.7 on Gentoo Linux. alpha 3223 55 Creating the L-POFs from the Kconfig files for all 30 architectures, arc 4263 63 takes on average less than 3 seconds per architecture. Transforma- arm 1691 75 arm64 3159 135 tion into CNF varies between 20 and 42 seconds, depending on the architecture. avr32 4522 54 To give a rough impression of size of the configuration space, we blackfin 4430 47 c6x 4644 42 show the distribution of symbol types for each architecture in Table 4. cris 3785 34 Across all architectures, the vast majority of symbols has one of the frv 3700 37 two types tristate and bool which has a major impact on the hexagon 4625 45 form of the formula. ia64 3454 74 m32r 4937 32 Table 4. Distribution of symbol types in Linux 4.0 m68k 3741 32 metag 4301 67 microblaze 3251 71 arch tristate bool string int hex total mips 2773 64 alpha 6317 3402 34 192 27 9972 mn10300 3702 39 arc 6293 3352 36 194 29 9904 nios2 4428 45 arm 6371 4724 38 209 41 11383 openrisc 4424 56 arm64 6366 3511 36 195 27 10135 parisc 3499 51 avr32 6360 3462 36 193 30 10081 powerpc 2652 94 blackfin 6380 3676 36 702 43 10837 s390 4149 107 c6x 6292 3293 35 189 28 9837 score 7068 36 cris 6345 3436 45 254 78 10158 sh 3297 67 frv 6316 3378 35 192 28 9949 sparc 3201 51 hexagon 6292 3292 35 190 27 9836 tile 3633 57 ia64 6400 3519 36 195 27 10177 um 7368 28 m32r 6324 3361 34 194 31 9944 unicore32 3541 58 m68k 6322 3453 35 192 37 10039 x86 2301 138 metag 6293 3363 37 193 28 9914 xtensa 3248 43 microblaze 6294 3335 37 195 32 9893 globally 135 1 mips 6391 3964 36 198 28 10617 mn10300 6316 3426 35 199 33 10009 nios2 6292 3303 36 191 36 9858 SAT solving, we use PicoSAT.7 Processing the CNF formula without openrisc 6292 3292 36 188 27 9835 any additional clauses takes PicoSAT between 6 and 10 seconds. We parisc 6325 3384 34 191 27 9961 search for redundant bool or tristate symbols, i.e. symbols that powerpc 6404 3933 37 205 40 10619 can never be active, and for symbols that are necessary, i.e. it is not s390 6313 3453 35 195 27 10023 possible to fully deactivate them. We find out if a symbol is redun- score 6292 3292 35 188 28 9835 dant by assuming that one of the two corresponding Propositional sh 6374 3640 37 198 34 10283 variables is true. If this is not satisfiable, then the symbol is always sparc 6370 3444 37 193 30 10074 inactive. Vice versa by assuming that both Propositional variables are tile 6306 3393 36 191 29 9955 false we find out whether a symbol must always be active. More than um 6297 3362 38 193 27 9917 unicore32 6363 3425 36 191 27 10042 99.8% of the individual tests run in less than three seconds. Table 6 shows the results of these tests. The reason for the high x86 6420 3781 40 206 32 10479 numbers of features that cannot be activated on each architecture is xtensa 6326 3379 41 194 29 9969 that there are many features that run only on few architectures, but the corresponding Kconfig files are common for all architectures. Sym- Table 5 gives a rough overview of how big the formulae grow. bols that cannot be deactivated on the other side are less, but still a It comes to no surprise that the translation from Tristate* Logic lot. They are symbols that are intentionally not deactivatable and in to Propositional Logic and the transformation into CNF using the general they do not represent selectable features, but basic aspects of Plaisted-Greenbaum algorithm both significantly increase the size. an architecture. 6 Warthog Logic Framework: github.com/warthog-logic/warthog 7 PicoSAT: fmv.jku.at/picosat/ Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors 136 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria Table 3. Translation rules for tristate* operators e0 π0 (e0 ) π1 (e0 ) y > ⊥ m ⊥ > n ⊥ ⊥ AT p0 (AT ) p1 (AT ) !e ¬π0 (e) ∧ ¬π1 (e) π V1 (e) W e0 && · · · && en π0 (e0 ) ∧ · · · ∧ π0 (en ) (π0 (ei ) ∨ π1 (ei )) ∧ i∈{0,...,n} π1 (ei ) Vi∈{0,...,n} W e0 || · · · || en π0 (e0 ) ∨ · · · ∨ π0 (en ) i∈{0,...,n} (¬π0 (ei )) ∧ i∈{0,...,n} π1 (ei ) e1 ⇔ e2 (π0 (e1 ) ↔ π0 (e2 )) ∧ (π1 (e1 ) ↔ π1 (e2 )) ⊥ e1 ⇒ e2 π0 (e2 ) ∨ ¬π0 (e1 ) ∧ (¬π1 (e1 ) ∨ π1 (e2 )) ⊥ AT = t π0 AT ⇔ t ⊥ AT = B T π0 AT ⇔ B T ⊥ XS = s PX S ←s ⊥ XS = Y S PX S = Y S ⊥ es1 != es2 ¬π0 (es1 = es2 ) ⊥ To get meaningful results, we merge the results. We collect all valid. Thus, we want to find the minimal set of constraints to remove symbols that do not have any architecture that allows activating, and or change. Such MaxSAT re-configuration use cases have been de- we collect all symbols that are declared across all architectures, but scribed in the context of automotive configuration in [10] and could may not be deactivated on any of them. These numbers are in the be adopted for the Linux kernel configuration. line globally. For most of the 135 symbols that cannot be activated, this is actually the intention of the maintainer. The one tristate symbol that can never be deactivated may intentionally only alternate between the two other possible states. These results do not surprise as similar tests were also done in the context of the VAMOS project and hence many problems have already been uncovered and solved. 8 Conclusion Our approach successfully leads to a precise product overview for- mula. Although Linux has reached more than 10.000 features, our implementation quickly creates the L-POF, a product overview for- mula of the Linux kernel in Propositional Logic. Despite its consid- erable size, fast SAT solving on the formula is in general possible. If needed there is still much room for optimization. A future research direction could be to investigate the possibility of further verification tests beyond redundant and necessary symbols, e.g. specialized verification tests for a choice block analogously to verification tests for positions of a Bill of Materials as it is done in automotive configuration [8]. As we do not use parts of LKC in our implementation, we now have the option to extend our program to do fine-grained tests con- sidering individual lines in Kconfig files and easily locate the origin of inconsistencies. Another interesting topic is re-configuration. Although LKC does not permit invalid configurations by disabling options during the con- figuration process, it might be useful for users to select all wanted options first without caring about the validation. Afterwards, if the configuration is invalid, we can re-configure the selections of the user in an optimal way, i.e. by solving a MaxSAT optimization problem. The reverse is also imaginable: If the selections of a user lead to an invalid configuration, the user might want to know which config- uration constraints have to change in order to make the configuration 137 Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria Table 5. Sizes of POFs for Linux 4.0 arch regular variables auxiliary variables total number of variables in L-POF variables in CNF clauses in CNF in POF in in POF in variables in POF Tristate* Logic Tristate* Logic inTristate* Logic alpha 10673 48845 59518 117417 996839 1812856 arc 10602 48481 59083 116538 954428 1718683 arm 11976 55760 67736 134270 1299812 2849653 arm64 10824 49640 60464 119333 1007563 1828031 avr32 10793 49366 60159 118671 1001036 1816464 blackfin 11539 51058 62597 123096 1026513 1861942 c6x 10548 48174 58722 115799 949031 1708805 cris 10867 49279 60146 118559 999006 1812066 frv 10666 48722 59388 117126 990489 1797565 hexagon 10540 48169 58709 115795 981542 1781667 ia64 10866 49850 60716 119837 1010856 1834072 m32r 10643 48681 59324 117039 993691 1804508 m68k 10717 49136 59853 118115 1008800 1836987 metag 10605 48535 59140 116671 955382 1720572 microblaze 10591 48406 58997 116370 985108 1788040 mips 11249 52034 63283 125090 1048937 1909971 mn10300 10721 48974 59695 117738 994158 1804015 nios2 10566 48235 58801 115951 950035 1710610 openrisc 10544 48168 58712 115783 949121 1709044 parisc 10658 48794 59452 117297 996712 1810111 powerpc 11247 51964 63211 124935 1055822 1917736 s390 10699 49084 59783 117997 998901 1813210 score 10539 48168 58707 115783 949788 1710461 sh 10955 50336 61291 121037 1020515 1854779 sparc 10774 49327 60101 118582 1004762 1823946 tile 10655 48748 59403 117195 990749 1798113 um 10606 48550 59156 116723 998908 1821791 unicore32 10753 49191 59944 118246 998333 1811566 x86 11135 51280 62415 123314 1051478 1913811 xtensa 10674 48786 59460 117278 993296 1802906 References cial Intelligence for Engineering Design, Analysis and Manufacturing, 17(1), 75–97, (January 2003). Special issue on configuration. [9] Reinhard Tartler, Mastering Variability Challenges in Linux and [1] Thorsten Berger, Steven She, Rafael Lotufo, Andrzej Wasowski, ˛ and Related Highly-Configurable System Software, Ph.D. dissertation, Krzysztof Czarnecki, ‘Variability Modeling in the Real: A Perspective Friedrich-Alexander-Universität Erlangen-Nürnberg, 2013. from the Operating Systems Domain’, in Proceedings of the IEEE/ACM [10] Rouven Walter, Christoph Zengler, and Wolfgang Küchlin, ‘Applica- International Conference on Automated Software Engineering, ASE tions of MaxSAT in automotive configuration’, in Proceedings of the ’10, pp. 73–82, New York, NY, USA, (2010). ACM. 15th International Configuration Workshop, eds., Michel Aldanondo [2] The Many Valued and Nonmonotonic Turn in Logic, Volume 8 (Hand- and Andreas Falkner, pp. 21–28, Vienna, Austria, (August 2013). book of the History of Logic), eds., Dov M. Gabbay and John Woods, [11] Christoph Zengler and Wolfgang Küchlin, ‘Encoding the Linux Ker- North Holland, 1 edn., 8 2007. nel Configuration in Propositional Logic’, in Proceedings of the 19th [3] Albert Haag, ‘Sales configuration in business processes’, IEEE Intelli- European Conference on Artificial Intelligence (ECAI 2010) Workshop gent Systems, 13(4), 78–85, (July 1998). on Configuration, eds., Lothar Hotz and Alois Haselböck, pp. 51–56, [4] Wolfgang Küchlin and Carsten Sinz, ‘Proving consistency assertions (2010). for automotive product data management’, J. Automated Reasoning, 24(1–2), 145–163, (February 2000). [5] Sarah Nadi, Variability Anomalies in Software Product Lines, Ph.D. dis- sertation, University of Waterloo, 2014. [6] David A. Plaisted and Steven Greenbaum, ‘A structure-preserving clause form translation’, Journal of Symbolic Computation, 2(3), 293– 304, (September 1986). [7] Steven She and Thorsten Berger, ‘Formal semantics of the kconfig lan- guage’, Technical note, University of Waterloo, 24, (2010). [8] Carsten Sinz, Andreas Kaiser, and Wolfgang Küchlin, ‘Formal meth- ods for the validation of automotive product configuration data’, Artifi- Juha Tiihonen, Andreas Falkner and Tomas Axling, Editors 138 Proceedings of the 17th International Configuration Workshop September 10-11, 2015, Vienna, Austria