Parsing of Context-Sensitive Languages Lukri5 Rychnovsky Dept. of Information Systems, Faculty of Information Technology. Brno University of Technology, BoZet6chova 1, 612 66 Brno, Czech Republic rychnov@fit . vutbr . cz Abstract. This article presentssome ideas from parsing Context-Sensitive languages. Introduces Scattered-Context grammars and languages and describes usage of such grammars to parse CS languages. Also there are presented additional results fronr type checking and formal prograrn verification using CS parsing. I{eywords: Turing Machines, Parsing of Context-Sensitive Languages, Fornral Program Verification, Scattered-Context Grammars. 1 Introduction This work results from [Kol-04] where relationship between regulated pushdown automata (RPDA) and T\ring machinesare discussed.We are particulary inter- ested in algorithms which rnay be used to obtain RPDAs from equivalent Turing machines. Such interest arisespurely from practical reasonsbecauseRPDAs pro- vide easier implementation techniques for problems where Tbring machines are naturally used. As a representant of such problems we study context-sensitive extensions of programming language grammars which are usually context-free. By introducing context-sensitive synta-x analysis into the source code parsing process a whole class of new problems may be solved at this stage of a compiler. Namely issueswith correct variable definitions, type checking etc. 2 Scattered context grammars Definition 1. A scatter-ed co'ntertg'raTrlTnar (SCC) G is a q'uadruple(V,7, P, S), wh e re V i s a fi " n i tes e t o f s y m b o l s,T cV , S e I/\" , andP i s aset of producti on rules of the form (Ar,...,A,-) - (ur,...,un),fr) I,YAi: Aie I/\?,Ywi:wi€V" D e f i n i t i o n 2 . L e t G - ( V , T , P , S ) b e a S C G . L e t ( A y , . . . , A , . )- ( t r r , . . . , w n )€ P. Then we define a deriuation relation + as follows: for 1 S i I n * I let xi€V". Then tt AtxzAz...xnAnIn+r =* I1w1r2w2...Tn.u)nunq1 ** is refl,eriue, t,ransitiueclosure of +. Langu,agegeneratedbu the grammar G is definedas L(c) - {w € T*l,g +. ar}. 220 L. Rychnovsky Theorern 1. Let L,(,SC) be the family of languagesgeneratedby SC grammars whose number of product'ionsthat contains two or more contert-freeproductions (degreeoJ'co'ntert sensiti'uitu) 'is n o'r less. L(R,E) r\e'notesthe farnilg of recur- siueLyenumerablelanguages. L 2 (S C) - Loo(^S C:) L(R E ) Proof. See [lvIed-}S] Lemma 1 and Theorem S. 3 Parsing of Context-Sensitive Languages The main goal of regulated formal systems is to extend abilities from standard CF Ll-parsing to CS or RE families with preservation of easeof parsing. In [Kol-04] and [Rych-0s] rve can find some basic facts from theory of reg- ulated pushdown automata (RPDA). We figured that regulated pushdown au- tomata can in some casessimulate Turing machines so we could use this theory for constructing parsersfor context-sensitivelanguagesor even type-0 ianguages. \AIe have also demonstrated the basic problem of this concept: complexity. Almost trivial Turing machinewas transformed to regulated pushdown automata with aimost 6 000 rules. As rve have seen in previous lines, converting deterministic (linear bounded) Turing machine or scattered-context grammar to deterministic RPDA is very complex task. For the most simpie context-sensitive ianguages corresponding deterministic RPDA has thousandsof rules. If we want to use these algorithms for creating some practical parser for real context-sensitive programming language it rnay result in millions of rules. Therefore, we are looking for another wav to parse context-sensitive languages.We would like to extend some context- free grammar of any common programming language (such as pascal, ClCff or Java). After extending context-freegrammar to corresponding context-sensitive grammar, parsing should be straightforward. 3.1 KontextZAP03 As an cxample of a coniext-free language we llse a language calied ZAP}J [ZAP-03] rvhich has ven' similar syntax to Pascal. We .rvill use follorving pro- gram as an example program in ZAP03 language. int I &, b, c, d; string : s; begin a = L; b=2; {n. d = 15; s = t'foo"; Parsing of Context-SensitiveLanguages 221 cL - \,. end Now we will define KontcxiZAP03, the context-sensitiveextension of ZAP03. In the first phase we will enrich KontextzAPO3 by variable checking. If the variable is urrdefined or assignedbefore initialized, the parser of KontextZAPO3 will finish in error state. We will need to analyze three fragments of ZAP03 code where variables are used (variable c for example). Variable definitiorr i"nt r &, b, c, d; Assignment statement c = 10; And using variables in commancls a = ci Corresponding grammar fragments from ZAP03 are following. Variable definition DC L--, T YP E [: ] [1 d ] ID -L IS T ID-LIST-' [,] tidl ID-LIST ID-LIST -+ e Fragment of assignment statement COMMAND -- [iA] CMDCOMMAND -+ CM D [= ] S T M T t;l And usage variable in command STMT-' [id] 6PER OPER-- E. Symbols in brackets [,] are terminals. Complete ZAP03 grammar has about 70 context-free grammar rules. \Aiedefine grammar of languageKontextZAP03 in the following way. Substi- tute previous rules with these scattered-context ones: ( D c i -, S ' ) --+ (T YP E [ : ] [i a ] ID -LIS T, D ) ( I D _ L I S T ,S , ) - - ( [ , ] [ i d ] I D - L I S T , D ) (rD_Lrsr)-- (e) assignment statemcnt ( C OMMA N D D,) -' ( tl d ] C MDCOMMA N D ,L) ( C M D *) ( [ = ] S T M Tt ; l ) and usagevariable in command ( S T M T ,D ) - ' ( t i d l 0 P E R , D R ) ( O P E R )- , ( . ) . Parsing rrow proceeds in the fblloiving way: star,rtingsymbol is S S' arrd derivation will go as usual until there is DCL S' processedand (DCL, S') --) (TYPE [ : ] [id] ID-LIST, D) rule is to applied. At this moment S' is rewritlen to D indicating that variable [id] is dcfined. When variable [id] is rised on left resp. right side of assignrnent D is rewritten to DL resp. DR according to second resp. thircl previously shown fragment. If variable [id] is used without being definerl befolehand. a parse error occurs becauseS' is not rewritten to D and S' cannot 222 L. Rychnovslgt be rewritten to DL or DR directly. When the input is parsed S is rewritten to program code and during LL parsiug is popped out of the stack. S' is rewritten onto D{LR}* and this is only string that remains. s s ' = + * D C L S ' , + T YP E [: ] [i d] ID -LIS T D = + * C OMMA NDD= + + [1d] CMDCOMMAND DL +* DL If the only remaining symbol is D, it means that variable [id] was defined but never used. If DL+ is the only remaining symbol, we know that variable [id] was defined and used only on the left sides of assignments. Finally if there is the only remaining DR(LR)*, w€ know that the first occurrenceof variable [id] is on the right side of an assignmentstatement and therefore it is being read without being set. Irr all these casesthe compiler should gerreratea warrrirrg.These and similar probiems are usually addressedby a data-flow analysis phase carried out during semantic analysis. Using this algorithur wc can only process one variable at a time. But the proposed mechanism can be easily extended to a finite number of variables by adding new S' . . .' every time we discover a variable definition. Parsing of de- scribed SC grammar can be implemented by pushdown automaton with finite number of pushdowns. The first pushdorvn is ciassic LL pushdown. The second one is variable specific and every [id] holds its own. Because original ZAP03 grammar is LL1 and using described algorithm wasn't any rule added, KontextZAPO3 grammar has unambiguous derivations. A few examples can clear the idea. This program is well-formed according to ZAP03 grammar, but it's semantics is not correct and parsing it as Kon- textZAPO3 program should reveal this error. 1nt r d, b, c, d; string : s; be g i n a = 1; b=2; d = 15; = ttf s oott; cl - L, end Corresponding stack to variable c will be DR wher,t lead to warning: Varlable c read but not set, Second example shows another variation int i d, b, c, d; string : s; begin a = 1; d = 15; s =ttfoo"; a = c; Languages Parsingof Context-Sensitive 223 end Corresponding stack to variable b will be D what lead to warnings (together with previous one): Variable c read but not set. Variable b is defined but never used. 3.2 Typ. checking By using scattered-context granrmars we can describetype set of ianguage (INT, STR) and type check rules directly in grammar. Almost trivial language with type check using 5 stacks can look like this: (r) - (pr"ogra,m) (n,ert) -- (program,) ( p r o g r a m )- - ( d c l ) ( t A p " , ,) - ( [ l n t ] ,, I M ) ( p ro g ra m ) * (fb e g i n ]c o m m a n dfendl ) (tA p" ,, ) * (l stri ng),, S TR ) (dcl,)-, (ty'pr:.1:)dcl2) (:om'mand, D,I I'17,, ) * ( d c l 2 ,^ 9 1, , 4 / 7 ., ) * ( l i d l c m d c o m m a n d ,D L , I M . I I V f ) ( f i d ] i d - l i s t f ; l n e r t ,D , I N T , I ^ l T ) ( c o m m a n d ,D , S T R , , ) - (dcl2,S, STR,, ) - (lid] c'mdco'rn'rnand, D L,STR, S"R) (lid]id-list[;]nert, D, ST R, ST R) (command) -' (e) ( id -l i s t)--- ([, ]td l i s tz ) (cmd) --' ([:]sl rnt[;]) ( id -l i s t2 , S) * (l i d )i d -l i s t, D ) (stmt, D ) - (l i d,l ,D R ) ( z d - l i s t )- , ( E ) ( s t m t . , , , I ^ l T ) - - -( f d i g i t ] , , , , ) (nert) ---(d.cl,) ( s t r n t , . , , S T R ) - - - ( [ s t ' r u o , l,],,, ) Stacks at even positions are the variable specific stacks as mentioned in pre- vious chapter. The first stack is classic LL stack and the rest of stacks at odd positions are temporary used stacks for additional information. Although the underlying context-free grammar is not LL grammar becausethere are several identic rules (command --* [id] cmd command), this grammar is unarnbiguous. Example of error program code can be int r d, b, c, d; string : s; be g l n {. | = 15; c = t'foott; a = c; end becausec = "foo" is not type correct (STR is assignedto INT) and parsing will fail. 224 L. Rychnovslgt 3.3 Other applications of CS languages It is obvious, that using a siurple scattered-context extension of CF languages, we obtain a grammar with interesting properties with respect to analysis of a programming languagesourcecode. We provide some basic motivation examples. 1 . Errors related to usageof undefined variables rnay be discoveredand handled at parse time without the need to handle them by static semantic analysis. 2 . A CS extension of a grammar of the Java programming language, which copes with problems like mutual exciusion of various key'words, such er,s abstract and final, reflecting the fact that abstract methods cannot be declared fi.nal and vice versa. This situation can be handled quite easily, by introducing additional symbol S" and two corresponding rules S" -- A (cor- respondsto abstract) and S" -- F (correspondsto f inal). Obviously only one of the ruies can be used at a time. J. Introducing an observer keyword for methods in Java, which indicates that this method does not modify the state of thzs object (similar to defining rnethod as const in C++). Handling of such keyword in the langr-ragegram- mar is similar to approach taken in the previous example. =. Accounting of statements in a prograln in a ZAP03 ianguage by introducing A riew size key'word,which definesupper bound on the number of statements in current scope. The parser is than extended such that when the keyword is discovered,the number of specializednonterminals (say X) is generated on the stack - as spccified by the keyword occurrence and the grammar of the ianguage is modified accordingly. The rule: C OMMA N D ---[i d ] C M DC O MMA N D changesto: (C OMMA N D X), * ([i d ] CMDC OMMA NE D ), Then, when a staternent rule is used, one X nonterminal is eliminatcd from the stack. If there are no remaining X nonterminals, the parsing immediately fails. The context-sensitivelanguage used in this example is: L(G): {tu.l.lto}, wheretu is word and lu,'l1sis the length of tl written as a decimalnumber. 4 Formal Program Verification Onc of the pr<-rrrrisirrg applicatiorrsof the described corrcept is introducing a no- tion of preconditions and postcondi,tionsto the ZAP03 programrning language. Preconditions and postconditions are essentially sets of logical formulae, which arc required to hold at entering resp. leaving a program or method. For exam- ple, when computing a square root of r,, we can require a positivity of r using precondition {r ;- 0}. A computation of sirius function {y : sinr} a natural po s tc o n d i ti o n{ (y < - 1 ) A (A >- -1)} ari ses. Parsing of Context-SensitiveLanguages 225 Statements of a programming languagesthen induce transformation rules on precondition and postcondition sets. For example, assignment staternent in the fbrni V'.- E defines a transforrnation: {P[Elvl]v:: E{P}, whcre V is a variable, E is an cxpression,P is precondition and PItr/V] derrotes a substitution of V for all occurrencesof E in P. Other transformation examples may be found in [Gor-98]. The purpose of introducing preconditions and postconditions into a languagc is to be able to derive postconditions from specifiedpreconditions using transfor- rnation rules irr a particular program. Such program than carries a fbrmal proof of its correctnesswith it, which is a desirable property. In the ZAP}S languagewe can implement the described concept by introduc- ing two new kevwords pre and post and by extending the nrles of the language grammar with above mentioned transformation rules. When the parsing of a pro- gram is initiated, the precondition set is constructed using the pre declarations and the transformation rules are applied to it as the parsing progressesthrough the source code. When the parsing terrninates, the resulting set of transformed preconditions is compared with the deciared postconditions. If these two sets match, the parsed program is correct with respectto the specifiedpreconditions and postconditions. However, for practical reasons we must define constraints on the possible preconditions and postconditions. Postconditions must be generally derivable from. preconditions or (as a corollary of the Godeis incompleteness theorem) the postconditions need not be provable from the preconditions at ali, but may still hold. In this particular casewe have encountered a program which may be correct, but we carrnot verif-ythis fact. Conclusions In this article are presented some ideas from parsing of context-sensitive lan- guagesbased on scattered-context grammars. Very powerful cxtension of classic parsing techniques is introduced and this context-sensitive extension iet us dis- co\rersoilre corlnlon errors such as undefined or unused variabies in parsing tirne. The extension suggestedin this work is far more beyond. In some conditions we are abie to add some features to language that enables us to check formal pro- granr verification. References [ A h o 7 2 ] A h o , A . , U l l m a n , J.: Thc Theory of Parsing, Translation, and Compiling, Prentice-Hall INC. 1972. 'fheory [Gor-98] Gordon, J. C. \,I.: Programming Language and its Implementation, r998. 226 L. Rychnovslqt [Kol-04] I(ol6i, D.: Pushdown Automata: New N'Iodifications and Transformations, Habilitation Thesis, 2004. [lvled-00] Meduna, A., Kol, D.: Reguleted Pushdown Automata, Acta Cybernetica, Vol. 14, 2000. Pages 653-664. [Med-03] Medutta, A., Fernau, H.: On the degree of scattered context-sensitivity, El- sevier, Theoretical Computer Science 290 (2003), Pages 2121-2124. [Sal-73] Salomaa, A.: Irormal Languages,Academic Press, New York, 1973. [Rych-Os] Ryc]nrovsky, L.: Relation betrveen regulated pushdown autonrata and Tur- ing mascines,Report from semestral project, 2005. IZAP-03] htt p: I I www.fi t. vutbr.cz/study f coursesf ZAP/public I pr o ject/