172 An Algorithmic Representation of the Syntax Diagram of a Computer Programming Language Anichebe Gregory Emeka University of Nigeria, Nsukka, Enugu State, Nigeria Abstract Every programming language has its own syntax rules. Such rules can be represented with either the Backus-Naur form (BNF) notation or with a Syntax diagram (also called a Railroad diagram). BNF uses text-based mathematical notations for defining those rules, while a Syntax diagram employs a graphical approach. Converting any of the two techniques to an algorithm or computer program is somewhat difficult for students due to the recursive expressions used by each of the techniques in defining the syntactic rules of a grammar. The aim of this work is therefore to showcase how an algorithm for one of such techniques (namely, a syntax diagram) can be written for easy understanding and implementation with a computer. A Finite State automata (FSA) approach was adopted by the researcher for modelling any given grammatical rule of a programming language for easy implementation with a computer. The grammatical rules for generating an integer number was arbitrarily selected by the researcher, amongst other rules, for formulating the required algorithm. The algorithm (which is a pseudocode) was written to be in tandem with the FSA model for easier understanding and programming. Results showed that when the pseudocode is implemented with a computer with some trial data, every data that conformed with the grammatical rules for generating integer numbers was accepted as ”valid integer”, while other incoherent ones were declared “invalid integer”. This helps in smoothening the understanding of students of any rigorous or recursive problem for easy implementation with a computer. Keywords Backus-Naur form (BNF), Syntax diagram, Algorithm, Computer program, Finite State automata (FSA), Recursion 1. Introduction ::= ::= | A Backus-Naur form is a notation used in the ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 field of Computer Science to express the syntax of a programming language [1]. The expression Figure 1: The BNF definition for unsigned integer contains a list of all the rules that defines a number particular grammar of a programming language. The three basic symbols used by the BNF are: In plain English, the first line of figure 1 states ::= (which means, “is defined as”) that an unsigned integer, , is defined | (which means, “Or”) as a number, . The second line contains <> (angular brackets that contain two rules which state that: (i) is defined as a category name) followed by , or (ii) is For instance, to use the BNF to define the defined as . The third line contains 10 grammatical rules for generating an unsigned rules which state that is defined as ‘0’ or integer number, we have the following ‘1’ or ‘2’ or ‘3’ or ‘4’ or ‘5’ or ‘6’ or ‘7’ or ‘8’ ‘9’. structure: The above grammar contains a total of 13 rules. Each rule has a left part and a right part. The ‘left part’ defines the ‘right part’. Any symbol appearing on the left part (or both parts) of a ISIC’21: International Semantic Intelligence Conference, February 25-27, 2021, Delhi, India rule is called a non-terminal symbol, while any EMAIL: gregory.anichebe@unn.edu.ng (A.G) symbol that appears only on the right part (but ORCID: 0000-0003-4057-6277 (A.G) not on both parts) of a rule is called a terminal © 2020 Copyright for this paper by its authors. Use permitted symbol. In other words, the terminal symbols CEUR under Creative Commons License Attribution 4.0 International (CC BY 4.0). are the sentences (or string) that can be derived from a grammar. The first non-terminal symbol h tt p:/ / c eu r-ws. org Worksh op CEUR Workshop Proceedings (CEUR-WS.org) I SSN 1613 -0073 Pr oceedin gs is called the start symbol. All generation of sentences of a grammar commences from the 173 start symbol. In figure1, the start symbol is diagram). . According to [2], The Backus-Naur Form is a way of defining syntax. It consists of • a set of terminal symbols • a set of non-terminal symbols • a set of production rules of the form, Left-Hand-Side ::= Right-Hand-Side where the LHS is a non-terminal symbol and the RHS is a sequence of symbols (terminals or non-terminals). In figure1, the set of non-terminal symbols are, {, , }, while the set of terminal symbols are, {0,1,2,3,4,5,6,7,8,9}. The grammar contains a recursive definition of how 6 1 4 to generate an unsigned integer number, as shown in the second line of the grammar. Such Figure 3: The string, 614 is derivable from the grammar, recursive expression appears somewhat difficult and is therefore a valid integer for an ordinary person to easily understand, not to talk of converting it to an algorithm or computer program. According to William [3], recursion is often regarded as a deep mystery by novices in mathematics or computing, and so aught to be reserved for more advanced courses. The same recursive scenario occurs when the grammar of figure 1 is represented with a syntax diagram, as shown in figure 2. 2 T? Figure 4: The string, 2T is not derivable from the grammar because the symbol, T is not defined; therefore the string is an invalid integer 0 1 Figures 3 and 4 show the various recursive steps required for validating a sentence of a 2 grammar. The question now is, “How can such 3 steps be easily implemented programmatically 4 with a computer?” This forms the basic research 5 question for this work, and of which a solution to it is expatiated in section 3 under 6 “Methodology”. 7 8 9 2. Literature Review In Computer Science, recursion is typically used Figure 2: The syntax diagram definition for an unsigned for solving problems that involve recursive integer relations (such as the generation of integer numbers shown in figure 1 and figure 2 of A ‘parse tree’ shows how valid (or invalid) section1). According to [3], “The concept of sentences can be derived (or non derivable) recursion comes from mathematics where we from a grammar, as shown in figure 3 and figure often encounter recursive relations”. For 4, respectively for the derivation of the string, example, consider the problem of raising a real 614 and 2T from the grammar defined in figure number, X, to an integer power, N. The problem 1 (using BNF) or figure 2 (using syntax can be solved recursively by stepwise 174 refinement (i.e. breaking a problem down into stage is reached. Simply put, [4] explains that, simpler computational parts), as shown in table “Recursion is a process whereby a function calls 1. itself inside its body for the execution of a task until a base case is reached”. The beauty of Table 1: Evaluation of XN by recursion recursion is that it presents a clearer, intuitive, and simpler solution to a problem which would Problem Recursive process Step have been very difficult or too clumsy to solve through other means [5]. For instance, a popular XN X . (XN – 1) 1 mathematical puzzle called ‘The towers of Hanoi’ can be easily solved with recursion, as X . X . (XN – 2) 2 elegantly illustrated by [5], [3], and [6], to X . X . X . (XN – 3) 3 mention a few. The puzzle would have been too clumsy or nasty to solve through other means etc. etc. such as Iteration. The BNF notation, according to [7], was Thus, each recursive step is a simpler version of named after the two inventors: John Backus of the initial problem. The process continues until the United States of America, and Peter Naur of a termination point is reached (i.e. X0 = 1 in the Denmark. The notation makes extensive use of above example) where no further recursive recursion in defining the syntax of a process occurs, and the final result then programming language very succinctly. The determined by backward substitution. The mystery behind recursion can be demystified by above problem can be solved programmatically understanding it as a stepwise refinement of the using a Java function as follows: initial problem (by divide-and-conquer technique) until a trivial case (or terminal point) public static double power(double X, int N) is reached that requires no further { simplification. There are [now] many variants if(N = = 0) and extensions of BNF, generally either for the { sake of simplicity and succinctness, or to adapt return 1.0; it to a specific application, [8]. Typical examples } of these variants are given by [9] as, “Extended else BNF (EBNF) notation”, “Augmented BNF (ABNF) { return X * power(X, N-1); notation”, and “Regular extensions to BNF } notation”. The EBNF notation is almost a } superset of BNF, and is now the most commonly used structure for describing the grammar of a Figure 5: A Java function code for evaluating the language. On the other hand, the ABNF notation function, XN, recursively is commonly used for describing the Internet Engineering Task Force (IETF) protocols; while Thus, we can see that, recursion is a process of the Regular extensions to BNF notation are breaking a computation down in such a way that popularly used by the Python programming a simpler computation of the same kind with the language for its lexical specifications. Table 2 previous problem is derived, and the shows some of the basic differences amongst decomposition process continues until a trivial these notations. Table 2: Some basic differences amongst the BNF, EBNF, ABNF, and Regular extensions notations Examples Description BNF EBNF ABNF Regular BNF EBNF ABNF Regular of symbol extension extension non-terminal uses appears appears appears ::= answer = answer= answer::= symbol angular plain plain plain reply; reply reply bracket, <> definition of ::= = = ::= ::= age = age = integer age ::= rule integer; integer alternative | | / | ::= reply= “yes” reply= “yes” / reply::= rule or choice “yes” | “no” | “no”; “no” “yes” | “no” 175 End of a rule whitespace ; whitespa whitespace ::= result = result= score result ::= ce score; score concatenation the , same as same as ::= CGPA = CGPA = digit CGPA:: = symbol symbols BNF BNF . ,digit; digit next to each other optional rule defined as [...] [...] [...] ::= name name =[title] name::=[ti a separate =[title,] next tle] next rule <next> | next; <next> repeating an defined as {...} * (used as ? (used as <title> ::= title = {“mr” title = title :: = expression 0 a separate prefix) postfix) <“”>|<“mr”> | | “mrs” | *(“mr”/ (“mr” | or 1 times rule <“mrs”> | others}; “mrs”/ “mrs” | <others> others} others)? repeating an defined as {...} n* (used + <age>::= age = age =1*digit; age::= expression 1 a separate as prefix) <digit><no>| digit{digit}; digit+ or more rule <digit> times repeating an defined as {...} n*m n*m (used <age>::= age = age = age :: = expression n a separate (used as as postfix) <digit> | digit{2*digit 1*3digit; digit1*3 to m times rule prefix) <digit><digit }; (where n ≥ 1, >| and m > n ) <digit><digit ><digit> The syntax diagram is used to represent the BNF ▪ δ is a transition function that takes up a notation pictorially for easier understanding, pair of state and input symbol (say, whereby each of the non-terminal symbols in a <q,a>) in order to determine the next grammar is represented as a block or module state (say q՛) for the machine; i.e., δ(q,a) that performs a given task, as earlier illustrated = q՛ in figure 2. By the words of [10], “A syntax The above representation conforms perfectly diagram is a pictorial method of representing with the generation of sentences using the the format of components in a programming syntax diagram or BNF notation, as shown in language, and the direction of the arrowed lines table 3. indicates the order in which the diagram is to be read or followed”. This is where the use of Finite Table 3: The similarity between syntax diagram/BNF State Automata (FSA) becomes very necessary notation and Finite state automata in showing pictorially, as well as in a more compact form, how the sentences (or terminal Syntax diagram / Finite state automata symbols) of a grammar can be generated for BNF notation equivalence easy implementation with a computer. The use terminal symbols Q (finite set of states) of FSA defines the syntax of a grammar as well supplied input string ∑ (input alphabet) as its recursive processes very clearly so that start symbol q0 (start state) valid or invalid sentences can easily be valid sentences F (set of final or accepting identified. According to [11], states) A finite state automata (FSA) is a simple Production rule δ (transition function) idealized machine that recognizes patterns from the input [which consists This work therefore showcases how the FSA can of finite string of symbols] taken from be used to model a syntax diagram/BNF some character set (or alphabet). The notation for easy conversion to a computer job of an FSA is to accept or reject an program. input depending on whether the pattern defined by the FSA occurs in the 3. Methodology input According to [12], an FSA (simply denoted as, The syntax for generating signed or unsigned M) consists of 5 components: M = (Q, ∑, q0, δ, F), integer numbers using the syntax diagram or where, the BNF notation was used by the researcher to ▪ Q is a finite set of states illustrate how such rules can be implemented ▪ ∑ is the input alphabet programmatically. Figure 6 shows the syntax for ▪ q0 ϵ Q is the start state generating signed or unsigned integer numbers ▪ F ϵ Q is the set of final or accepting using the BNF notation. states 176 <int> ::= + <no> | – <no> | <no> <no> ::= <digit><no> | <digit> + <no> <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 <integer <no> > – <no> Figure 6: The BNF definition of signed and unsigned integers <no> <digit> <no> The syntax diagram representation of figure 6 is <digit> shown in the following figure 7. <digit> 0 1 2 3 4 5 6 7 8 9 Figure 7: The syntax diagram definition of signed and unsigned integers We can use a finite state automata to model any of the two techniques, as shown in figure 8. 1 P d՛ d d final or accept state Input d E Error string 0 3 4 5 or start d՛ + E՛ m Reject state d state d՛ P՛ + m՛ + d՛ 2 Variable description P = plus (+) sign P՛ = not a plus sign m = minus (-) sign m՛ = not a minus sign d = digit (0,1,2,3,4,5,6,7,8,9) d՛ = not a digit E = ‘Enter key’ character E՛ = not ‘Enter key’ character Figure 8: The finite state automata for generating signed and unsigned integers Figure 8 is a directed tree that contains all the down to the generation of sentences (i.e. ‘terminal symbols’ of the grammar (which are concatenation of terminal symbols) from the represented as nodes in the diagram), and grammar, as well as how the sentences are ‘production rules’ of the grammar (which are derived (i.e. the use of production rules to derive represented as arrows). The reason for using the sentences). Thus, Figure 8 consists of 6 only ‘terminal symbols’ and ‘production rules’ states (i.e., states 0,1,2,3,4, and 5). State(0) is the for such representation is because every ‘start state’ at which the automata machine grammar of a programming language boils receives an input string for parsing. State(1) is 177 the state for receiving ‘+’ sign. State(2) is the belongs to state(3); the FSA therefore state for receiving ‘–’ sign. State(3) is the state moves from state(0) to state(3). the 2nd for receiving digits(0,1,2,...,9). State(4) is the character of the string is digit(2) which ‘accept state’ for an input string that successfully also belongs to state(3); the FSA moves from state(0) to state(4). Finally, state(5) therefore moves from state(3) to is the ‘reject state’ for an input string that moves state(3) – a recursive movement or from state(0) to state(5). The arrows in the transition. The 3rd character of the figure indicate how the terminal symbols are string is digit(5) which again belongs to concatenated to generate a valid integer state(3); the FSA therefore makes number. Thus, the automata machine changes another recursive transition from state (or transition) according to the characters state(3) to state(3). The last character in the input string in order to determine valid or of the string is the <enter key> which invalid integer. For instance, the input string, belongs to state(4); the FSA therefore +621 undergoes the following states: 0 => 1 => moves from state(3) to state(4) – the 3 => 3 => 3 => 4. It is therefore accepted as a ‘Accept state’. The string, 725, is valid integer number. subsequently accepted as VALID integer. 4. Data Analysis ▪ Similarly, the second input string, – 6A31<enter key>, is parsed as follows: at state(0), the finite state automata Table 4 shows in detail how the automata (FSA) receives the input string supplied machine of figure 8 can be used to parse a by the user; the 1st character of the sample of some input string such as 725, – string is a minus(–) sign which belongs 6A31, and +9. to state(2); the FSA therefore moves from state(0) to state(2). the 2nd Table 4: An illustration of input string parsing with the character of the string is digit(6) which automata machine of figure 8 belongs to state(3); the FSA therefore moves from state(2) to state(3). The 3rd Input Start character by Next Result character of the string is letter(A) which string state character state belongs to state(5) – ‘Error or Reject parsing state’ because the received character is 725<enter 0 7 3 key> d՛ (i.e. not a digit); the FSA therefore 2 3 moves from state(3) to state(5). 5 3 Subsequently, the string, –6A31, is <enter key> 4 Valid rejected as INVALID integer (without integer parsing the remaining characters, ‘31’, –6A31 0 - 2 in the string). <enter ▪ Lastly, the input string, +9<enter key>, key> is parsed as follows: at state(0), the 6 3 finite state automata (FSA) receives the A 5 Error (invalid input string supplied by the user; the 1st integer). character of the string is a plus(+) sign The which belongs to state(1); the FSA symbol, ‘A’ therefore moves from state(0) to is illegal state(1). The 2nd character of the string +9<enter 0 + 1 is digit(9) which belongs to state(3); the key> FSA therefore moves from state(1) to 9 3 state(3). The 3rd character of the string <enter key> 4 Valid is the <enter key> which belongs to integer state(4); the FSA therefore moves from state(3) to state(4) – the ‘Accept state’. ▪ The input string, 725<enter key>, is The string, +9, is subsequently accepted parsed as follows: at state(0), the finite as VALID integer. state automata (FSA) receives the input ▪ The pseudocode for implementing the string supplied by the user; the 1st finite state automata of figure 8 is character of the string is digit(7) which shown in figure 9 that follows. 178 1. input any integer number as string 1.1 input strgintno 2. determine the number of characters, N, of the string, strgintno 2.1 N = length(strgintno) 3. determine the first character, fchar, of the string, strgintno 3.1 if fchar = ‘+’ then process state1() else if fchar = ‘–‘ then process state2() else if fchar = digit(‘0’ or ‘1’ or ‘2’ or ‘3’ or ‘4’ or ‘5’ or ‘6’ or ‘7’ or ‘8’ or ‘9’) then process state3() else process state5() //error routine end if 4. //definition of functions 4.1 state1() Determine the second character, schar, of the string, strgintno if schar = digit(‘0’ or ‘1’ or ‘2’ or ‘3’ or ‘4’ or ‘5’ or ‘6’ or ‘7’ or ‘8’ or ‘9’) then process state3() else process state5() //error routine end if 4.2 state2() Determine the second character, schar, of the string, strgintno If schar = digit(‘0’ or ‘1’ or ‘2’ or ‘3’ or ‘4’ or ‘5’ or ‘6’ or ‘7’ or ‘8’ or ‘9’) then Process state3() Else Process state5() //error routine End if 4.3 state3() //determine the subsequent characters of the string, strgintno, as follows: for charcount = 1 to N if strgintno(charcount) = digit(‘0’ or ‘1’ or ‘2’ or ‘3’ or ‘4’ or ‘5’ or ‘6’ or ‘7’ or ‘8’ or ‘9’) then Continue else process state5() //error routine end if next charcount //display the status of the data supplied Print “valid integer” Stop 4.4 state5() //display the status of the data supplied Print “invalid integer Stop Figure 9: The pseudocode for the implementation of the automata machine of figure 8 5. Results and Discussion -946890 VALID integer +73.6 WRONG integer Table 5 shows the output of the program of +152 VALID integer figure 9 when some input data are supplied to the computer during its execution. -600 VALID integer Table 5: A sample output of the program of figure 9 496+ WRONG integer 8112 VALID integer Input string Output 16 – 4 WRONG integer 5A WRONG integer 2 VALID integer C32 WRONG integer 179 008 VALID integer http://www.cs.umsl.edu/~janikow/cs4 1,500 WRONG integer 280/bnf.pdf The use of automata machine to model the [3] William, I.S., “Structures and generation of integer numbers (which are Abstractions – an introduction to defined recursively by a syntax diagram or BNF Computer Science with Turbo Pascal”, notation) makes the programming aspect of its Richard Irwin inc., 1992, p.409, p.415, implementation very easy to write and p.428 understand. The 6 states in the machine were well represented in the program (as functions [4] “What is Recursion? How is it helpful, that perform specific tasks) so that any and where it is used?”, 2020. URL: character that goes contrary to the rules of the www.equationanswers.com/c/c- grammar is reported by state(5) as an error, and recursion.php the parsing process is terminated immediately without processing the remaining part of the [5] Liang, Y.D., “Introduction to Java input string (if any). For instance, the input Programming”, 3rd ed., Prentice-Hall string, ‘+73.6’ in the sample output of table 4 is Inc., Upper Saddle River, New Jersey, ‘WRONG integer’ because the decimal point(.) is 2001, p.123 not defined in the grammar being represented by the automata machine. Again, the input [6] Deitel, P. & Deitel, H., “Java-How to string, ‘496+’ is ‘WRONG integer’ because the Program”, 8th ed., Pearson Education order of arrangement of the characters does not Inc., Upper Saddle River, New Jersey, agree with the syntax of the grammar. On the 2010, p.790 other hand, any input string that transits successfully from state(0) to state (4) is [7] Daniel, D.M. and Edwin, D.R., “Backus- accepted as VALID integer because it agrees Naur form (BNF)”, 2020. URL: with the syntax of the grammar in (i) order of https://www.researchgate.net/publicat arrangement, and (ii) in symbols. ion/262254296_Backus- Naur_form_NBF [8] Wikipedia, “Backus-Naur form”, 2020. 6. Conclusion URL: https://en.wikipedia.org/wiki/Backus Recursion is a powerful mathematical technique %E2%80%93Naur_form used ubiquitously in Computer science. The BNF notation or syntax diagram employs it [9] “Grammar_The language of extensively in defining the grammatical rules of languages(BNF, EBNF, ABNF, and a programming language. The use of Finite State more)”, 2020. URL: Automata (FSA) in transforming the ‘terminals’ http://matt.might.net/articles/gramma and ‘production rules’ of a grammar into a rs-bnf-ebnf/ directed graph helps immensely in modelling the grammar into a very compact form for easier [10] Holmes, B.J., “Pascal Programming”, 2nd understanding and programming. ed., The Guernsey Press Co Ltd., Channel Islands, 1990, p.30 [11] “Finite Automata”, 2020. URL: References https://www.cs.rochester.edu/u/nelso n/courses/csc_173/fa/fa.html [1] “Theory of Computation: Backus-Naur [12] Carol, C. & David, J.E., “Finite-State form”, 2020. URL: Automata”, 2020. URL: https://en.M.wikibooks.org/wiki/A- https://eng.libretexts.org/Bookshelves level_Computing/AQA/Paper_1/Theory /Computer_Science/Book%3A_Foundat _of_computation/Backus-naur-form ions_of_Computation_(Critchlow_and_E ck)/03%3A_Regular_Expressions_and_F [2] “What is BNF?”, 2020. URL: SA's/3.04%3A_Finite-State_Automata