Finite State Automata – a concept for data representation © Marian Mindek, Michal Burda Department of Computer Science, FEI, VSB - Technical University of Ostrava, 17. listopadu 15, 708 33, Ostrava-Poruba, CZ {marian.mindek, michal.burda}@vsb.cz Abstract More about it can be found in chapter Compression). Tail of the paper is intended to discuss the methods In this paper, we introduce finite automata as a of storing automata to usual data tables, matrices or tool for matrix specification and compression. XML documents. We also describe, how to get additional interesting information from such automata. At 2 How to represent matrix with automaton last, we focus on techniques for storing resultant automata as matrices, tables of an SQL database, or as XML document. 2.1 Elementary theory To understand the following, we allege here the 1 Introduction necessary background only. For more about automata theory please read [1, 6]. Finite automata can be used as a tool for efficient In the subsequent, we work with images instead of matrix storage with the possibility of compression. matrices. It is due to simplicity and better Since matrices are general purpose data structures, this understandability of this paper. To work with matrices, approach could be used on images, text, sound files etc. no additional effort is needed, as one may realize. Also, Storing suitable data as automaton brings up also the for more about using automata as a tool for specifying benefit of obtaining the additional interesting image, please read [3, 4, 5, 7, 8]. information about our data [1, 3, 4, 5, 7, 8]. A digitized image of the finite resolution m x n Such technique is especially useful when dealing consists of m x n pixels each of which takes a Boolean with large binary matrices. A traditional approach value (1 for black, 0 for white) for bi-level image, or (compression using common algorithms) solves only a real value (practically digitized to an integer value from part of the problem – they save a lot of space, but there 0 to 256) for a gray-scale image, or 24bit information is no way how to make any changes on the compressed (RGB) for true-color image. matrix. In the subsequent, we consider only square images In the next chapter, we allege only the necessary of resolution 2n x 2n. In order to facilitate the background of the automata theory. After that, we application of finite automata to image description, we describe simple algorithm for compressing matrices by can assign unique word (path through the automaton) of creating the finite state automata. length n over the alphabet As one can see in the following sections, we can use the created automaton to get additional interesting Σ={0, 1, 2, 3} information about the compressed data, namely the to each pixel of the 2n x 2n resolution image. patterns of similar parts of source matrix. Such patterns Each Σ’s symbol in the word represents the address may be used in special searching algorithms to find e.g. of a sub-square of the square addressed with the similar parts of faces, medical pictures, buildings preceding symbols of the word. We choose ε as an tracing, parts of large sparse matrices, similar noise, address of the whole unit square. similar trends, pieces of text, etc. Single digits, as shown in the left of figure 1, Storing matrices as automata is not the read-only address the quadrants. Thus, the four sub-squares of a compression. It is also possible to modify the square with address w have address w0, w1, w2 or w3, compressed matrix. However, this process is not very respectively. The middle of fig. 1 shows addresses of all straightforward, but there exist many variants how to pixels of a 4 x 4 image. The sub-square (pixel) with enable update of compressed matrices: we can do the address 3203 is shown on the right of figure 1. slow re-compression of the overall matrix to get best compression ratio, it is also possible to re-compress the changed part only, which is faster but saves less space. Proceedings of the Spring Young Researcher's Colloquium on Database and Information Systems SYRCoDIS, St.-Petersburg, Russia, 2005 Example. Clearly, L1 = {1,2}*0 are addresses of the infinitely many squares illustrated at the left of Fig. 4. If we place the completely black square defined by L2 = Σ* into all these squares we get the image specified by the concatenation L1L2={1,2}*0Σ* which is the triangle shown in the middle of Fig. 4. Figure 1. The addresses of the quadrants, of the sub- square of resolution 4 x 4, and the sub-square specified by the string 3203. We denote Σm the set of all words over Σ of the length m, by Σ* the set of all words over Σ. Figure 4. The squares specified by {1,2}*0, a triangle In order to specify a black-white image of resolution defined by {1,2}*0Σ*, and the corresponding 2m x 2m, we need to specify a language L ⊆ Σm where automaton. the word w belongs to L iff the coresponding sub- square on the image is black. Example. By placing the triangle L= L1L2 from the The automaton coresponding to the given image previous example into all the squares with addresses should be created as to recognize the language L. That L3={1,2,3}*0 we get the image L3={1,2,3}*0{1,2}*0Σ* is, it must end in accept state iff the sub-square of given shown at the left of Fig. 5. address is black. To be able to compress color images (multi-valued Zooming [5] is easily implemented for images matrices), each ending state of the automaton must represented by regular expressions and is very contain the color (value) of each pixel (cell) in the sub- important for matrix compression shown in next square. section. Now, we are ready to give some examples. We assume that the reader is familiar with the elementary facts about finite automata and regular sets – see e.g. [1, 6]. Example. Consider the 2 x 2 chess-board in the left of figure 2. Its automata could be described by a regular expression {1,2}Σ*. Please note, the regular set also describes the 2 x 2 chess board in arbitrary resolutions Figure 5. The diminishing triangles defined by (concretely, 2n x 2n for any positive integer n). {1,2,3}*0{1,2}*0Σ*, and the corresponding automaton. The 8 x 8 chess-board in the right of figure 2 can be described by the regular expression Σ2{1,2}Σ* or by automaton A in Fig. 3. 2.2 Construction of Finite Automaton We have just shown that a necessary condition for black and white multi-resolution image (is evident, that the same reads for binary matrices) to be represented by a regular expression is that it must have only a finite number of different sub-images in all the sub-squares with addresses from Σ*. We will show that this Figure 2. 2 x 2 and 8 x 8 chess-boards condition is also sufficient. Therefore, matrices that can Notice that here we used the fact that the regular be perfectly (i.e. with infinite precision) described by expression Σ2{1,2}Σ* is the concatenation of two regular expressions (finite automata) are images of regular expressions Σ2 and {1,2}Σ*. regular or fractal character (matrices with many same part). Self-similarity is a typical property of fractals. Any image can by approximated by a regular expression (finite automaton) however; an approximation with a smaller error might require a larger automaton. Now we will give a theoretical procedure which, given a multi-resolution image or multi-frequency Figure 3. Finite automaton A defining the 8 x 8 chess- sound, finds a finite automaton perfectly specifying it, if board. such an automaton exists. (Multi-resolution principle can be applied to mathematical binary matrices, but from word w (a, b, c ∈ {0,1,2,3} - represent sub-square only occasionally. For text is not useful!) of matrices). Procedure Construct Automaton For given matrix M, we denote Mw the zoomed part of M in the square addressed w. The (sub) matrix represented by state numbered x is denoted by ux. Procedure Construct Automaton i=j=0 create state 0 and assign u0 = M assume ui = M w loop for k ∈ {0,1,2,3} do if M wk = uq for some state q then Figure 6. Analysis of possible cases in constructed create an edge labelled k automaton. from state i to state q else j=j+1 uj = M wk create an edge labelled k from state i to the new state j end if Figure 7. Analysis of possible cases in constructed end for automaton from figure 6 without loop. if i == j than Stop (all states have been processed) The cases (ii) and (iii) in figure 6 correspond with else case (II) in figure 7. We can depress increasing count of i=i+1 state at approach where is not loop. end if end loop Procedure for reconstruction matrix from automaton end procedure is very simple, than we describe only procedure Reconstruct matrix for compressed matrix at next The procedure Construct Automaton terminates if section. there exists an automaton that perfectly specifies the given matrix and produces a deterministic automaton 3 Compression with the minimal number of states. Our algorithm for non-binary matrix (e.g. grey-scale image, sound, text) is based on this procedure, but it will use evaluated 3.1 Basic compression finite automata (as like WFA) introduced in the next In this part of chapter, we will demonstrate in brief a section and only replacing binary information 0/1 to a method for matrix compression / decompression real value (e.g. 256 colour, or greyness image), no applicable on construction of matrix storage, or matrix creating loop and add some option for set-up database. Our algorithm is based on algorithm shown in compression. previous section. There lead 4 edges from each node at most and they are labelled with numbers representing Example: For the Image diminishing triangles in Fig. matrix part. Every state can store information of 5, the procedure constructs the automaton shown at the average value of given sub-square. right-hand side of Fig. 5. First, the initial state D is The procedure Construct Automaton for created a processed. For 0 a new state T is created, for compression terminates if there exists an automaton that 1,2 and 3 a loop to itself. Then state T is processed for 0 perfectly (or with the defined error) specifies the given a new state S is created, for 1 and 2 a loop to T. There is matrix and produces a deterministic automaton with the no edge labelled 3 coming out of T since the quadrant 3 minimal number of states. The number of states can be for T (triangle) is empty. Finally the state S (square) is a little reduced or extended by changing error threshold. processed by creating loops back to S for all 4 inputs. This principle is naturally useful only for matrices where it is possible to apply loss-compression. In this example we can see, that state D does not Changing the part (or only one matrix element) in contain edge labelled 0 and state T edge labelled 0 and source matrix changes the number of states in resultant 3. If either one of them is not present several cases may automata. happen, as shown in figure 6. The same example, but In the following, we propose one of the simpler and without loop is shown in figure 7. Where 0 ... X, X∈ faster algorithms for reconstructing the matrix from the {1,2,3,4} denote some state of automaton (not automata. It is recursive algorithm, but it does not dive necessarily adjoining), and a, b, c represent symbol very deeply. Procedure Construct Automaton for Compression For given matrix M, we denote Mw the zoomed part of Procedure Reconstruct Matrix M in the square addressed w. The matrix represented by qo = w = {ε} = M (matrix represented by the empty state numbered x is denoted by ux. word) i(qo)=1 Procedure Construct Automaton for Compression t(qo)=∅(ε) (average greyness of the matrix M) i=j=0 create state 0 and assign u0 = M (matrix represented recursively for state q do by empty word and define average value of M) for u = u0, u1, u2, u3do assume ui = M w u = Mua loop if u = t(qo) and word has shorter then for k ∈ {0,1,2,3} do requested then if M wk = uq (or with small error) or if RC q = uX the matrix Mwk can be expressed as a else part or expanded part of the matrix uq RC q = q(uY) for some state q then end if create an edge labelled k end for from state i to state q if u = {ε} or u = {requested word} else stop j=j+1 end if uj = M wk end recursively create an edge labelled k from state i to the new state j where: X denotes part of image end if Y is a next part of image end for ∅(x) is average value of the matrix part Mux, we can if i == j than change to computed value, if we wont that. Stop (all states have been processed) else Example: Original computed word is w = {3203}, see i=i+1 figure 1. The length of word is 4 ⇒ the number of end if elements of matrix is 256 (clearly x2 where x is y2 where end loop y is z2 where z is 2). If we process all words we obtain end procedure all matrix elements, if we process only first 2 symbols from the word we obtain only matrix with 22 elements. The procedure Reconstruct Matrix stops for every The average value of element (if value is from interval automata computed by a procedure Construct 0-255) with address w = {32} is approximately 16, then Automaton (for Compression), or other similar we obtain matrix, where sub-square with address w = algorithm. The procedure computes original matrix only {32} have every elements equal 16. This principle is if we process every word from input alphabet and useful only for images, sounds or el. signals. Text will alphabet was computed by loss free procedure. be destroyed. The procedure Reconstruct Matrix can Otherwise, we obtain approximately same matrix. The reconstruct all matrix elements (same dimension as percent similarity is in relation with length of processed original) with defined error if we use word with first 2 word and original computed word. In figure eight is symbol and append empty word w={32εε}. depicted other approach to storing and reconstructing the image (real evaluated matrix). 3.2 Compression and changes Procedure Reconstruct Matrix (RC) For given automata A, make matrix Mw the zoomed part Now we describe, how to re-compress the matrix after of M in the square addressed w. The matrix represented some updates. by state number x is denoted by ux. If we compress the matrix with traditional algorithm and some element is changed, we must re-compress all matrices every time. But if we represent matrix as a Finite State Automata, we can change / re-compress only corresponding part with changed element. There exist at least three basic solutions for selection of corresponding part of Finite State Automata or corresponding sub-square of source matrix: 1) Re-calculating the biggest corresponding sub- square: Figure 8. Reconstructing image from root to the leaf node. This approach leads to a big quantum of data We can easily get the group of equal matrix parts. The manipulation (as much as one quarter), but with this algorithm for assembling resultant automata together is approach we can reach the high compression ratio. very simple. Method is useful for all types of source matrices. 2) Re-calculating the least corresponding sub- square: This approach re-calculates the least quantum of data (radix ten elements), but with this approach we gain very small compression ratio and often changes lead to the grow in size of the automata. Compression is after that disutility, but if we want only obtain the interesting information from our resultant automata, this method is useful too. This approach is useful for all types of source matrices. 3) Re-calculating the optimal corresponding sub- square: Retrieval of such sub-square may be difficult, but in most cases, it shows that it hasn't sense to work with sub-square greater than three or four least corresponding sub-squares. Naturally, it greatly depends on the character of the matrices. If we know that our matrices contain many equal blocks (e.g. wiring gate in microprocessor), we can state the amount of levels, which we should still take in consideration. This choice naturally has not influence on algorithm, but only on machine time and resultant size of compressed matrices. 4 Resultant automata 4.1 Coalescence There exist many methods for assembling matrices represented by finite automata. We will depict one of Figure 9. Two sample images with common parts and better ones, namely assembling resultant automata in the corresponding automata. direction from leaf node to the root. Suppose a couple of matrices and their representation by means of final state automata computed with procedure 1, depicted in Procedure Composition Automaton for Storage figure 9. For simplicity, black and white pictures with For given automaton A and automaton B - resultant resolution 8 × 8 pixel were used. Moreover, the states of automaton from previously composition compute new automata contain also the average greyness of resultant automaton B´ ∈ A∪B and combine similar corresponding picture parts. These values are in interval parts of both. 0 − 255, where 0 represents black and 255 white colour. Procedure Composition Automaton (Automaton A, Example: It is clear that images in figure 9 have some Stored automaton B) common parts, highlighted in following figure 10. Assign state qx from A to the corresponding state in These common parts could be joined into the same state stored automaton B. in composed resultant automata (see figure 10 on the If such case does not exists, assign a new state and bottom). The corresponding automata with images in take qx+1 from A. figure 9 have some common state and edge labelled end if with same symbol. There exists some similar or the foa all state of automaton A do equal word w depicting the way from the root to the if not exist edge from state qi labelled with corresponding part represented by a state. same word w as edge from correspond state in stored automaton then This principle can be used for no-loss compression. create a new edge labelled w to a new Additional information can be obtained from the state i structure of the resultant automaton, for example the otherwise information about the similarity of the stored matrices, take next edge common lines, and alternatively equal parts in matrix. end if if all edge from actual state is processed, take automaton has more states then the new states represent next state the differences between input matrices. With procedure end if Construct Automaton for Compression, we can control end for the type of acceptable (or unacceptable) changes, e.g. end procedure trivial differences in value, small noise, etc. With this, we can separate interesting changes from those that are rather to be ignored. For example, in medicine, we can obtain many similar images and only medical specialist can mark an interesting parts. We can help him or her to reduce the number of non-interesting changes and of course notice him or her to some small differences, which can be important, but hard to find. With procedure Composition Automaton for Storage, we can increase the ability to make the difference between interesting and non-interesting parts of the resultant automaton. We can even do it by involving the tolerance in value (average value), or similarity of words, etc. If we store our source matrix in more than one automaton, we can focus on our interesting part of the matrix and there compute the profundity automaton. On other part of matrix, we can compute automaton with less number of states. For this purpose, we can use the pattern matrix shown in table 1, where the values in cells are the counts of profundity of automaton, which represents that part of matrix. This principle can be used only for loss compression (e.g. images, signals, etc.). The part with less count of states stores much fewer information than the part with more states. It is clear, with this principle we can save much more space and also preserve high information value of our data. We can transfer only interesting part of matrix or any nearest part and save machine time or network capacity. It is sufficient to choose a state from resultant Figure 10. Composite automata from figure 9 automata, which represents our interesting part of the respecting the common parts. matrix, and operate with this as with the root. This principle is used with the principle automata The procedure Composition Automaton for Storage composition. Pattern may be arbitrary. stops for every automaton computed by a procedure Now we have a background for using finite state Construct Automaton for Compression or other similar automata as a database with included information. algorithm. Resultant automaton is in most cases smaller than original automata and contains interesting information from both automata. In some cases, the 3 3 3 3 3 3 resultant automaton could be larger, but if we will store 3 5 5 5 5 3 more and more matrices, the resultant automaton will 3 5 10 10 5 3 increase less and less. On the other side we can store 3 5 10 10 5 3 two (or more) almost same matrices in automaton with 3 5 5 5 5 3 almost non-increasing number of states. It is clear, 3 3 3 3 3 3 when we are storing matrices of the same domain (e.g. Table 1: Pattern of approach with 36 automata. medical super-sound radiograph, X-ray, pictures of building, el. signal, similar noise, etc.) we could obtain an interesting knowledge about stored matrices hand in 4.3 Representation hand with saving the memory space. In this chapter we introduce ideas how to represent Finite State Automata as a matrix, as a table for SQL 4.2 Interesting information in matrix database and as a XML structure. If we store some similar matrices in one automaton, 1) Matrix representation we can obtain interesting information about changes in Suppose automata as depicted in figure 5. Such resultant automaton. Concretely, if our resultant automata could be transformed into matrix with at least four columns. The first to four columns contain Table representation is very similar to matrix destination for each transition. We can append fifth representation. Instead of matrix, we use common table column with the source state. If the matrix is encoded to store all information needed to describe the then the average value is usually stored in the sixth automaton. We store simply the source state, average column. To support additional features we may store value, etc. in the table. Procedure Store Automaton to subsequent information but for basic functionality, it is Table is adequate to procedure Store Automaton to enough. The matrix itself is built using following Matrix then we describe only resultant table at Table 3. procedure: The first to fifth columns contain information about the state and the sixth column contains the average Procedure Store Automaton to Matrix value (in our example it contains the average colour of Input is resultant automaton from procedure Construct sub-square in image depicting the diminishing triangles, Automaton for Compression or other similar algorithm. where background of image is black colour denoted with 0 and other colour is only white - 255. Procedure Store Automaton to Matrix (automaton A) for all qi ∈ Q do 0 1 2 3 source a. value create new line in matrix and assign value i in T D D D D 96 source if appended S T T T 128 for all edge ej outgoing from qi do S S S S S 255 assign value i in column j * Table 3: Table representation of automaton in fig. 5 end for end for This representation is best for storing in SQL end procedure database particularly if we append additional interesting information in additional column (e.g. average value, where: Q is se of states of automaton A state depth, most similar state, etc.) witch emphasise j starts from 0 but 3 at most included benefit. It is very fine for indexing and * unused transitions leave empty facilitates manipulating very large resultant structure source - number of processed state (composed FSA). The matrix representation of the automata presently 3) XML representation used is straightforward encoding of states and This representation is very useful for using Finite transitions that could be supplied with additional State Automata as means to database with added information. interesting information (e.g. multimedia database, database of medical images, el. signals, trends, etc.) and For automaton in figure 5: is best for sharing resultant FSA on internet/intranet. It state D = 1 is very easy to manipulate with only a part of the T=2 automaton, in this structure. There are many approaches S = 3 and S is final state of how to solve the problem and we offer the on of the simpler. For more about XML please read [9]. 2 1 1 1 3 2 2 Example: The automaton representing a diminishing 3 3 3 3 triangles in figure 5 represented by simplified XML Table 2: Matrix representation structure. Simple resultant matrices without additional information contain only integral numbers from interval (0 - count of state). To reconstruct the source matrix from such representation is very simple and fast. We can use procedure Reconstruct Matrix above-mentioned or other similar procedure. This representation of FSA is useful for direct hardware-based processing or for running on machines with simple operating system. It is easy to convert it to any other representation and to manipulate with it. Computation costs no many machine time and space. Over all this advantages, the matrix representation is not so good for database manipulation (e.g. searching, updating, etc.). Example: The exemplary composed automaton For such purposes, there are some other represented by simplified XML structure. representations. 2) Table representation matrices (e.g. multimedia database) with the ability to search of interesting parts of our data. In our future work, we focus on manipulating with ... data in database (at various structures, e.g. table or XML), describe language or a set of tools that is able to query stored data. ... References [1] Alur, R. and Dill, D. L. A Theory of Timed ... Automata. In Theoretical Computer Science, 126(2):183–235, 1994. [2] Daniela BERARDI, Fabio DE ROSA, Luca DE SANTIS and Massimo MECELLA. Finite State Automata as Conceptual Model for E-Services. In Integrated Design and Process Technology, IDPT- ... 2003, June 2003. [3] K. Culik II and J. Kari. Image compression using 17:305–313, 1993. [4] K. Culik II and V. Valenta. Finite automata based ... compression of bi-level and simple color images. In ... Computers & Craphics, 21:61–68, 1997. ... [5] K. Culik II and J. Kari. Image compression Using Weighted Finite Automata, in Fractal Image Compression. In Theory a Techniques, Ed. Yuval Fisher, Springer Verlag, pp 243-258, 1994. [6] J.E.Hopcroft and J.D.Ullman. Introduction to ... automata theory, languages and computation. In Addison-Wesley, 1979. [7] Marian Mindek. Finite State Automata and Images. In WOFEX 2004, PhD Workshop, Ed. V. Snášel, ISBN: Where: 80-248-0596-0, 2004 X is average value of part of matrix [8] Marian Mindek. Finite State Automata and Image wC, C∈{0,1,2,3} is sub-square (pixel), if wC is not Recognition. In DATESO 2004, Ed. V. Snášel, J. present, then sub-square with symbol C is represented Pokorný, K. Richta, pp 132-143, ISBN: 80-248-0457-3, itself (with the state witch represent - recursively) 2004 S is value: 0 - not present, 1 - present [9] W3C (2004) XML Protocol. XML Protocol Web Page (link checked January, 10th 2005): In this structure we can store every automata http://www.w3.org/XML/ computed by procedure Construct Automaton for Compression or Composition Automaton for Storage or other similar algorithm. Final state can obtain other information e.g. following automaton increasing deep. 5 Conclusions and future work In this paper we have presented a Finite State Automata for storing and compressing data represented as a matrix. FSA allow us to capture a large class of data represented as matrices and obtaining interesting information without using other algorithm to compute it. Additionally, data stored in FSA save more space, make it possible to access and manipulate with them without the decompressing process. Storing data in automaton allows us to send via network only the part of data of our interest. We do not need to have available the whole matrix (e.g. picture) if we are interested in a small part of it only. Additional benefit is included.