Rough Set Flow Graphs and Ant Based Clustering in Classification of Disturbed Periodic Biosignals Krzysztof Pancerz1 , Arkadiusz Lewicki1 , Ryszard Tadeusiewicz2 , and Jan Warchol3 1 University of Information Technology and Management Sucharskiego Str. 2, 35-225 Rzeszów, Poland {kpancerz,alewicki}@wsiz.rzeszow.pl 2 AGH University of Science and Technology Mickiewicza Av. 30, 30-059 Krakow, Poland rtad@agh.edu.pl 3 Medical University of Lublin Jaczewskiego Str. 4, 20-090 Lublin, Poland jan.warchol@umlub.pl Abstract. In the paper, we are interested in classification of disturbed periodic biosignals. An ant based clustering algorithm is used to group episodes into which examined biosignals are divided. Disturbances in periodicity of such signals cause some difficulties in formation of coherent clusters of similar episodes. A quality of a clustering process result can be used as an indicator of disturbances. A local function in the applied clustering algorithm is calculated on the basis of temporal rough set flow graphs representing an information flow distribution for episodes. Keywords: rough set flow graphs, ant based clustering, biosignals 1 Introduction An increasing interest is now evident in classification and clustering for time series and signals using a variety of methodologies (cf. [5]). One of the frequently examined problems concerns analysis of biosignals (cf. [6], [12]). In our research, we consider a problem of classification of voice signals in order to detect some disturbances indicating the possible presence of laryngeal pathologies. Periodicity is one of the main features of some biosignals (e.g., voice, ECG). However, in case of some diseases, this periodicity can be disturbed. Especially, it is expressed by different shapes in selected time windows corresponding to periods of biosignals. The main idea of the proposed approach is based on recog- nition of temporal patterns and their replications in a selected fragment of the signal being examined. In case of some disturbances, temporal patterns cannot be found. This problem has been considered by us in examination of a voice signal for a non-invasive diagnosis of selected larynx diseases. We have used dif- ferent approaches to solve this problem, for example, recurrent neural networks [13], [14], [15], mining unique episodes in temporal information systems [10], ant based clustering with similarity measures [9]. In this paper, an ant based clustering algorithm working on the basis of tem- poral rough set flow graphs is proposed. Rough set flow graphs [11] introduced by Z. Pawlak are a useful tool for the knowledge representation. We use them as a tool for representing the knowledge of transitions between consecutive samples of examined signals. A quality of a clustering process result can be used as an indicator of disturbances in periodicity of biosignals. If episodes included in time windows corresponding to periods of the examined signal are similar (there is a lack of non-natural disturbances), then they should be grouped into very co- hesive clusters. If significant replication disturbances in time appear, then time windows are grouped in several clusters or they are scattered on the grid with- out any distinct groups. To provide such a classification ability, we use a special algorithm for clustering a set of well categorized objects, originally proposed by us in [8]. A set of well categorized objects is characterized by a high similarity of objects within classes and a relatively high dissimilarity of objects between different classes. The algorithm is based on versions of ant based clustering al- gorithms proposed earlier by Deneubourg, Lumer and Faieta as well as Handl et al. Temporal rough set flow graphs representing an information flow distribu- tion for episodes are used in the clustering algorithm for calculation of the local function. 2 Episode Information Systems and Temporal Rough Set Flow Graphs In this section, we recall the basic concepts concerning information systems and rough set flow graphs which are crucial to understand the approach proposed in the paper as well as we introduce a notion of episode information systems and show how to create rough set flow graphs for such systems. An information system is a pair S = (U, A), where U is a set of objects, A is a set of attributes, i.e., a : U → Va for a ∈ A, where Va is called a value set of a. Any information system can be represented as a data table, whose columns are labeled with attributes, rows are labeled with objects, and entries of the table are attribute values. In our approach, we will use information systems to represent time series data. Biosignals recorded in electronic devices have a digital form and they can be treated as time series, i.e., sequences of signal samples. Each consecutive sample represents a value of the signal at a given time instant. Therefore, we assume that a set of attributes in an information system is ordered in time, i.e., A = {at : t = 1, 2, . . . , n}, where at is the attribute determining values of the signal at time instant t. Each object in such a system is said to be an episode and the whole system will be called an episode information system and denoted by Se . An example of the episode information system Se = (E, A), where E is a nonempty finite set of episodes and A is a nonempty finite set of attributes ordered in time, is shown in Figure 1. This system includes five episodes of real voice signals and each episode consists of five consecutive samples. Table 1. An example of the episode information system Se E/A a1 a2 a3 a4 a5 e1 -9362.00 -10168.00 -11222.00 -11749.00 -12400.00 e2 -9734.00 -10261.00 -11346.00 -12090.00 -12524.00 e3 -10261.00 -11160.00 -11656.00 -12524.00 -13268.00 e4 -9672.00 -10664.00 -11191.00 -11718.00 -12400.00 e5 -8990.00 -9517.00 -10013.00 -10664.00 -10912.00 Rough set flow graphs have been defined by Z. Pawlak [11] as a tool for rea- soning from data. A flow graph is a directed, acyclic, finite graph G = (N, B, σ), where N is a set of nodes, B ⊆ N × N is a set of directed branches and σ : B → [0, 1] is a flow function. An input of a node x ∈ N is the set I(x) = {y ∈ N : (y, x) ∈ B}, whereas an output of a node x ∈ N is the set O(x) = {y ∈ N : (x, y) ∈ B}. σ(x, y) is called a strength of a branch (x, y) ∈ B. We define the input and the output of the graph G as I(G) = {x ∈ N : I(x) = ∅} and O(G) = {x ∈ N : O(x) = ∅}, respectively. I(G) and O(G) consist of external nodes of G. The remaining nodes of G are its internal nodes. With each node x ∈ N we associate its inflow δ+ (x) and outflow δ− (x) defined by: P – δ+ (x) = σ(y, x), y∈I(x) P – δ− (x) = σ(x, y). y∈O(x) For each node x ∈ N , its throughflow δ(x) is defined as follows:   δ− (x) if x ∈ I(G), δ(x) = δ+ (x) if x ∈ O(G), (1) δ− (x) = δ+ (x) otherwise.  With each branch (x, y) ∈ B we may also associate its certainty cer(x, y) = σ(x,y) δ(x) , where δ(x) 6= 0. A directed path [x . . . y] between nodes x and y in G, where x 6= y, is a sequence of nodes x1 , x2 , . . . , xn such that x1 = x, xn = y and (xi , xi+1 ) ∈ B, where 1 ≤ i ≤ n − 1. For each path [x1 . . . xn ], we define its certainty as n−1 Q cer[x1 . . . xn ] = cer(xi , xi+1 ). In our approach, we will use different operators i=1 (not only product) for calculating a certainty of a given path. To represent an information flow distribution in an episode information sys- tem we can use rough set flow graphs. A very similar problem has been considered Algorithm 1: Algorithm for creating a temporal rough set flow graph corre- sponding to an episode information system Input : An episode information system Se = (E, A), where A = {at : t = 1, 2, . . . , n}. Output: A temporal rough set flow graph G = (N, B, σ, cer) corresponding to Se . N ←− ∅; B ←− ∅; for each i = 1, 2, . . . , n do Ni ←− ∅; for each attribute value v of ai do Create a node nv representing v and add it to Ni ; end N ←− N ∪ Ni ; end for each i = 1, 2, . . . , n − 1 do for each node nv ∈ Ni do for each node nw ∈ Ni+1 do Create a branch b = (nv , nw ); card(E(ai ,v)∩E(ai+1 ,w)) σ(b) ←− card(E) ; card(E(a ,v)∩E(a i i+1 ,w)) cer(b) ←− card(E(ai ,v)) ; B ←− B ∪ {b}; end end end in [4]. The algorithm presented in this section is modeled on that presented in [4]. Let Se = (E, A), where A = {at : t = 1, 2, . . . , n}, be an episode informa- tion system. A rough set flow graph G corresponding to Se consists of n layers. Nodes in the i-th layer of G represent values determined by the attribute ai , where i = 1, 2, . . . , n. Since the attribute set A is ordered in time, we can call the graph G as a temporal rough set flow graph. It represents temporal flow distribution in an episode information system. In order to construct a temporal rough set flow graph G corresponding to an episode information system Se we may perform Algorithm 1. A graph constructed using this algorithm is supple- mented with a certainty function which assigns a number called certainty from the interval [0, 1] to each branch in G. Hence, we have G = (N, B, σ, cer), where N is a set of nodes, B is a set of directed branches, σ : B → [0, 1] is a flow func- tion, and cer : B → [0, 1] is a certainty function. Certainty factors of branches will be used in our approach presented here. Let Se = (E, A), where A = {at : t = 1, 2, . . . , n} be an episode information system. By E(ai , v) we denote the set of all episodes in E for which the attribute ai has the value v. 3 Ant Based Clustering Algorithm Ant based clustering of time series was considered by us in [7]. Next, the approach presented there has been modified in our last work [8]. Here, we briefly remind this approach. It is based on algorithms proposed earlier by Deneubourg [1], Lumer and Faieta [3] as well as Handl et al. [2] with some our modifications. A set of steps is formally listed in Algorithm 2. In this algorithm: – E is a set of episodes being clustered (the size of E is n), – N is a number of iterations performed for the clustering process, – Ants is a set of ants used in the clustering process, – ppick (e) and pdrop (e) are probabilities of the picking and dropping operations made for a given episode e, respectively (see Formulas 3 and 4). Let P be a set of all places of the square grid G on which objects are scattered. Each place p of G is described by two coordinates, i and j, written as p(i, j). The neighborhood π(p) of p(i, j), where a given object is foreseen to be dropped or whence it is foreseen to be picked up, is defined as a square surrounding p(i, j), i.e.: π(p) = {p0 (i0 , j 0 ) ∈ P : abs(i0 − i) <= r and abs(j 0 − j) <= r}, where r is a radius of perception of ants and abs denotes the absolute value. For the neighborhood π(p), a local episode information system Se (p) is cre- ated. Se (p) includes all episodes placed in π(p). Next, on the basis of Se (p), a temporal rough set flow graph T RSF G(p) is built using Algorithm 1. T RSF G(p) serves as a base for calculating a local function value. Let Se (p) = (E, A), where A = {at : t = 1, 2, . . . , n}, be a local episode information system and e∗ be the episode designated to pick up from or to drop at the place p. If e∗ is described by the following attribute values: a1 (e∗ ) = v1 , a2 (e∗ ) = v2 , ..., an (e∗ ) = vn , then a local function floc (e∗ ) is calculated as: n−1 floc (e∗ ) = Op cer(nivi , ni+1 vi+1 ), (2) i=1 where: – Op is an aggregation operator, for example, median, arithmetic average, average weighted with different ways, etc., – nivi is the node in the i-th layer of T RSF G(p) corresponding to value vi , – ni+1 vi+1 is the node in the (i + 1)-th layer of T RSF G(p) corresponding to value vi+1 . It is easy to see that a local function is calculated as the aggregation of certainty factors of branches belonging to the path in T RSF G(p) determined by attribute values of the episode e∗ . To avoid forming smaller clusters of very similar episodes, the threshold den- sity minDens is used. The density dens(e) of the neighborhood π(p) for the episode e designated to pick up from or to drop at the place p is calculated as a ratio of a number of all episodes placed in the neighborhood π(p) to a number of all places in the neighborhood π(p). Picking and dropping decisions for the episode e can be formally expressed by the following threshold formulas: ( 1 if floc (e) <= ϑpick sim , ppick (e) = 1 2 (3) pick 2 (floc (e) − 1) otherwise (1−ϑ ) sim and ( 1 if floc (e) >= ϑdrop sim , pdrop (e) = 1 2 (4) drop 2 floc (e) otherwise. (ϑ ) sim Thresholds ϑpick drop sim and ϑsim for picking and dropping operations, respectively, are fixed. These values are selected experimentally for given sets of signals. 4 Classification Procedure It was mentioned in the introduction that we are interested in periodical biosig- nals. The classification procedure proposed in this paper is as follows. We divide a given fragment of a periodic biosignal into time windows. Each time window represents one period of the signal. A signal included in one time window is called an episode. The set of episodes is used in the clustering process. In the experiments, voice signals collected by J. Warchol [16] were examined. Examples of two voice signals, divided into episodes are shown in Figures 1 and 2, respectively. The first signal does not include disturbances of periodicity whereas the second one includes them. For comparison, we also present a pure sine signal (see 3) divided into episodes. Before clustering, we apply some pre-processing procedures: 1. Values of signal samples are normalized to the interval [−1.0, 1.0]. 2. Each episode is transformed into the so-called delta representation, i.e., val- ues of samples have been replaced with differences between values of current samples and values of previous samples. After transformation, each episode is a sequence consisting of three values: -1 (denoting decreasing), 0 (denot- ing a lack of change), 1 (denoting increasing) (cf. [7]). This transformation enables us to obtain a multistage decision transition system with discrete values. Experiments for the ant based clustering have been performed with the fol- lowing parameters: – grid size: 100 × 100, – no. of ants: 5, – radius of perception: 2, – no. of iterations: 10000, – aggregation operator : arithmetic average. Algorithm 2: Algorithm for Ant Based Clustering of Episodes for each episode ei ∈ E do Place ei randomly on a grid G; Set ei as dropped; end for each ant aj ∈ Ants do Place aj randomly on a grid place occupied by one of episodes from E; Set aj as unladen; workT ime(aj ) ← 0; end for k ← 1 to N do for each ant aj ∈ Ants do if aj is unladen then if place of aj is occupied by episode e then Draw a random real number r ∈ [0, 1]; if dens(e) < minDens or r ≤ ppick (e) then set e as picked; set aj as carrying the episode; else move aj randomly to another place occupied by one of episodes from E; end else move aj randomly to another place occupied by one of episodes from E; end else Draw a random real number r ∈ [0, 1]; if r ≤ pdrop (e) then move e carried by aj randomly to a new place on a grid; set e as dropped; set aj as unladen; workT ime(aj ) ← 0; else workT ime(aj ) ← workT ime(aj ) + 1; end end if workT ime(aj ) > maxW orkT ime then set e carried by aj as dropped; set aj as unladen; move aj randomly to another place occupied by one of episodes from E; workT ime(aj ) ← 0; end end increase minDens; end Fig. 1. Selected episodes of the signal without disturbances of periodicity Fig. 2. Selected episodes of the signal with disturbances of periodicity After a clustering process, we expect to have two distinguishable situations. If episodes are similar (there are no disturbances), then they are grouped into very cohesive clusters. If significant replication disturbances appear, then episodes are grouped in several dispersed clusters or they are scattered on the grid. A result of clustering is an indicator used to classify the examined biosignal. Exemplary results of ant based clustering, for the signals without and with disturbances of periodicity, are shown in Figures 4 and 5, respectively. For com- Fig. 3. Selected episodes of the pure sin signal parison, we also present a result of ant based clustering for a pure sine signal (see Figure 6). Fig. 4. The result of clustering the signal without disturbances of periodicity 5 Conclusions In the paper, the first attempt to application of ant based clustering based on temporal rough set flow graphs for classification of disturbed periodic biosignals Fig. 5. The result of clustering the signal with disturbances of periodicity Fig. 6. The result of clustering the pure sin signal has been presented. At the beginning, we were interested in simple classification, i.e., periodicity of the examined signal is disturbed or no. In the future, we plan to make more detailed analysis of results of a clustering process. This should give us additional information about the scale and character of disturbances. Acknowledgments This paper has been partially supported by the grant No. N N519 654540 from the National Science Centre in Poland. References 1. Deneubourg, J., Goss, S., Franks, N., Sendova-Franks, A., Detrain, C., Chrétien, L.: The dynamics of collective sorting: Robot-like ants and ant-like robots. In: Proceedings of the First International Conference on Simulation of Adaptive Be- haviour: From Animals to Animats 1. pp. 356–365. MIT Press, Cambridge, MA (1991) 2. Handl, J., Knowles, J., Dorigo, M.: Ant-based clustering and topographic mapping. Artificial Life 12(1), 35–62 (2006) 3. Lumer, E., Faieta, B.: Diversity and adaptation in populations of clustering ants. In: Proceedings of the Third International Conference on Simulation of Adaptive Behaviour: From Animals to Animats 3. pp. 501–508. MIT Press, Cambridge, MA (1994) 4. Matusiewicz, Z., Pancerz, K.: Rough set flow graphs and max - * fuzzy relation equations in state prediction problems. In: Chan, C.C., Grzymala-Busse, J.W., Ziarko, W. (eds.) Rough Sets and Current Trends in Computing. Lecture Notes in Computer Science, vol. 5306, pp. 359–368. Springer-Verlag, Berlin Heidelberg (2008) 5. Mitsa, T.: Temporal Data Mining. CRC Press (2010) 6. Nait-Ali, A. (ed.): Advanced Biosignal Processing. Springer-Verlag, Berlin Heidel- berg (2009) 7. Pancerz, K., Lewicki, A., Tadeusiewicz, R.: Ant based clustering of time series discrete data - a rough set approach. In: Panigrahi, B.K., et al. (eds.) Swarm, Evolutionary, and Memetic Computing, Lecture Notes in Computer Science, vol. 7076, pp. 645–653. Springer-Verlag, Berlin Heidelberg (2011) 8. Pancerz, K., Lewicki, A., Tadeusiewicz, R.: Ant based clustering of two-class sets with well categorized objects. In: Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R. (eds.) Advances in Computational Intelli- gence, Communications in Computer and Information Science, vol. 299, pp. 241– 250. Springer-Verlag, Berlin Heidelberg (2012) 9. Pancerz, K., Lewicki, A., Tadeusiewicz, R., Szkola, J.: Classification of speech signals through ant based clustering of time series. In: Computational Collective Intelligence Technologies and Applications. Lecture Notes in Artificial Intelligence, Springer-Verlag, Berlin Heidelberg (2012), to appear 10. Pancerz, K., Paja, W., Wrzesień, M., Warchol, J.: Classification of voice signals through mining unique episodes in temporal information systems: A rough set approach. In: Proceedings of the 21th international Workshop on Concurrency, Specification and Programming (CS&P 2012). Berlin, Germany (2012), to appear 11. Pawlak, Z.: Flow graphs and data mining. In: Peters, J., Skowron, A. (eds.) Trans- actions on Rough Sets III, pp. 1–36. Springer-Verlag, Berlin Heidelberg (2005) 12. Semmlow, J.: Biosignal and Medical Image Processing. CRC Press (2009) 13. Szkola, J., Pancerz, K., Warchol, J.: Computer-based clinical decision support for laryngopathies using recurrent neural networks. In: Hassanien, A., et al. (eds.) Proc. of the ISDA’2010. pp. 627–632. Cairo, Egypt (2010) 14. Szkola, J., Pancerz, K., Warchol, J.: Computer diagnosis of laryngopathies based on temporal pattern recognition in speech signal. Bio-Algorithms and Med-Systems 6(12), 75–80 (2010) 15. Szkola, J., Pancerz, K., Warchol, J.: Recurrent neural networks in computer-based clinical decision support for laryngopathies: An experimental study. Computational Intelligence and Neuroscience 2011 (2011), article ID 289398 16. Warchol, J.: Speech Examination with Correct and Pathological Phonation Using the SVAN 912AE Analyser (in Polish). Ph.D. thesis, Medical University of Lublin (2006)