
Extraction of Classification Rules from Sequences of Crystal Growth Data

Radek Buša

Yann Dauxais

Stefan Ecklebe

Natasha Dropka

Martin Holeňa

0 Faculty of Information Technology, Czech Technical University, Thákurova 9, Prague, Czech Republic
1 Institute of Computer Science, Czech Academy of Sciences, Prague, Czech Republic
2 Institute of Control Theory, TU Dresden, Georg-Schumann-Str. 7a, Dresden, Germany
3 KU Leuven, Celestijnenlaan 200a, Leuven, Belgium
4 Leibniz-Institut für Kristallzüchtung, Max-Born-Str. 2, Berlin, Germany
5 Leibniz Institute for Catalysis, Albert-Einstein-Str. 29a, Rostock, Germany

The paper presents a generalization of discriminant chronicles mining, a data mining method that extracts classification rules for sequences of events. The generalization is motivated by the objective to extract classification rules from crystal growth data, for which the original method needs to be extended to events with vectors of attributes and to real-valued attributes. The paper elaborates how both extensions are incorporated into the theoretical fundamentals of the original method, and describes a corresponding modification of a system for discriminant chronicles mining, which was developed three years ago to implement the original method. Finally, an application of the generalized method, using the modified system, to data from the growth of GaAs crystals by the vertical gradient freeze method is briefly sketched.

This paper deals with data mining of crystal growth data, obtained either experimentally or from simulations. Such data records the crystal growth process, its performance, and conditions in the melt, such as temperatures at various control points, the power of heaters, or parameters of the magnetic fields influencing melt convection [6, 7]. In particular, we consider the common situation that the performance data indicates whether the crystal growth process can be classified as satisfactory according to a given criterion, e.g., according to the shape or position of the solid/liquid interface. Hence, the primary data mining approach to that data is the extraction of classification rules.

Although a plethora of methods for classification rules extraction exist [10, 11], most of them cannot be used for our data. The reason is that crystal growth proceeds sequentially, hence the data is inherently sequential. Therefore, we have chosen a specific method that extracts classification rules for the classification of sequences of events, which was proposed in [3]. It is called “discriminant chronicles mining” because it was originally developed for events described with attributes conveying a temporal meaning. However, it cannot be directly applied to crystal growth data, due to the following two restrictions: (i) The events in [3] are described with scalar values of the temporal attribute. On the other hand, members of sequences of crystal growth data, which we will for simplicity also call events, are described with vectors of attribute values. (ii) The temporal attribute describing events in [3] has a finite number of values, thus it can be represented by a finite subset of integers. On the other hand, attributes describing events in sequences of crystal growth data are real-valued.

Therefore, we have extended the method from [ 3 ] to sequences of events described by real-valued vector attributes. This extension is the main contribution of the paper.

The next section briefly recalls the original method proposed in [3]. Its extension removing the restrictions (i) and (ii) above is described in Section 3. Finally, an application of the proposed method to crystal growth data is sketched in Section 4.

2 Classification Rules Extraction with Discriminant Chronicles

Let $\mathbb{E}$ be a finite set, the elements of which are called event types, and let $\mathbb{T}$ be an arbitrary subset of the extended reals, $\mathbb{T} \subseteq \overline{\mathbb{R}}$. For a multiset of $m$ event types, the rather unusual notation $\{\{e_1, \dots, e_m\}\}$ has been introduced in [3], and a couple $(e, t) \in \mathbb{E} \times \mathbb{T}$ is called an event.

Assume further that some ordering $\preceq$ is imposed on the event types in the application domain. In [3], where temporal relationships are investigated, a total ordering corresponding to their order of occurrence is considered. For the rules extraction method proposed in Section 3, however, the weaker concept of a partial ordering is sufficient, with semantics tailored to a particular application (see Section 4 for the real-world application considered in this paper). For $e_1, e_2 \in \mathbb{E}$ and $t^-, t^+ \in \overline{\mathbb{R}}$ such that $e_1 \preceq e_2$ and $t^- \le t^+$, a temporal constraint is a tuple $(e_1, e_2, t^-, t^+)$, also denoted $e_1[t^-, t^+]e_2$. The semantics of such a temporal constraint is as follows: the difference between the timestamp $t_2$ of an event $(e_2, t_2)$ of type $e_2$ and the timestamp $t_1$ of an event $(e_1, t_1)$ of type $e_1$ fulfills $t^- \le t_2 - t_1 \le t^+$. A temporal constraint $e_1[t^-, t^+]e_2$ is said to be satisfied by a couple of events $((e, t), (e', t'))$ if $e_1 = e$, $e_2 = e'$ and $t' - t \in [t^-, t^+]$. Because constraining two events to occur exactly a fixed duration apart is too strict for most applications, the simplest way to represent temporal constraints is by duration intervals. Such an interval can be interpreted as two constraints defining the lowest and the highest accepted duration, respectively.

Using a set $\mathbb{E}$ of event types and a set $\mathcal{T}$ of temporal constraints, two complementary concepts can be introduced:

(i) If we are interested in finding events that pairwise satisfy a given set $\mathcal{T}$ of temporal constraints, then the concept of a simple temporal constraint network [12], alternatively called a simple temporal problem [4], is useful. It can be defined as the triple

$$(\mathbb{E}, \preceq, \mathcal{T}), \text{ where } \mathcal{T} \text{ is a set of temporal constraints } e[l, u]e' \text{ such that } e, e' \in \mathbb{E},\ e \preceq e'. \tag{1}$$

(ii) If we are interested in mining temporal constraints from given sequences of events, then the concept of a chronicle [3, 9] is useful. It can be defined as the couple

$$(\mathcal{E}, \mathcal{T}), \text{ where } \mathcal{E} = \{\{e_1, \dots, e_m\}\},\ e_i \in \mathbb{E}, \text{ and } \mathcal{T} \text{ is a set of temporal constraints } e[l, u]e' \text{ such that } (\exists i, j = 1, \dots, m)\ i \ne j \ \&\ e_i \preceq e_j \ \&\ e = e_i \ \&\ e' = e_j. \tag{2}$$

A chronicle is a temporal extension of the episodes, or partial orders, introduced in [8], a type of pattern dedicated to summarizing sequential data. Chronicles have proven useful in applications where the temporal dimension is needed to differentiate two behaviors. Their first application was to alarm log data [5], where the temporal distance between two alarm events is very important.

Observe that the constraint $e[-\infty, \infty]e'$ holds for any $e, e' \in \mathbb{E}$, which means that it actually does not constrain anything. If no constraint from $\mathcal{T}$ constrains anything, the set $\mathcal{T}$ is denoted $\mathcal{T}_\infty$. Hence, $(\mathcal{E}, \mathcal{T}_\infty)$ is a chronicle for whose constraints $e[-\infty, \infty]e' \in \mathcal{T}_\infty$ it is only required that

$$(\exists i, j = 1, \dots, m)\ i \ne j \ \&\ e_i \preceq e_j \ \&\ e = e_i \ \&\ e' = e_j. \tag{3}$$

Because the research reported in this paper concerns data mining, it relies on the latter concept, as well as on several additional concepts concerning chronicles.

Let $m, n \in \mathbb{N}$, $2 \le m \le n$, let $C = (\{\{e'_1, \dots, e'_m\}\}, \mathcal{T})$ be a chronicle and let $s = ((e_1, t_1), \dots, (e_n, t_n))$ be a sequence of events. An occurrence of $C$ in $s$ is a subsequence $\tilde{s} = ((e_{f(1)}, t_{f(1)}), \dots, (e_{f(m)}, t_{f(m)}))$ of $s$ such that: (i) $f: \hat{m} \to \hat{n}$ is an injective function, where $\hat{k}$ denotes the index set $\{1, \dots, k\}$; (ii) $e'_i = e_{f(i)}$, $i = 1, \dots, m$; (iii) if $i \ne j$ and $e'_i \preceq e'_j$, then $t_{f(j)} - t_{f(i)} \in [a, b]$, where $e'_i[a, b]e'_j \in \mathcal{T}$.

We say that C occurs in s if there exists at least one occurrence of C in s.
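The occurrence conditions (i)-(iii) can be illustrated by a brute-force check. The following is only a minimal sketch, not the search strategy of [3]: the function name `occurs` and the data layout are assumptions made for illustration, and constraints are keyed by index pairs of the chronicle's event types rather than by the types themselves, so that repeated types in the multiset stay distinguishable.

```python
from itertools import permutations


def occurs(chronicle_types, constraints, sequence):
    """Return True if the chronicle has at least one occurrence in the sequence.

    chronicle_types: list of event types e'_1, ..., e'_m (the multiset as a list)
    constraints:     dict mapping an index pair (i, j) to an interval (a, b),
                     meaning t_f(j) - t_f(i) must lie in [a, b]
    sequence:        list of (event_type, timestamp) pairs
    """
    m, n = len(chronicle_types), len(sequence)
    # Every m-permutation of positions is one injective map f (condition (i)).
    for f in permutations(range(n), m):
        # Condition (ii): the event types must match position by position.
        if any(sequence[f[i]][0] != chronicle_types[i] for i in range(m)):
            continue
        # Condition (iii): every temporal constraint must hold.
        if all(a <= sequence[f[j]][1] - sequence[f[i]][1] <= b
               for (i, j), (a, b) in constraints.items()):
            return True
    return False
```

For instance, with the chronicle types `["A", "B"]` and the single constraint `{(0, 1): (1, 3)}`, the sequence `[("A", 0), ("C", 1), ("B", 2)]` contains an occurrence, whereas `[("A", 0), ("B", 5)]` does not. The enumeration is exponential in $m$ and is meant only to make the definition concrete.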

Let further S be a set of sequences. The support of a chronicle C in S is the number of sequences from S in which it occurs:

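$$\mathrm{supp}(C, S) = \#\{s \in S \mid C \text{ occurs in } s\}. \tag{4}$$

If

$$\mathrm{supp}(C, S) \ge s_{\min} \tag{5}$$

for a given $s_{\min} > 0$, or equivalently

$$\frac{\mathrm{supp}(C, S)}{\#S} \ge f_{\min} \tag{6}$$

for a given $f_{\min} = \frac{s_{\min}}{\#S}$, then $C$ is called frequent in $S$ on the level $f_{\min}$.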

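Finally, let $S^+$ and $S^-$ be two disjoint sets of sequences. The growth rate of $C$ for $S^+$ with respect to $S^-$ is defined as

$$g(C, S) = \begin{cases} \dfrac{\mathrm{supp}(C, S^+)}{\mathrm{supp}(C, S^-)} & \text{if } \mathrm{supp}(C, S^-) > 0, \\ +\infty & \text{if } \mathrm{supp}(C, S^-) = 0. \end{cases} \tag{7}$$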

If $C$ is frequent and $g(C, S) \ge g_{\min}$ for a given minimal growth rate $g_{\min} \ge 1$, then $C$ is called discriminant for $S^+$ with respect to $S^-$ on the level $g_{\min}$.
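The definitions of support, frequency and growth rate translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the occurrence test is passed in as a predicate `occurs_in` so that the block stays independent of any particular chronicle representation, and frequency is checked on $S^+$, which is an assumption, since the text above only says that $C$ is frequent.

```python
import math


def support(occurs_in, sequences):
    """Number of sequences in which the chronicle occurs."""
    return sum(1 for s in sequences if occurs_in(s))


def is_frequent(occurs_in, sequences, f_min):
    """Relative support at least f_min; `sequences` must be non-empty."""
    return support(occurs_in, sequences) / len(sequences) >= f_min


def growth_rate(occurs_in, s_plus, s_minus):
    """Ratio of supports in S+ and S-, or +infinity if the support in S- is 0."""
    negative = support(occurs_in, s_minus)
    if negative == 0:
        return math.inf
    return support(occurs_in, s_plus) / negative


def is_discriminant(occurs_in, s_plus, s_minus, f_min, g_min):
    """Frequent in S+ (assumption) and growth rate at least g_min."""
    return (is_frequent(occurs_in, s_plus, f_min)
            and growth_rate(occurs_in, s_plus, s_minus) >= g_min)
```

As a toy usage, take `occurs_in = lambda s: "A" in s`, `s_plus = [["A", "B"], ["A"], ["C"]]` and `s_minus = [["A"], ["B"], ["B"], ["C"]]`: the supports are 2 and 1, so the growth rate is 2.0.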

The sequence sets $S^+$ and $S^-$ can be viewed as two classes of their union $S = S^+ \cup S^-$. Hence, the algorithm DCM for discriminant chronicles mining presented in [3] is actually a sophisticated algorithm for the extraction of classification rules. Before searching for frequent chronicles satisfying some temporal constraints, it searches for frequent chronicles with the set of temporal constraints $\mathcal{T}_\infty$, which is equivalent to the extraction of classification rules without temporal constraints. To this end, any rules extraction algorithm can be used. In the implementation of DCM in [3], the algorithm Ripper [2], based on the minimal description length principle, has been employed.

3 Proposed Rules Extraction Method

3.1 Discriminant Multi-dimensional Chronicles

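Let $d, n \in \mathbb{N}$, $d, n \ge 2$. For a set $\mathbb{E}$ of event types with a partial ordering $\preceq$, a set $\mathbb{T} \subseteq \overline{\mathbb{R}}^d$ and a label set $\mathbb{L} = \{+, -\}$, an event is a couple $(e, t)$, where $e \in \mathbb{E}$ and $t \in \mathbb{T}$, and a labelled sequence of events is defined as a tuple $(SID, (e_1, t_1), \dots, (e_n, t_n), L)$, where $SID \in \mathbb{N}$ is a sequence index, unique among all considered labelled sequences of events, $(e_1, t_1), \dots, (e_n, t_n)$ are events, and $L \in \mathbb{L}$.

Let $\vec{a} = (a_1, \dots, a_d)$, $\vec{b} = (b_1, \dots, b_d)$, $\vec{c} = (c_1, \dots, c_d) \in \mathbb{R}^d$, with $b_i \ge a_i$, $i = 1, \dots, d$, and $R(\vec{a}, \vec{b}) = [a_1, b_1] \times [a_2, b_2] \times \dots \times [a_d, b_d]$. The relation $\vec{c} \in R(\vec{a}, \vec{b}) \iff \forall i \in \hat{d}: c_i \in [a_i, b_i]$, where $\hat{d} = \{1, \dots, d\}$, will be called the hyperrectangle test. A hyperrectangle constraint is a tuple $(e_1, e_2, \vec{t}_1, \vec{t}_2)$, also denoted $e_1[[\vec{t}_1, \vec{t}_2]]e_2$, where $e_1, e_2 \in \mathbb{E}$ and $\vec{t}_1, \vec{t}_2 \in \overline{\mathbb{R}}^d$. A hyperrectangle constraint $e_1[[\vec{t}_1, \vec{t}_2]]e_2$ is said to be satisfied by a couple of events $((e, \vec{t}), (e', \vec{t}'))$ if and only if $e = e_1$ & $e' = e_2$ & $\vec{t}' - \vec{t} \in R(\vec{t}_1, \vec{t}_2)$.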
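The hyperrectangle test and the satisfaction of a hyperrectangle constraint amount to a componentwise interval check on the difference of two attribute vectors. The sketch below uses hypothetical names and a plain tuple layout; it assumes that all vectors have the same dimension $d$ and is not taken from the implementation in [1].

```python
def in_hyperrectangle(c, a, b):
    """Hyperrectangle test: every c_i lies in [a_i, b_i]."""
    return all(ai <= ci <= bi for ci, ai, bi in zip(c, a, b))


def satisfies(constraint, first_event, second_event):
    """Check a hyperrectangle constraint (e1, e2, t1, t2) on a couple of events."""
    e1, e2, t1, t2 = constraint
    (e, t), (e_prime, t_prime) = first_event, second_event
    # Componentwise difference t' - t of the two attribute vectors.
    diff = [y - x for x, y in zip(t, t_prime)]
    return e == e1 and e_prime == e2 and in_hyperrectangle(diff, t1, t2)
```

Infinite bounds need no special handling: with `float("-inf")` and `float("inf")` as bounds in every dimension, the test accepts any difference vector, which corresponds to a constraint that does not constrain anything.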

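A multi-dimensional chronicle is a couple $(\mathcal{E}, \mathcal{T})$ such that $\mathcal{E} = \{\{e_1, e_2, \dots, e_m\}\}$, $e_i \in \mathbb{E}$, $i \in \hat{m}$, is a multiset of event types and $\mathcal{T} = \{e_1[[\vec{t}_1, \vec{t}_2]]e_2 \mid e_1, e_2 \in \mathcal{E},\ e_1 \preceq e_2\}$ is a set of hyperrectangle constraints. If, in particular, all its constraints are $e[[(-\infty, \dots, -\infty), (\infty, \dots, \infty)]]e'$, i.e., they do not constrain anything, then this $\mathcal{T}$ is again denoted $\mathcal{T}_\infty$:

$$\mathcal{T}_\infty = \{e[[(-\infty, \dots, -\infty), (\infty, \dots, \infty)]]e' \mid (\exists i, j \in \hat{m})\ i \ne j \ \&\ e_i \preceq e_j \ \&\ e = e_i \ \&\ e' = e_j\}. \tag{8}$$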

Let $s = ((e_1, \vec{t}_1), \dots, (e_n, \vec{t}_n))$ be a sequence of events, $m \in \hat{n}$ and $C = (\mathcal{E} = \{\{e'_1, e'_2, \dots, e'_m\}\}, \mathcal{T})$ be a multi-dimensional chronicle. An occurrence of the multi-dimensional chronicle $C$ in $s$ is a subsequence $\tilde{s} = ((e_{f(1)}, \vec{t}_{f(1)}), (e_{f(2)}, \vec{t}_{f(2)}), \dots, (e_{f(m)}, \vec{t}_{f(m)}))$ such that $f: \hat{m} \to \hat{n}$ is an injective function, $\forall i: e'_i = e_{f(i)}$, and if $i \ne j$, then $\vec{t}_{f(j)} - \vec{t}_{f(i)} \in R(\vec{a}, \vec{b})$, where $e'_i[[\vec{a}, \vec{b}]]e'_j \in \mathcal{T}$. A multi-dimensional chronicle $C$ is said to occur in a sequence $s$ if there exists at least one occurrence of $C$ in $s$.

The support of a multi-dimensional chronicle $C$ in a sequence set $S$ is defined in the same way as for chronicles in Section 2. Likewise, the definitions of frequent chronicles and of chronicles discriminant for one set of sequences with respect to another transfer to multi-dimensional chronicles.

3.2 Discriminant Multi-dimensional Chronicles Mining

The DCM-MD algorithm shown in Listing 1 is a modification of the DCM algorithm for discriminant chronicles mining proposed in [3]. The main aspects of the modification are the data model (vectors of real numbers replace scalar integer values) and a new algorithm for mining discriminant hyperrectangle constraints (replacing the algorithm for mining discriminant temporal constraints proposed in [3]). DCM-MD operates on multi-dimensional input data and multi-dimensional chronicles and mines an incomplete set of discriminant multi-dimensional chronicles, determined by the user-supplied argument values $f_{\min}$ (fmin in the pseudocode) and $g_{\min}$ (gmin in the pseudocode).

The branching statement in Listing 1 with the condition supp(S+,{m,tinf}) > (gmin*supp(S-,{m,tinf})) checks whether a given frequent multiset is discriminant without any further specific hyperrectangle constraints (in the pseudocode, $\mathcal{T}_\infty$ is represented by the symbol tinf). If the condition is true, the extractDC(...) function is not called, so no discriminant constraints are mined for that multiset.

DCM-MD(S+, S-, fmin, gmin):
    M := extractMultiSet(S+,fmin).   // M is a set of frequent multisets
    C := emptySet().                 // C is a set of resulting discriminant
                                     // multi-dimensional chronicles
    for (m of M):
        if supp(S+,{m,tinf}) > (gmin * supp(S-,{m,tinf})):
            C.add({m,tinf}).         // adds a discriminant chronicle
                                     // without temporal constraints
        else:
            for t of extractDC(S+,S-,m,fmin,gmin):
                C.add({m,t}).        // adds a discriminant chronicle
                                     // with temporal constraints
    return C.

Listing 1: DCM-MD pseudocode
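The control flow of Listing 1 can be mirrored in Python as follows. This is only a sketch of the outer loop: the three helpers `extract_multisets`, `supp` and `extract_dc` are hypothetical callables supplied by the caller, standing in for extractMultiSet(...), supp(...) and extractDC(...), and the marker `t_inf` is an illustrative placeholder for $\mathcal{T}_\infty$. None of these names is taken from the implementation described in [1].

```python
def dcm_md(s_plus, s_minus, f_min, g_min,
           extract_multisets, supp, extract_dc, t_inf="tinf"):
    """Collect discriminant multi-dimensional chronicles as (multiset, constraints)."""
    chronicles = []
    for m in extract_multisets(s_plus, f_min):
        # Same test as in Listing 1: is the unconstrained chronicle discriminant?
        if supp(s_plus, (m, t_inf)) > g_min * supp(s_minus, (m, t_inf)):
            chronicles.append((m, t_inf))
        else:
            # Otherwise mine specific constraints for this multiset.
            for t in extract_dc(s_plus, s_minus, m, f_min, g_min):
                chronicles.append((m, t))
    return chronicles
```

Passing the helpers as arguments keeps the sketch runnable with simple stubs and makes explicit that the frequent multiset miner and the constraint miner are interchangeable components of the loop.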

The extractMultiSet(...) function extracts a set of frequent multisets from a given sequence set and a user-supplied minimal support threshold ($f_{\min}$). It applies a regular frequent itemset mining algorithm in which an event type $a \in \mathbb{E}$ occurring $n$ times in a sequence is encoded by $n$ items $I^a_1, I^a_2, \dots, I^a_n$. An intermediate frequent itemset of size $m$, denoted $(I^{e_k}_{i_k})_{1 \le k \le m}$, is extracted from the supplied sequence set. The last phase of the algorithm converts each frequent itemset $(I^{e_k}_{i_k})_{1 \le k \le m}$ into a multiset containing the mutually different event types $e_k$, $k = 1, \dots, m$, each of them exactly $i_k$ times.
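The item encoding and the final conversion step can be sketched as follows. Items $I^a_k$ are represented as pairs `(a, k)`; both function names are hypothetical. When an itemset contains several items of the same event type, the sketch takes the largest index as the multiplicity, which is an assumption about how such itemsets are resolved; the frequent itemset mining step itself is not shown.

```python
from collections import Counter


def encode_sequence(event_types):
    """Encode an event type a occurring n times as the items (a, 1), ..., (a, n)."""
    counts = Counter(event_types)
    return {(a, k) for a, n in counts.items() for k in range(1, n + 1)}


def itemset_to_multiset(itemset):
    """Convert items (e_k, i_k) into a multiset holding e_k exactly i_k times."""
    multiplicity = {}
    for event_type, index in itemset:
        # Assumption: the largest index seen for a type is its multiplicity.
        multiplicity[event_type] = max(multiplicity.get(event_type, 0), index)
    return Counter(multiplicity)
```

For example, the event types `["a", "b", "a"]` are encoded as the items `("a", 1)`, `("a", 2)` and `("b", 1)`, and the itemset `{("a", 2), ("b", 1)}` is converted into a multiset with `"a"` twice and `"b"` once.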

The extractDC(...) function is used to mine discriminant hyperrectangle constraints from a given frequent multiset $\mathcal{E} = \{\{a_1, a_2, \dots, a_n\}\}$, the disjoint sequence sets $S^+$ and $S^-$, and the user-defined parameters $f_{\min}$ and $g_{\min}$. Conceptual and implementation details regarding the extraction of discriminant hyperrectangle constraints are elaborated in [1].

4 Application to Crystal Growth Data

The need for affordable high-quality semiconducting crystals such as gallium arsenide (GaAs) is continuously increasing, particularly for electronic and photovoltaic applications. Although GaAs has a number of outstanding physical properties, its production is hampered by challenging process control due to the high melting temperature (1238 °C) and a chemically aggressive environment. In particular, in-situ measurements of process variables (e.g., temperatures, velocities, concentrations) in the GaAs have a high contamination potential and lead to low crystal quality. Moreover, in-situ visual observation of the crystal growth is not possible. Predicting the position of the crystallization front, i.e., the length of the grown crystal after applying a certain growth recipe (i.e., temporal profiles of the heater power), therefore provides key information for process monitoring.

Here, we consider the vertical gradient freeze (VGF) method for the growth of GaAs crystals. The VGF method involves progressive freezing of the melt from its lower end upward, achieved by moving the desired temperature gradient through the furnace via a temporal change of the heating power. A 1-dimensional model of VGF-GaAs growth is shown in Figure 1.

4.1 Used Data

The implementation described above, which extends the method proposed in [3], has been applied to data gathered in the German Research Foundation (DFG) project “Model-based control and regulation of the VGF crystal growth process using distributed parametric methods”. The data records the position of the solid/liquid interface of GaAs crystals grown by the VGF method, together with the evolution of temperatures in the 0th–4th quarter of the GaAs height. The data has been obtained by solving the inverse problem for a simplified one-dimensional model of the VGF process for different desired growth rates as described in [7], using as input the evolution of 2-dimensional vectors describing the heat flux in and the heat flux out (Figure 1). Each simulation covered 100 time points, among which the 5th, 10th, ..., 95th and 100th serve as milestone times in the following.

For an application of the method presented in Section 3, event types and events have been defined as follows. The 2-dimensional inputs of the 500 numeric simulations underlying [7] have been clustered into $k = 20$ clusters using the Matlab implementation of the standard k-means clustering algorithm. The centers of the resulting clusters are listed in Table 1. An event type is now the fact that the input belongs to a particular cluster. For each numeric simulation, an event type is recorded at every milestone time. Consequently, the size of any multiset of event types from one numeric simulation is at most 20. An event is a pair $(e, T)$, where $e$ is an event type and $T \in \mathbb{R}^5$ is the vector of temperatures obtained in the numeric simulation at the milestone time when $e$ was recorded, provided the position of the solid/liquid interface at the end of that simulation was at least 17.25 cm. There were 255 such simulations available, thus we have 255 event sequences of length 20, due to the 20 milestone times. They were divided into two disjoint sequence sets $S^+$ and $S^-$, where

$$S^+ = \{((e_1, T_1), \dots, (e_{20}, T_{20})) \mid e_i, T_i,\ i = 1, \dots, 20, \text{ originated in a simulation ending with the position of the solid/liquid interface} > 25 \text{ cm}\} \tag{9}$$

and $S^-$ contains the remaining sequences.

The experimental setup aimed at a chronicle set $\mathcal{C}$ containing about 20-30 elements and including both chronicles discriminant for $S^+$ with respect to $S^-$ and chronicles discriminant for $S^-$ with respect to $S^+$. Each chronicle $(\mathcal{E}, \mathcal{T}) \in \mathcal{C}$ should contain only a minimal number of $\mathcal{T}_\infty$ constraints.
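The construction of event sequences from simulation outputs described above can be sketched as follows. The sketch assumes that cluster centers are already available (the clustering itself was done in Matlab and is not reproduced), that an input is assigned to the cluster with the nearest center in Euclidean distance, and that each simulation is stored as two lists indexed by time point. The names `MILESTONES`, `nearest_cluster` and `build_event_sequence` are hypothetical.

```python
import math

# 5th, 10th, ..., 100th time point (1-based), i.e. 20 milestone times.
MILESTONES = range(5, 101, 5)


def nearest_cluster(point, centers):
    """Index of the cluster center closest to the given input vector."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def build_event_sequence(inputs, temperatures, centers):
    """Build the event sequence of one simulation.

    inputs[t - 1]:       2-dimensional heat-flux input at time point t
    temperatures[t - 1]: 5-dimensional temperature vector at time point t
    """
    return [(nearest_cluster(inputs[t - 1], centers), tuple(temperatures[t - 1]))
            for t in MILESTONES]
```

Each resulting event pairs a cluster index, playing the role of the event type, with the temperature vector at the same milestone time, so one simulation yields a sequence of 20 events.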

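Assume that $C = (\mathcal{E}, \mathcal{T})$ is a chronicle, $\mathcal{C}$ is a set of chronicles and $t_s \in [0, 1]$. The chronicle specificity, denoted $s(C)$, is defined as the proportion of constraints of $C$ that actually constrain something:

$$s(C) = \frac{\#\{e[[t, t']]e' \in \mathcal{T} \mid e[[t, t']]e' \notin \mathcal{T}_\infty\}}{\#\mathcal{T}}.$$

The chronicle set specific for a specificity threshold $t_s$, denoted $s(\mathcal{C}, t_s)$, is defined as $s(\mathcal{C}, t_s) = \{C \mid C \in \mathcal{C} \ \&\ s(C) \ge t_s\}$.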
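Specificity-based filtering can be sketched as below. The sketch rests on two assumptions that should be checked against the actual tool: specificity is taken as the fraction of a chronicle's constraints that are not of the unconstrained form (which keeps the value in $[0, 1]$, in line with $t_s \in [0, 1]$), and a chronicle without any constraints is given specificity 0. All names are hypothetical, and a chronicle is represented as a pair of a multiset and a list of constraints `(e1, e2, lower, upper)`.

```python
import math


def is_unconstrained(lower, upper):
    """True if the hyperrectangle spans (-inf, inf) in every dimension."""
    return (all(l == -math.inf for l in lower)
            and all(u == math.inf for u in upper))


def specificity(constraints):
    """Fraction of constraints that restrict at least one dimension."""
    if not constraints:
        return 0.0  # convention chosen for this sketch
    specific = sum(1 for (_, _, lower, upper) in constraints
                   if not is_unconstrained(lower, upper))
    return specific / len(constraints)


def specific_chronicles(chronicles, t_s):
    """Keep the chronicles (multiset, constraints) with specificity >= t_s."""
    return [c for c in chronicles if specificity(c[1]) >= t_s]
```

With a threshold of 0.7, a chronicle in which three of four constraints restrict some dimension (specificity 0.75) is kept, whereas one in which only one of two does (specificity 0.5) is dropped.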

The following metrics were used to evaluate the suitability of the parameters passed to the DC-PBC component. $\#M$ is the size of the set $M$ of frequent multisets introduced in the pseudocode of the DCM-MD algorithm in Listing 1. $\#\mathcal{E}$ is the number of distinct frequent multisets that occur in some discriminant chronicle of the resulting chronicle set $\mathcal{C}$: $\#\mathcal{E} = \#\{\mathcal{E} \mid (\exists \mathcal{T}, \text{ a set of hyperrectangle constraints})\ (\mathcal{E}, \mathcal{T}) \in \mathcal{C}\}$. $\max s(\mathcal{C}) = \max\{s(C) \mid C \in \mathcal{C}\}$ is the maximal specificity value found among the chronicles in $\mathcal{C}$. $\#s(\mathcal{C}, t_s)$ is the number of chronicles in $\mathcal{C}$ that are specific for $t_s$.

The following parameters were tuned:
$f_{\min}$, implemented by the --fmin parameter, is the minimal support threshold.
$g_{\min}$, implemented by the --gmin parameter, is the minimal growth rate threshold of the DCM-MD algorithm as introduced in Listing 1.
$\min(\#\mathcal{E})$, implemented by the --mincs parameter, is the minimal chronicle event multiset size.
$\max(\#\mathcal{E})$, implemented by the --maxcs parameter, is the maximal chronicle event multiset size.
$t_s$ is the specificity threshold for a custom tool implemented for extracting specific discriminant chronicles from a set of discriminant chronicles.

After evaluating the metrics for each parameter tuning step, the argument values $f_{\min} = 0.1$, $g_{\min} = 5000$, $\min(\#\mathcal{E}) = 2$, $\max(\#\mathcal{E}) = 5$, $t_s = 0.7$ proved sufficient for retrieving a set of specific discriminant chronicles with the desired properties.

4.3 Examples of Extracted Rules

The implementation of generalized DC-PBC available at github.com/busarade-itat/md-dc-pbc was invoked for the data described above with argument values --mincs 2, --maxcs 5, --fmin 0.1, --gmin 5000.

The resulting set of discriminant chronicles was afterwards filtered to include only specific discriminant chronicles. To this end, a tool chronicle_statgen available at github.com/busarade-itat was invoked with arguments --minspec 0.7, --vecsize 5.

The final result is presented in Tables 2 and 3, comprising a total of 26 specific discriminant chronicles: 18 of them discriminant for $S^+$ with respect to $S^-$, the remaining 8 discriminant for $S^-$ with respect to $S^+$.

The proposed method enables prediction of the conditions for reaching a targeted crystal length by following the differences among segments in the temporal profiles of temperatures at characteristic points in the GaAs. If the same approach is further applied to experimental temperature profiles measured by thermocouples in the heaters (outside of the melt and crystal), as in real experiments, it will be possible to determine the moment of reaching the desired crystal length without visual observation and without GaAs contamination. At that moment, the crystal growth step terminates and the cooling down of the furnace starts. Such an accurate prediction of the end of the solidification step will be very beneficial for the process economy and the final crystal quality.

Fragment of Tables 2 and 3 (one extracted constraint set): $\{e_1[[(-30.8, -30.8, -30.9, -31.0, -31.0), (-30.1, -30.2, -30.2, -30.3, -30.3)]]e_2\}$

Conclusion

The paper has presented a generalization of the method for discriminant chronicles mining proposed in [3]. This generalization has been motivated by the objective to extract classification rules from crystal growth data, which brings two additional problems not pertaining to the data to which the original method had been applied: the events are described with a vector of attributes instead of a single scalar attribute, and the attributes are real-valued instead of integer-valued. The theoretical fundamentals of the method in [3] have been extended to tackle those two problems, and the system for discriminant chronicles mining based on [3] has been adapted to accommodate those extensions, together with some additional implementation improvements such as refactoring. As a proof of concept of the presented generalization, it has been applied, using the modified system, to real-world data with events characterizing the heat fluxes for the growth of GaAs crystals by the vertical gradient freeze method, and with a vector of 5 attributes recording the temperatures at different heights. Although most of the hyperrectangles in Tables 2 and 3 are not very restrictive, the extracted classification rules nevertheless show that the proposed approach makes it possible to assess whether the grown crystal will have a desired length based solely on the temperature profiles. Regarding future research, it would be interesting to assess how small changes to the mined hyperrectangle constraints affect the manufacturing process of the VGF-GaAs crystals.

Acknowledgement

The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 18-18080S.

[1] R. Buša. Implementation of a generalized version of a system for discriminant chronicles mining. Czech Technical University in Prague, Computing and Information Centre, 2020.
[2] W.W. Cohen. Fast effective rule induction. Pages 115-123, 1995.
[3] Y. Dauxais, T. Guyet, D. Gross-Amblard, and A. Happe. Discriminant chronicles mining. 2017.
[4] R. Dechter, I. Meiri, and J. Pearl. Temporal constraint networks. Artificial Intelligence, 49:61-95, 1991.
[5] C. Dousson and T. Vu Duong. Discovering chronicles with numerical time constraints from alarm logs for monitoring dynamic systems. In IJCAI, pages 620-626, 1999.
[6] N. Dropka and M. Holeňa. Optimization of magnetically driven directional solidification of silicon using artificial neural networks and Gaussian process models. Journal of Crystal Growth, 471:53-61, 2017.
[7] N. Dropka, M. Holeňa, S. Ecklebe, C. Frank-Rotsch, and J. Winkler. Fast forecasting of VGF crystal growth process by dynamic neural networks. Journal of Crystal Growth, 521:9-14, 2019.
[8] G. Garriga. Summarizing sequential data with closed partial orders. In SDM, 2005.
[9] M. Ghallab, D. Nau, and P. Traverso. Automated Planning and Acting. Cambridge University Press, Cambridge, 2016.
[10] D.J. Hand. Construction and Assessment of Classification Rules. John Wiley and Sons, New York, 1997.
[11] M. Holeňa, P. Pulc, and M. Kopp. Classification Methods for Internet Applications. Springer, 2020.
[12] H.C. Lau, T. Ou, and M. Sim. Robust temporal constraint networks. In International Conference on Tools with Artificial Intelligence, pages 82-88, 2005.