<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Information Control Systems &amp; Technologies</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Hyperparameter tuning in the learning of multithreshold neurons</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vladyslav Kotsovsky</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>23</volume>
      <issue>25</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
<p>The modification of the online learning algorithm for multi-valued multithreshold neurons is proposed in the paper. Conditions are stated and proved that ensure finite, successful learning. The influence of the algorithm hyperparameters on the learning process is analyzed on the basis of simulation results. Recommendations are formulated concerning the choice of values of these hyperparameters, which may significantly reduce the learning time. The experimental results show that the proposed algorithm is able to greatly outperform the learning procedure of Obradović and Parberry. The obtained results can be useful in the design of artificial neural network classifiers employing multithreshold activation functions in network nodes.</p>
      </abstract>
      <kwd-group>
        <kwd>Multithreshold neural unit</kwd>
        <kwd>classification</kwd>
        <kwd>machine learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Neural networks (NN) became mainstream in modern artificial intelligence (AI) systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and
smart data processing [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Both hardware [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and software infrastructure of AI [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] widely employ
concepts and solutions based on neural-like approaches [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ]. Different network architectures [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as
well as appropriate learning and synthesis techniques [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ] provide artificial
NNs with powerful capabilities for solving numerous real-time problems. Modern NN-based AI systems depend on billions
of parameters [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and their behavior is influenced by many hyperparameters [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which are used in
the learning of the underlying machine learning (ML) model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This implies the importance of the
proper choice of these hyperparameters during the training process in order to adapt the AI system to
the solution of the given ML problem [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        The tremendous power of the latest AI systems is provided by the capabilities of the underlying NNs [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
Therefore, the main efforts in neural computation are devoted to the improvement of the network
capacities [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It can be achieved in many ways [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The most popular one consists in increasing
the network size by using deeper models with many neurons in every hidden layer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], as well as
the application of new hybrid network architectures, e.g., as in [
        <xref ref-type="bibr" rid="ref2 ref5 ref6">2, 5, 6</xref>
        ]. This approach can be
extremely successful, but usually it requires considerable computational resources and may be very
costly and inappropriate in many cases [
        <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
        ].
      </p>
      <p>
        The second approach consists in the use of a relatively small NN enhanced by the application of
modified network nodes, which are more powerful than usual linear neural units with RELU- or
sigmoid-like activation functions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. In the simplest cases a single such unit is sufficient to solve a
classification task on small- or medium-sized dataset [
        <xref ref-type="bibr" rid="ref11 ref15">11, 15</xref>
        ].
      </p>
      <p>
        In order to overcome the limitations of classical neural units, many modified models were
proposed, e.g., in [
        <xref ref-type="bibr" rid="ref16 ref2 ref9">2, 9, 16</xref>
        ]. They all were intended to increase the recognition capacity of a single neuron.
As mentioned in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], they can be divided into at least two classes.
      </p>
      <p>
        The first class contains models using modified modes of aggregating the input signals
of the neural unit instead of the usual weighted sum of inputs. This approach includes different
kernel models, which make the shape of decision region of the neural unit more complicated and
more appropriate to the distribution of data patterns [
        <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
        ].
      </p>
      <p>
        The second class consists of models that benefit from the use of a modified activation function [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ].
This class is sometimes more useful than the first one, because its representatives adapt the kind of
activation to the particular task without adding many new parameters to the ML model [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Note
that this approach requires the development of special learning techniques adapted to the chosen
modification of the activation function [
        <xref ref-type="bibr" rid="ref18 ref9">9, 18</xref>
        ].
      </p>
      <p>
        The current research is devoted to the study of one kind of neural model belonging to the
second class: the multi-valued multithreshold neural unit [
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ]. The goal of the research is the
design of a learning algorithm for such multithreshold units and the investigation of which values of the
algorithm hyperparameters should be used in order to speed up the training process and improve the
capacity of the resulting neuron.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>
        The multithreshold approach was proposed in early studies in threshold logic [
        <xref ref-type="bibr" rid="ref17 ref19 ref21">17, 19, 21</xref>
        ]. The first
models employed the multithreshold binary-valued activation in order to enhance the capacity of
the classical threshold gate based on the famous McCulloch and Pitts model [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. This enhancement
was theoretically confirmed in [
        <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
        ], where it was shown that a linear threshold unit strengthened
by additional thresholds considerably outperforms the single-threshold gate. The explanations and
quantitative expressions of the increase of the unit capacity can be found in [
        <xref ref-type="bibr" rid="ref17 ref24">17, 24</xref>
        ]. Despite the
strict confirmation and justification of the advantage of multithreshold models in pattern
classification, the practical benefits of this approach were almost missing, because few synthesis (as
well as learning) algorithms were proposed for such multithreshold systems. This, in turn,
led to a decline of interest in the development and use of multithreshold models and systems
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        Hardness results for multithreshold units stated in [
        <xref ref-type="bibr" rid="ref16 ref20">16, 20</xref>
        ] explain that the learning task for a
multithreshold unit is considerably harder in the sense of complexity theory than the similar task for a
single-threshold unit. This conclusion was also confirmed for general multithreshold neural units
with an arbitrary number of thresholds in [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Paper [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] also contains the result concerning the
connection between multithreshold neurons and single-threshold neural networks with a single
hidden layer.
      </p>
      <p>
        Nevertheless, in [
        <xref ref-type="bibr" rid="ref26 ref27 ref6">6, 26, 27</xref>
        ] some recent advances were observed in the application of bithreshold
and multithreshold neural units and networks, respectively. In the bithreshold case it was caused by
new approaches in the synthesis of NN by employing bithreshold neurons in hidden layers of
networks [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. This approach can be combined with reducing the drawbacks of bithreshold
activations [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] by making the network deeper using hybrid blocks, which consist of groups of
heterogeneous neurons preserving the information concerning the location of training patterns [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
A similar approach was proposed in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], where the smoothed modification of activation function
was used as well as neuron center defined by the portion of training patterns, which activate this
neuron.
      </p>
      <p>
        In the multithreshold case the progress is related to the use of multi-valued outputs instead of
binary ones [
        <xref ref-type="bibr" rid="ref28">28</xref>
        ]. This leads to a lower complexity of the learning task compared to the case of the
application of binary-valued neurons, because the complexity of the learning of multi-valued
multithreshold neurons proved to be equal to the complexity of the learning of linear single-threshold
units [
        <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Models and methods</title>
      <p>The multi-valued multithreshold model of the neuron will be considered in this section, as well as
issues related to its learning.</p>
      <sec id="sec-3-1">
        <title>3.1. Model of multi-valued multithreshold neural unit</title>
        <p>
          Consider a model of a multithreshold neuron. It is a computational unit provided with a weight vector
w = (w_1, …, w_n) ∈ R^n and an ordered threshold vector t = (t_1, …, t_k) ∈ R^k. Each weight w_i is associated
with the corresponding input x_i, i = 1, …, n. The use of multiple thresholds allows the neuron to operate
in two modes: binary-valued and multi-valued, respectively [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. Further, only multi-valued neurons
will be considered. The unit output is denoted by y and is defined in the following way:
          <disp-formula id="eq1"><tex-math><![CDATA[
y = \begin{cases}
0, & \text{if } w \cdot x < t_1, \\
1, & \text{if } t_1 \le w \cdot x < t_2, \\
\;\vdots \\
k-1, & \text{if } t_{k-1} \le w \cdot x < t_k, \\
k, & \text{if } t_k \le w \cdot x,
\end{cases} \tag{1}
]]></tex-math></disp-formula>
where w · x denotes the inner product of the weight vector w and the input vector x.
        </p>
        <p>It is evident that the neural unit described by equation (1) takes k + 1 different values. Therefore,
we can use it as the single output node of an NN classifier in the case when the number of classes is
greater by 1 than the number of thresholds.</p>
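        <p>For illustration, the following minimal sketch (in Python with NumPy; the function name is hypothetical, not from the paper) computes the output (1) of a neuron (w, t) on an input x:</p>
        <preformat><![CDATA[
import numpy as np

def multithreshold_output(w, t, x):
    """Output y of the multi-valued k-threshold neuron (w, t) per (1):
    y = i  iff  t_i <= w . x < t_{i+1}, with t_0 = -inf, t_{k+1} = +inf."""
    s = float(np.dot(w, x))
    # count of thresholds t_i with t_i <= s, which is the class index
    return int(np.searchsorted(np.asarray(t), s, side="right"))
]]></preformat>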
        <p>The pair (w, t) completely defines the multi-valued multithreshold neuron. Further, this pair will
be used as the short notation for the multi-valued multithreshold neuron with weight vector w and
threshold vector t.</p>
        <p>Let A be an arbitrary set of patterns in n-dimensional real space. Every multi-valued k-threshold
neuron (w, t) induces the ordered partition (A_0, A_1, …, A_k) of the set A, where the set A_i contains all
elements of the set A such that t_i ≤ w · x &lt; t_{i+1} (i = 0, …, k). Note that two additional
pseudothresholds t_0 = −∞ and t_{k+1} = +∞ were used in the previous inequality for convenience.</p>
        <p>
          This partition is called an ordered k-threshold partition of the set A by strongly k-separable sets
A_0, A_1, …, A_k [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. Notice that the order matters for such partitions.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Learning of multi-valued k-threshold neural unit</title>
        <p>
          Two algorithms for single multi-valued multithreshold neuron were proposed in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]. This
subsection contains a brief description of the modification of the first one.
        </p>
        <p>Consider the search for a multi-valued k-threshold neuron (w, t) that performs the desired
ordered partition (A_0, A_1, …, A_k) of the finite set A. We can consider the elements of the set A as
members of our training set. It is evident that without loss of generality one can replace all non-strict
inequalities in (1) by strict ones (this is true because A is finite). Furthermore, it is also easy to show
that the learning task is equivalent to the solution of the following system of linear inequalities:
<disp-formula id="eq2"><tex-math><![CDATA[
\begin{cases}
t_0 < w \cdot x < t_1, & \text{if } x \in A_0, \\
t_1 < w \cdot x < t_2, & \text{if } x \in A_1, \\
\;\vdots \\
t_{k-1} < w \cdot x < t_k, & \text{if } x \in A_{k-1}, \\
t_k < w \cdot x < t_{k+1}, & \text{if } x \in A_k.
\end{cases} \tag{2}
]]></tex-math></disp-formula></p>
        <p>Note that, similarly to the definition of the ordered partition, two additional sentinel thresholds
t_0 = −∞ and t_{k+1} = +∞ were used in (2) in order to simplify notation. There exists, in addition, an
ML-like interpretation of the solution of (2). We can consider it as the task of supervised learning on the
dataset consisting of training pairs (x, y_x), where x ∈ A and y_x = i if and only if x ∈ A_i.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.2.1. Data preprocessing</title>
        <p>
          Consider the method of the transformation of the task (2) to the solution of a homogeneous system
of linear inequalities in n + k variables w_1, …, w_n, t_1, …, t_k, which was proposed in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ].
        </p>
        <p>Let us search for the solution vector in the form v = (w_1, …, w_n, −t_1, …, −t_k), which contains all sought
weights as well as all negated thresholds. Consider the sequence of transformations f_j : R^n → R^{n+k},
j = 1, …, k, defined by
<disp-formula id="eq3"><tex-math><![CDATA[
f_j(x) = (x_1, \ldots, x_n, e_j), \tag{3}
]]></tex-math></disp-formula>
where e_j denotes the j-th standard unit vector of R^k, so that f_j(x) · v = w · x − t_j. It follows from (3)
that every chained inequality t_j &lt; w · x &lt; t_{j+1} in (2) is equivalent to the following system:
<disp-formula id="eq4"><tex-math><![CDATA[
\begin{cases}
f_j(x) \cdot v > 0, \\
-f_{j+1}(x) \cdot v > 0.
\end{cases} \tag{4}
]]></tex-math></disp-formula></p>
        <p>Thus, it is possible to reduce (2) to the solution of the homogeneous system
<disp-formula id="eq5"><tex-math><![CDATA[
b_i \cdot v > 0, \quad i = 1, \ldots, m, \tag{5}
]]></tex-math></disp-formula>
where the vectors b_i are obtained using (3) and (4). Note that in the case of the use of the
pseudo-thresholds t_0 = −∞ and t_{k+1} = +∞, (5) consists of exactly 2|A| inequalities, where |A| is the
cardinality of the set A, but we can drop the inequalities involving the pseudo-thresholds. Thus,
the actual value of m is 2|A| − |A_0| − |A_k|. Let V(B) denote the set of all solutions of (5).</p>
        <p>
          The reduction process was described in detail in [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ], where the corresponding function
ReduceSet(A_0, A_1, …, A_k) was defined, which returns the set {b_1, …, b_m}.
        </p>
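        <p>For illustration, a minimal sketch of this reduction in Python with NumPy is given below (the function name and data layout are assumptions, not the code of [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]):</p>
        <preformat><![CDATA[
import numpy as np

def reduce_set(partition):
    """Reduce an ordered partition (A_0, ..., A_k) into the set
    B = {b_1, ..., b_m} such that learning amounts to solving
    b_i . v > 0 for v = (w_1, ..., w_n, -t_1, ..., -t_k), per (3)-(5).
    `partition` is a list of k + 1 arrays of shape (|A_i|, n)."""
    k = len(partition) - 1

    def f(j, x):
        # f_j(x) = (x, e_j): x followed by the j-th unit vector of R^k, eq. (3)
        e = np.zeros(k)
        e[j - 1] = 1.0
        return np.concatenate([x, e])

    B = []
    for i, A_i in enumerate(partition):
        for x in A_i:
            if i > 0:                # t_i < w.x      <=>   f_i(x) . v > 0
                B.append(f(i, x))
            if i < k:                # w.x < t_{i+1}  <=>  -f_{i+1}(x) . v > 0
                B.append(-f(i + 1, x))
    return np.array(B)               # m = 2|A| - |A_0| - |A_k| rows
]]></preformat>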
      </sec>
      <sec id="sec-3-4">
        <title>3.2.2. Online algorithm with shift</title>
        <p>
          Consider the online version of the learning algorithm for a multi-valued k-threshold neural unit. The
idea of this algorithm is from [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and it actually borrows many steps from the relaxation algorithm
for systems of linear inequalities [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. The pseudocode of this algorithm is shown below:
        </p>
        <preformat><![CDATA[
ShiftedMultithreshold(A_0, A_1, ..., A_k, r, λ, v_0, η, d)
 1  B ← NormalizedSet(A_0, A_1, ..., A_k)
 2  v ← v_0
 3  (i, j, err) ← (0, 0, 1)
 4  while i < r and err > 0:
 5      err ← 0
 6      shuffle B
 7      for b in B:
 8          s ← b · v
 9          if s > 0:
10              continue
11          j ← j + 1
12          err ← err + 1
13          v ← v + η(j)(d − λs)b
14      i ← i + 1
15  w ← (v_1, ..., v_n)
16  t ← (−v_{n+1}, ..., −v_{n+k})
17  return w, t
]]></preformat>
        <p>The above algorithm has a single parameter (A_0, A_1, …, A_k), an ordered partition consisting of
strongly k-separable sets. The algorithm also has five hyperparameters: r, the upper bound on the
number of learning epochs; λ, a binary value defining the learning mode; v_0 ∈ R^{n+k}, an initial
approximation; η, the schedule function defining the value of the learning rate; and d, a
nonnegative real value, which is a measure of the shift used during each correction. They are all used in
the crucial step 13, where the correction of the vector v is performed. The learning process continues
until we find such a vector v ∈ R^{n+k} that all inequalities in (5) are satisfied. If this is not true, then there
exists a vector b such that b · v ≤ 0. This vector b is used in step 13 in order to improve the current
value of the vector v by nudging it in the direction of b. Note that the inner product s is used in this step
to define the correction step, as well as the shift d and the current value of the learning rate.</p>
        <p>
          It should be mentioned that the considered algorithm differs from the similar algorithm from [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ]
only in steps 1 and 13, respectively. In step 1, an additional preprocessing transformation is
performed, which consists in the normalization of the elements of the set B in order to obtain a set
of vectors with unit Euclidean norm. The proposed modification of the learning algorithm also uses an
additional hyperparameter d in step 13, which should be non-negative. This allows us to avoid
possible convergence to a point lying on the bounding surface of the set V(B). Notice also that the
correction is performed only in the case s ≤ 0. Hence, during every iteration performed in steps
7–13 over the elements of the set B, the value of d − s is always equal to d + |s| if correction step 13
was reached.
        </p>
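        <p>A minimal NumPy sketch of the above procedure is shown below. It assumes the step-13 rule v ← v + η(j)(d − λs)b reconstructed above, and the schedule is passed the current inner product s so that rates of the form (6) can be expressed:</p>
        <preformat><![CDATA[
import numpy as np

def shifted_multithreshold(B, r, lam, v0, eta, d, k, rng=None):
    """Sketch of ShiftedMultithreshold. B holds the vectors b_1, ..., b_m
    produced by the reduction of section 3.2.1; eta(j, s) is the
    learning-rate schedule; lam = 0 gives perceptron-like corrections,
    lam = 1 relaxation-like ones."""
    rng = rng or np.random.default_rng()
    B = B / np.linalg.norm(B, axis=1, keepdims=True)   # step 1: unit norms
    v = np.asarray(v0, dtype=float).copy()
    i = j = 0
    err = 1
    while i < r and err > 0:                           # step 4
        err = 0
        rng.shuffle(B)                                 # step 6
        for b in B:
            s = b @ v                                  # step 8
            if s > 0:                                  # inequality holds
                continue
            j += 1
            err += 1
            v += eta(j, s) * (d - lam * s) * b         # step 13 (shift d)
        i += 1
    n = v.size - k
    return v[:n], -v[n:]                               # steps 15-17: (w, t)
]]></preformat>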
        <p>The issues related to the convergence of the above algorithm will be considered in the next
subsection.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.3. Convergence conditions for learning algorithms</title>
        <p>Let us consider the theoretical foundations of the above algorithm that ensure its convergence and even
finiteness.</p>
        <p>Proposition. If the finite sets A_0, A_1, …, A_k are strongly k-separable,
<disp-formula id="eq6"><tex-math><![CDATA[
\eta(j) = \eta_1(j) + \frac{\eta_2(j)}{d + |b_j \cdot v_{j-1}|}, \tag{6}
]]></tex-math></disp-formula>
<disp-formula id="eq7"><tex-math><![CDATA[
0 \le \eta_1(j) \le 2, \qquad 0 \le \eta_2(j) \le \eta_{\max}, \qquad \eta_1(j) + \eta_2(j) \ge \eta_{\min}, \tag{7}
]]></tex-math></disp-formula>
where η_min and η_max are arbitrary positive constants, b_j is the training vector used in the jth correction,
and v_{j−1} is the value of the sought vector v after the previous correction, then there exists r such that
after at most r corrections ShiftedMultithreshold yields a multi-valued k-threshold neuron (w, t), which
produces the partition (A_0, A_1, …, A_k).</p>
        <p>
          In the first case (λ = 0) the learning process is similar to classical perceptron learning with the
learning rates dη(j) used in the jth correction, or to its extension to the case of multi-valued
multithreshold functions proposed by Obradović and Parberry (see [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]). It is well known [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] that the equality
<disp-formula id="eq8"><tex-math><![CDATA[
\lim_{r \to \infty} \frac{\sum_{j=1}^{r} \eta^2(j)}{\Bigl(\sum_{j=1}^{r} \eta(j)\Bigr)^{2}} = 0 \tag{8}
]]></tex-math></disp-formula>
is a sufficient condition of the finiteness of the learning. Let us prove that (8) follows from the
correction rule in step 13 and conditions (6), (7).
        </p>
        <p>Prove first that the sequence (S_r)_{r∈N} is divergent, where S_r = Σ_{j=1}^{r} η(j) (note that the denominator
of the fraction in (8) contains the squared value of S_r). Suppose the contrary. Then
<disp-formula><tex-math><![CDATA[
|b_j \cdot v_{j-1}| = |b_j \cdot (v_0 + d\,\eta(1) b_1 + \cdots + d\,\eta(j-1) b_{j-1})| \le \|v_0\| + d\,S_{j-1} \le D
]]></tex-math></disp-formula>
for some positive constant D, because the dot product of unit vectors does not exceed 1.</p>
        <p>Therefore, η(j) ≥ η_1(j) + η_2(j)/(d + D) ≥ η_min · min(1, (d + D)^{−1}) > 0. This implies that (S_r) is divergent.</p>
        <p>Thus, our assumption about the convergence of the sequence (S_r) was wrong. Therefore, under the
conditions of the Proposition this sequence always diverges.</p>
        <p>Consider the numerator in (8). We can split the corresponding sum into two parts:
<disp-formula><tex-math><![CDATA[
\sum_{j=1}^{r} \eta^2(j) = \sum_{j:\,\eta(j) \le 1} \eta^2(j) + \sum_{j:\,\eta(j) > 1} \eta^2(j).
]]></tex-math></disp-formula></p>
        <p>Let S′_r be the first sum in the previous equation. It is evident that
<disp-formula><tex-math><![CDATA[
S'_r = \sum_{j:\,\eta(j) \le 1} \eta^2(j) \le \sum_{j:\,\eta(j) \le 1} \eta(j) \le \sum_{j=1}^{r} \eta(j) = S_r.
]]></tex-math></disp-formula>
Therefore,
<disp-formula><tex-math><![CDATA[
\frac{S'_r}{S_r^2} \le \frac{S_r}{S_r^2} = \frac{1}{S_r} \xrightarrow[r \to \infty]{} 0.
]]></tex-math></disp-formula></p>
        <p>Consider S″_r, the second sum in the corresponding equation. By (6) and (7),
<disp-formula><tex-math><![CDATA[
S''_r = \sum_{j:\,\eta(j) > 1} \Bigl(\eta_1(j) + \frac{\eta_2(j)}{d + |b_j \cdot v_{j-1}|}\Bigr)^{2} \le \sum_{j:\,\eta(j) > 1} \Bigl(2 + \frac{\eta_{\max}}{d}\Bigr)^{2} = n''_r \Bigl(2 + \frac{\eta_{\max}}{d}\Bigr)^{2},
]]></tex-math></disp-formula>
where n″_r is the number of terms in S″_r. If for all r the numbers n″_r are bounded by some n_max ∈ N, then
<disp-formula><tex-math><![CDATA[
\lim_{r \to \infty} \frac{S''_r}{S_r^2} \le \Bigl(2 + \frac{\eta_{\max}}{d}\Bigr)^{2} \lim_{r \to \infty} \frac{n_{\max}}{S_r^2} = 0.
]]></tex-math></disp-formula>
Otherwise, let us estimate S_r²:
<disp-formula><tex-math><![CDATA[
S_r^2 = \Bigl(\sum_{j=1}^{r} \eta(j)\Bigr)^{2} \ge \Bigl(\sum_{j:\,\eta(j) > 1} \eta(j)\Bigr)^{2} \ge (n''_r)^2.
]]></tex-math></disp-formula>
Hence,
<disp-formula><tex-math><![CDATA[
\lim_{r \to \infty} \frac{S''_r}{S_r^2} \le \lim_{r \to \infty} \frac{n''_r \bigl(2 + \eta_{\max}/d\bigr)^{2}}{(n''_r)^2} = \Bigl(2 + \frac{\eta_{\max}}{d}\Bigr)^{2} \lim_{r \to \infty} \frac{1}{n''_r} = 0.
]]></tex-math></disp-formula>
Therefore,
<disp-formula><tex-math><![CDATA[
\lim_{r \to \infty} \frac{1}{S_r^2} \sum_{j=1}^{r} \eta^2(j) = \lim_{r \to \infty} \Bigl(\frac{S'_r}{S_r^2} + \frac{S''_r}{S_r^2}\Bigr) = 0,
]]></tex-math></disp-formula>
and (8) holds.</p>
        <p>Consider now the case λ = 1. Let us prove that the sequence (v_j) satisfies the Fejér condition
<disp-formula id="eq9"><tex-math><![CDATA[
\|v_j - v\| \le \|v_{j-1} - v\| \tag{9}
]]></tex-math></disp-formula>
for all v ∈ V(B). It is evident that (9) is equivalent to ‖v_j − v‖² ≤ ‖v_{j−1} − v‖². Since
<disp-formula><tex-math><![CDATA[
\|v_j - v\|^2 = \|v_j - v_{j-1}\|^2 + 2 (v_j - v_{j-1}) \cdot (v_{j-1} - v) + \|v_{j-1} - v\|^2,
]]></tex-math></disp-formula>
the Fejér condition (9) is satisfied if
<disp-formula><tex-math><![CDATA[
\|v_j - v_{j-1}\|^2 + 2 (v_j - v_{j-1}) \cdot (v_{j-1} - v) \le 0
]]></tex-math></disp-formula>
for all v ∈ V(B). We can rewrite step 13 of the learning algorithm in the following way:
<disp-formula id="eq10"><tex-math><![CDATA[
v_j = v_{j-1} + \eta(j) (d - b_j \cdot v_{j-1})\, b_j. \tag{10}
]]></tex-math></disp-formula>
Therefore, it is possible to rewrite the last inequality as follows:
<disp-formula><tex-math><![CDATA[
\eta^2(j) (d - b_j \cdot v_{j-1})^2 \|b_j\|^2 + 2 \eta(j) (d - b_j \cdot v_{j-1})\, b_j \cdot (v_{j-1} - v) \le 0.
]]></tex-math></disp-formula>
Remember that b_j · v_{j−1} ≤ 0 in every correction. Thus, d − b_j · v_{j−1} = d + |b_j · v_{j−1}| and the last
quadratic inequality holds only if
<disp-formula><tex-math><![CDATA[
0 \le \eta(j) \le \frac{2 (v - v_{j-1}) \cdot b_j}{d + |b_j \cdot v_{j-1}|}.
]]></tex-math></disp-formula>
We can rewrite this inequality in the following form:
<disp-formula id="eq11"><tex-math><![CDATA[
0 \le \eta(j) \le 2 \Bigl(1 + \frac{v \cdot b_j - d}{d + |b_j \cdot v_{j-1}|}\Bigr). \tag{11}
]]></tex-math></disp-formula></p>
        <p>Let us slightly relax the Fejér condition from the whole set V(B) to its own subsets. By using the
techniques described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], it is easy to verify that for every d ≥ 0 and every ε > 0 the cone V(B)
contains such a point v* = v*(d, ε) that the closed unit ball B₁(v*) = {x ∈ R^{n+k} : ‖x − v*‖ ≤ 1} is a subset
of V(B) and x · b ≥ d + ε for all x ∈ B₁(v*) and all b ∈ B. Therefore, it follows from (11) that the sequence (v_j)
satisfies the Fejér condition (9) for the ball B₁(v*) if
<disp-formula id="eq12"><tex-math><![CDATA[
0 \le \eta(j) \le 2 \Bigl(1 + \frac{\varepsilon}{d + |b_j \cdot v_{j-1}|}\Bigr). \tag{12}
]]></tex-math></disp-formula>
Let ε = η_max/2. If (6) and (7) are satisfied, then (12) holds.</p>
        <p>Suppose that the learning process is infinite, i.e., for all r ShiftedMultithreshold is unable to
produce v_j ∈ V(B) for some j ≤ r. Then the sequence (v_j) satisfies the Fejér condition (9) for
the ball B₁(v*) and, hence, is convergent by a well-known fact from the theory of linear normed spaces
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].</p>
        <p>Consider the increment vectors Δv_j = v_j − v_{j−1}. It follows from (6), (7), (10) and (12) that
<disp-formula><tex-math><![CDATA[
\Delta v_j = \Bigl(\eta_1(j) + \frac{\eta_2(j)}{d + |b_j \cdot v_{j-1}|}\Bigr) (d + |b_j \cdot v_{j-1}|)\, b_j = \bigl(\eta_1(j)(d + |b_j \cdot v_{j-1}|) + \eta_2(j)\bigr)\, b_j.
]]></tex-math></disp-formula>
It implies
<disp-formula><tex-math><![CDATA[
\|\Delta v_j\| = \bigl(\eta_1(j)(d + |b_j \cdot v_{j-1}|) + \eta_2(j)\bigr) \|b_j\| \ge \eta_{\min} \min\bigl(d + |b_j \cdot v_{j-1}|,\, 1\bigr) \ge \eta_{\min} \min(d, 1).
]]></tex-math></disp-formula></p>
        <p>It follows from the last inequality that the increment vectors do not go to zero as j goes to infinity.
Therefore, the sequence (v_j) is not convergent. This apparent contradiction completes the proof in
the case λ = 1.</p>
        <p>Note that in the case λ = 1 the convergence conditions (without the proof) appeared for the first time
in [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ].</p>
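        <p>For illustration, a schedule of the form (6) with constant terms η_1 and η_2 can be written as a small closure (a sketch; the parameter values are placeholders):</p>
        <preformat><![CDATA[
def make_eta(eta1=2.0, eta2=0.3, d=0.05):
    """Schedule satisfying (6)-(7) with constant eta_1(j) = eta1 and
    eta_2(j) = eta2: here one may take eta_min = eta1 + eta2 and
    eta_max = eta2, so (7) requires 0 <= eta1 <= 2 and eta2 > 0.
    `d` must match the shift used by the learning algorithm."""
    def eta(j, s):
        # eta(j) = eta_1(j) + eta_2(j) / (d + |b_j . v_{j-1}|)   -- eq. (6)
        return eta1 + eta2 / (d + abs(s))
    return eta
]]></preformat>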
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <p>In the above theoretical study of the issues related to the algorithm convergence and finiteness, the
range of feasible values for the learning rate hyperparameter was found, but the proved Proposition does
not suggest which values are preferable in order to ensure faster convergence.</p>
      <p>This question can be clarified by an empirical study of the dependence of ShiftedMultithreshold on
different strategies for choosing the values of the hyperparameters used in this algorithm.</p>
      <p>
        During the simulation, k-threshold neurons were trained for different 2 ≤ k ≤ 10. This range of values
was chosen in accordance with the recommendation from the paper [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. A randomly generated
k-threshold neuron was used to produce a partition (A_0, A_1, …, A_k) of the set A containing M uniformly
distributed points from the n-dimensional hypercube [−1, 1]^n, where the power denotes the n-fold
Cartesian product. Two series of experiments were performed. In the first series the whole set
A was used as the training set. In the second, A was randomly split into a training set and a test set, where
the test set contained 20% of all points. The first series of experiments was more intensive. Only this series
was used to determine the values of the last four hyperparameters of the algorithm, which were then used in
the second series. For this reason, most of this section is devoted to the description of the
experiments of the first type.
      </p>
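      <p>A sketch of this data-generation step is given below (in Python; the way the random neuron (w, t) is sampled is an assumption, since the paper does not specify it):</p>
      <preformat><![CDATA[
import numpy as np

def random_partition(M=1024, n=50, k=3, rng=None):
    """Generate the ordered partition (A_0, ..., A_k) of M points drawn
    uniformly from [-1, 1]^n, labeled by a random k-threshold neuron.
    The sampling of (w, t) below is an assumption."""
    rng = rng or np.random.default_rng()
    X = rng.uniform(-1.0, 1.0, size=(M, n))
    w = rng.standard_normal(n)
    s = X @ w
    # draw k ordered thresholds from the empirical range of w . x,
    # so that every class is likely to be non-empty
    t = np.sort(rng.uniform(s.min(), s.max(), size=k))
    y = np.searchsorted(t, s, side="right")   # y = i  iff  t_i <= w.x < t_{i+1}
    return [X[y == i] for i in range(k + 1)]
]]></preformat>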
      <p>Note that the value of r was not studied in the first series of experiments, and the constant upper
bound of 100,000 was used for the number of learning epochs in the first experiment. Reaching this
bound during learning was considered a signal that the algorithm failed to train the neuron to solve the
given task. In the next experiments r was reduced to 1,000.</p>
      <p>The final value of the counter j, which corresponds to the total number of corrections performed
during the learning process, was considered as the performance metric. Therefore, the further
statement that X performed better than Y by 30% means that the number of corrections in the
case of X was lower by 30% than the number of corrections in the case of Y.</p>
      <p>The general tendency during the first series of experiments remained the same for every value of
k from the above-mentioned range. For this reason, results will be presented only for a single value
of k, namely, k = 3. This implies that 4-valued units will be considered.</p>
      <p>The dimension of the feature space n was chosen to be 50. Different sizes of the training set were
tried. In the next section results for M from {256, 512, 1024, 2048, 4096} will be presented. Random
sampling was used. Each experiment was repeated 110 times (more precisely, 11 random
partitionings were performed for each of 10 randomly chosen sets A), and the 5 best and 5 worst results
were rejected in order to avoid outliers. The remaining 100 results were averaged. The obtained
means will be analyzed in the next section.</p>
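      <p>This averaging protocol amounts to a trimmed mean (a minimal sketch):</p>
      <preformat><![CDATA[
import numpy as np

def trimmed_mean(results, cut=5):
    """Drop the `cut` best and `cut` worst of the repeated runs and
    average the rest, as in the protocol above (110 -> 100 results)."""
    r = np.sort(np.asarray(results, dtype=float))
    return r[cut:len(r) - cut].mean()
]]></preformat>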
      <p>
        The first experiment consists in the estimation of the influence of the value of the binary
hyperparameter λ on the performance of the learning algorithm with a random initial approximation (more
precisely, random numbers uniformly distributed in (−1, 1) were used as the coordinates of v_0). The
constant learning rate η = 2 was used for both possible values of the hyperparameter λ. This value is
suggested by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] as recommended in the case of relaxation-based algorithms. Note that the application
of any constant learning rate means that for all j, η_1(j) = η and η_2(j) = 0, because otherwise it follows
from (6) that η(j) depends on j. The case λ = 0 corresponds to the fixed increment used in classical
perceptron-like models. The opposite case λ = 1 leads to relaxation-like learning in which the
increment in the jth correction is adaptive and depends on the classification error on the current
training pattern b_j measured by |b_j · v_{j−1}| in (10). It was observed that the relaxation-like approach to
the learning considerably outperformed the perceptron-like one in the online learning of 4-valued
3-threshold neurons. A grid search over the segment [1, 2] was also performed with the step 0.01 in the
case λ = 0, but it did not significantly change the difference of learning times for both
above-mentioned types of the increment (actually, the change of η influenced the performance in the
relaxation-like mode much more strongly than in the perceptron-like one). Therefore, the perceptron-like
approach to online learning was rejected, the value λ = 1 was fixed, and, consequently, only
relaxation-like online learning was studied in all subsequent experiments.
      </p>
      <p>During the second simulation the choice of the initial approximation was considered alongside the
constant learning rate. The learning with a randomly chosen v_0 was compared with the optimized initial
approximation v̄_0 = (b_1 + … + b_m)/m, where m = |B|. Both the idea and the justification of such an
approximation are from [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The idea of the use of v̄_0 is suggested by the fact that its coordinates have signs and
an ordering similar to the same characteristics of the coordinates of a feasible solution from the set V(B).
For all considered M, the results for v̄_0 were on average at least twice as good as for a random v_0. For
this reason, only v̄_0 was used further.</p>
      <p>The next simulation was devoted to the search for appropriate values of the first term η_1(j) of the
learning rate in (6). In order to reduce the impact of the second term in (6), η_2(j) = 0 was used here. A
quite simple constant schedule strategy was used, i.e., η was assigned to η_1(j) (and, consequently, to η(j))
in every correction step. No η tried outside the segment [1.23, 2.31] was successful, and only
η ∈ [1.5, 2.2] performed well. For this reason, a grid search on [1.5, 2.2] with the step 0.001 was used to
determine η_1^(M), empirically the best value for the given M. Further simulation was devoted to the
search for appropriate values of the second term η_2(j) of the learning rate in (6). It was observed that
only constant η_2(j) = η_2 ∈ [0.1, 0.5] provided good performance. Another grid search was performed
over two dimensions to find (η_1^(M), η_2^(M)), empirically the best pair for the given M. In the next
simulation the impact of the value of the shift hyperparameter d was studied. The learning rate was
calculated by using (6) and the pairs (η_1^(M), η_2^(M)) from the previous experiment. The last
simulation was the second series of experiments. Previously found values of the hyperparameters were
used to solve the classification task on the split dataset in order to estimate the generalization ability of
the multi-valued k-threshold neuron in the case of different 2 ≤ k ≤ 10.</p>
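      <p>Putting the pieces together, an end-to-end run of the first-series setup might look as follows (a sketch reusing the hypothetical functions from the previous sections; the hyperparameter values follow the recommendations of section 5):</p>
      <preformat><![CDATA[
import numpy as np

# Assumes random_partition, reduce_set and shifted_multithreshold
# from the sketches above.
partition = random_partition(M=1024, n=50, k=3)
B = reduce_set(partition)
v0 = B.mean(axis=0)                    # optimized initial approximation
eta = lambda j, s: 2.0                 # constant rate in the good range
w, t = shifted_multithreshold(B, r=1000, lam=1, v0=v0,
                              eta=eta, d=0.0, k=3)
]]></preformat>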
    </sec>
    <sec id="sec-5">
      <title>5. Results and discussion</title>
      <p>Consider the results that were obtained in the above-mentioned experiments. Table 1 contains comparative
results of the perceptron-like (λ = 0) and relaxation-like (λ = 1) learning modes, respectively, in the
case of the learning of a 3-threshold neuron with the constant learning rate η = 2.</p>
      <p>It is evident from Table 1 that the learning mode has a great impact on the performance.</p>
      <p>The single adaptive correction (10) allows moving the vector v into the right half-space in accordance
with the violated inequality b_j · v_{j−1} ≤ 0, instead of the numerous fixed increments in the direction of b_j, which
are necessary for the perceptron. Thus, we obtained empirical proof of the significant advantage of the
relaxation approach in the online learning of k-threshold neurons. Consider the results concerning the
impact of the optimized initial approximation. They are presented in Table 2, also in the case of the
learning of a 3-threshold neuron with η = 2. It is evident from Table 1 and Table 2 that the optimized
initial approximation can at least halve the number of corrections. Thus, it provides an important
improvement of the learning process.</p>
      <p>Consider the performance results in the case of different constant values of the learning rate. In Table 3
the best value of the learning rate for every dataset size is shown, which was found using the grid
search, as well as the average number of corrections for it. Consider learning in the more general case
when constant pairs (η_1, η_2) were used. The corresponding results are presented in Table 4.</p>
      <p>The final experiment of the first series consists in the study of the role of the hyperparameter d in the
learning. It was observed that for all datasets the best performance was obtained using d = 0.
Moreover, in the case d ≥ 0.1 the learning became considerably slower. Consider the second series
of experiments. Unlike the first series, performed only for k = 3, the second series consisted in
the learning of a multi-valued k-threshold neuron for all 2 ≤ k ≤ 10 using λ = 1, the optimized initial
approximation calculated only on the proper training set, and the values of (η_1, η_2) from Table 4. The
shift was not performed. Table 5 contains the average percentage accuracy of a trained k-threshold
neuron measured on the test set for every combination of the dataset size and the number
of thresholds.</p>
      <p>The learning mode defined by λ is extremely significant for the performance of online learning,
and its proper value λ = 1 decreases the number of corrections tenfold.</p>
      <p>The initial approximation also matters. The use of the improved approximation requires
additional calculations, but it can reduce the number of corrections by a factor of 2 to 4 compared
to a random initial approximation.</p>
      <p>A constant learning rate in the range 1.93 ≤ η ≤ 2.05 is a good choice for relaxation learning.
The best values of the second term in (6) were quite low compared to the first term.
The variation of the values of η_1 and η_2 is not so important and could improve the
performance by 6–12%.</p>
      <p>The generalization ability of the k-threshold neuron decreases with the growth of k.
The shift hyperparameter d has mainly theoretical importance as a guarantee of finite
learning. Its practical application is limited to small values, whereas larger d can significantly
slow down the learning process.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>
        The modification of the online learning algorithm for multi-valued multithreshold neurons has been
considered. It uses an additional preprocessing step as well as a new shift parameter d ensuring the
convergence. Conditions have been proved for the first time that guarantee the finiteness of the
learning process. The influence of the algorithm hyperparameters on the behavior of the learning
algorithm has also been studied. Suggestions were stated concerning the preferred values of the
hyperparameters, which provided better performance during experiments on synthetic datasets.
Simulation results proved the advantage of the relaxation learning mode over the perceptron-like one and
testified that the proposed algorithm is able to greatly outperform the procedure of
Obradović and Parberry [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The use of the optimized initial solution also has a great positive impact on
the performance. Despite the fact that the quantitative characteristics of the improvement presented in
the fifth section are not absolute and may vary depending on the dimension of the feature space, the
content and the size of a dataset, as well as other factors, the proposed recommendations could be useful
for ML projects employing NNs designed using multi-valued multithreshold neurons.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.K.</given-names>
            <surname>Venkatesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.T.</given-names>
            <surname>Ramakrishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Havrysh</surname>
          </string-name>
          ,
          <article-title>High-Performance artificial intelligence recommendation of quality research papers using effective collaborative approach</article-title>
          ,
          <source>Systems 11.2</source>
          (
          <year>2023</year>
          ):
          <fpage>81</fpage>
          . doi:10.3390/systems11020081.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Géron</surname>
          </string-name>
          ,
          <article-title>Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems</article-title>
          , 3rd ed.,
          <string-name>
            <surname>O'Reilly Media</surname>
          </string-name>
          , Sebastopol, CA,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.H.</given-names>
            <surname>Houssein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.E.</given-names>
            <surname>Hosney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.M.</given-names>
            <surname>Emam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.M.</given-names>
            <surname>Younis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.A.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.M.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <article-title>Soft computing techniques for biomedical data analysis: open issues and challenges</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>56</volume>
          (
          <year>2023</year>
          ):
          <fpage>2599</fpage>
          <lpage>2649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Izonin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tkachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Mitoulis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Faramarzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tsmots</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mashtalir</surname>
          </string-name>
          ,
          <article-title>Machine learning for predicting energy efficiency of buildings: a small data approach</article-title>
          , in: Procedia Computer Science, volume
          <volume>231</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>72</fpage>
          <lpage>77</lpage>
          . doi:10.1016/j.procs.2023.12.173.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>F.</given-names>
            <surname>Geche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kotsovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Geche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vashkeba</surname>
          </string-name>
          ,
          <article-title>Synthesis of time series forecasting scheme based on forecasting models system</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>1356</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>121</fpage>
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Tkachenko</surname>
          </string-name>
          ,
          <article-title>An integral software solution of the SGTM neural-like structures implementation for solving different Data Mining tasks</article-title>
          , in: S.
          <string-name>
            <surname>Babichev</surname>
          </string-name>
          , V. Lytvynenko (Eds.),
          <source>Lecture Notes on Data Engineering and Communications Technologies</source>
          , volume
          <volume>77</volume>
          , Springer, Cham,
          <year>2022</year>
          , pp.
          <fpage>696</fpage>
          <lpage>713</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Havryliuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hovdysh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tolstyak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chopyak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kustra</surname>
          </string-name>
          ,
          <article-title>Investigation of PNN optimization methods to improve classification performance in transplantation medicine</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3609</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>338</fpage>
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vladov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yakovliev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bulakh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          ,
          <article-title>Neural network approximation of helicopter turboshaft engine parameters for improved efficiency</article-title>
          ,
          <source>Energies</source>
          <volume>17</volume>
          .9 (
          <year>2024</year>
          ):
          <fpage>2233</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Haykin</surname>
          </string-name>
          ,
          <source>Neural Networks and Learning Machines</source>
          , 3rd ed.,
          Pearson Education
          , Upper Saddle River, NJ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuchanskyi</surname>
          </string-name>
          et al.,
          <article-title>Gender-related differences in the citation impact of scientific publications and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Anthony</surname>
          </string-name>
          ,
          <article-title>Learning multivalued multithreshold functions</article-title>
          ,
          <source>CDAM Research Report no. LSECDAM-2003-03</source>
          , London School of Economics,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Amirgaliyev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kuchanskyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Andrashko</surname>
          </string-name>
          ,
          <article-title>Building a dynamic model of profit maximization</article-title>
          ,
          <source>Eastern-European Journal of Enterprise Technologies</source>
          ,
          <volume>2</volume>
          .4 (
          <issue>116</issue>
          ) (
          <year>2022</year>
          ):
          <fpage>22</fpage>
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          et al.,
          <article-title>Sentiment analysis of information space as feedback of target audience for regional e-business support in Ukraine</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3426</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>488</fpage>
          <lpage>513</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Rajput</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sreenivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papailiopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karbasi</surname>
          </string-name>
          ,
          <article-title>An exponential improvement on the memorization capacity of deep threshold networks</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>16</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>12674</fpage>
          <lpage>12685</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Z.-G.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-L.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <article-title>Unitary learning in conditional models for deep optics neural networks</article-title>
          ,
          <source>in: Proceedings of SPIE The International Society for Optical Engineering</source>
          , volume
          <volume>12565</volume>
          ,
          <year>2023</year>
          , no.
          <volume>1256543</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kotsovsky</surname>
          </string-name>
          ,
          <article-title>Hybrid 4-layer bithreshold neural network for multiclass classification</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3387</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>212</fpage>
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Takiyama</surname>
          </string-name>
          , Multiple threshold perceptron,
          <source>Pattern Recognition 10.1</source>
          (
          <year>1978</year>
          ):
          <fpage>27</fpage>
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , X. Ma,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Analysis of nonseparable property of multi-valued multi-threshold neuron</article-title>
          ,
          <source>in: Proceedings of 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence)</source>
          , Hong Kong, China,
          <year>2008</year>
          , pp.
          <fpage>413</fpage>
          -
          <lpage>419</lpage>
          , doi:10.1109/IJCNN.2008.4633825.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Haring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Diephuis</surname>
          </string-name>
          ,
          <article-title>A realization procedure for multithreshold threshold elements</article-title>
          ,
          <source>IEEE Transactions on Electronic Computers, EC-16.6</source>
          (
          <year>1967</year>
          ):
          <fpage>828</fpage>
          -
          <lpage>835</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kotsovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batyuk</surname>
          </string-name>
          ,
          <article-title>Multithreshold neural units and networks</article-title>
          ,
          <source>in: Proceedings of IEEE 18th International Conference on Computer Sciences and Information Technologies</source>
          ,
          CSIT
          <year>2023</year>
          , Lviv, Ukraine,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          , doi:10.1109/CSIT61576.2023.10324129.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Takiyama</surname>
          </string-name>
          ,
          <article-title>The separating capacity of a multithreshold threshold element</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence. PAMI-7</source>
          .1 (
          <year>1985</year>
          ):
          <fpage>112</fpage>
          <lpage>116</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ashenayi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vogh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.R.</given-names>
            <surname>Sayeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karimi</surname>
          </string-name>
          , T. Baradaran,
          <article-title>Multiple threshold perceptron using sinusoidal function</article-title>
          ,
          <source>International Journal of Modelling and Simulation 12.1</source>
          (
          <year>1992</year>
          ):
          <fpage>22</fpage>
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kotsovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batyuk</surname>
          </string-name>
          ,
          <article-title>Feed-forward neural network classifiers with bithreshold-like activations</article-title>
          ,
          <source>in: Proceedings of IEEE 17th International Scientific and Technical Conference on Computer Sciences and Information Technologies</source>
          ,
          CSIT
          <year>2022</year>
          , Lviv, Ukraine,
          <year>2022</year>
          , pp.
          <fpage>9</fpage>
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Olafsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Abu-Mostafa</surname>
          </string-name>
          ,
          <article-title>The capacity of multilevel threshold function</article-title>
          ,
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          <volume>10</volume>
          .2 (
          <year>1988</year>
          ):
          <fpage>277</fpage>
          <lpage>281</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>N.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. M.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Z. Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Using three layer neural network to compute multi-valued functions</article-title>
          ,
          <source>in 2007 Fourth International Symposium on Neural Networks, June 3-7</source>
          ,
          <year>2007</year>
          , Nanjing, P.R. China, Part III
          , LNCS
          <volume>4493</volume>
          ,
          <year>2007</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>V.K.</given-names>
            <surname>Venkatesan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Izonin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Periyasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Indirajithu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batyuk</surname>
          </string-name>
          , M.T. Ramakrishna,
          <article-title>Incorporation of energy efficient computational strategies for clustering and routing in heterogeneous networks of smart city</article-title>
          ,
          <source>Energies</source>
          <volume>15</volume>
          .20 (
          <year>2022</year>
          ):
          <fpage>7524</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Andrashko</surname>
          </string-name>
          et al.,
          <article-title>A method for assessing the productivity trends of collective scientific subjects based on the modified PageRank algorithm</article-title>
          ,
          <source>Eastern-European Journal of Enterprise Technologies</source>
          ,
          <volume>1</volume>
          .4 (
          <issue>121</issue>
          ) (
          <year>2023</year>
          ):
          <fpage>41</fpage>
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kotsovsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Batyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Voityshyn</surname>
          </string-name>
          ,
          <article-title>On the size of weights for bithreshold neurons and networks</article-title>
          ,
          <source>in: Proceedings of IEEE 16th International Conference on Computer Sciences and Information Technologies</source>
          ,
          CSIT
          <year>2021</year>
          , Lviv, Ukraine,
          <year>2021</year>
          , volume
          <volume>1</volume>
          , pp.
          <fpage>13</fpage>
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>E.</given-names>
            <surname>Baum</surname>
          </string-name>
          ,
          <article-title>On the capabilities of multilayer perceptrons</article-title>
          ,
          <source>Journal of Complexity 4.3</source>
          (
          <year>1988</year>
          ):
          <fpage>193</fpage>
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kotsovsky</surname>
          </string-name>
          ,
          <article-title>Learning of multi-valued multithreshold neural units</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3688</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>39</fpage>
          <lpage>49</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>