<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Analysis of the Application of Stochastic Gradient Descent to Detect Network Violations</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Nataliya</forename><surname>Boyko</surname></persName>
							<email>nataliya.i.boyko@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Iryna</forename><surname>Khomyshyn</surname></persName>
							<email>khomyshyni@gmail.com</email>
						</author>
						<author>
							<persName><forename type="first">Natalia</forename><surname>Ortynska</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Viktoria</forename><surname>Terletska</surname></persName>
							<email>viktoriia.o.terletska@lpnu.ua</email>
						</author>
						<author>
							<persName><forename type="first">Oksana</forename><surname>Bilyk</surname></persName>
						</author>
						<author>
							<persName><forename type="first">Oleksandra</forename><surname>Hasko</surname></persName>
							<email>oleksandra.i.hasko@lpnu.ua</email>
						</author>
						<author>
							<affiliation key="aff0">
								<orgName type="institution">Lviv Polytechnic National University</orgName>
								<address>
									<addrLine>Profesorska Street 1</addrLine>
									<postCode>79013</postCode>
									<settlement>Lviv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff1">
								<orgName type="department">International Conference on Computational Linguistics and Intelligent Systems</orgName>
								<address>
									<addrLine>May 12-13</addrLine>
									<postCode>2022</postCode>
									<settlement>Gliwice</settlement>
									<country key="PL">Poland</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Analysis of the Application of Stochastic Gradient Descent to Detect Network Violations</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B65006E2367B6DC6FFF773ED88A0D14B</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T12:56+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Machine Learning</term>
					<term>Artificial Intelligence</term>
					<term>stochastic gradient descent</term>
					<term>stochastic algorithms</term>
					<term>negative log likelihood</term>
					<term>convolutional neural network</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Over the past decade, data sizes have grown faster than processor speeds. In this context, the capabilities of statistical machine learning methods are limited by computational time, not sample size. A more accurate analysis reveals qualitatively different trade-offs in the case of small and large learning problems. This work considers the occurrence of a large-scale case involving the basic algorithm's computational complexity and the implementation of optimization in non-trivial ways. Optimization algorithms such as stochastic gradient descent, which demonstrate exceptional performance for large-scale problems, are considered. The work compares the algorithms of gradient descent, stochastic gradient, and standard deterministic gradient descent. Here are given ways to apply them in machine learning and considered a formal description of the operation of gradient descent, stochastic gradient, and standard deterministic gradient descent algorithms. Here is analyzed the difference in the process of algorithms. The advantages and disadvantages of each of them are given. A comparison of both approaches is considered: deterministic and Stochastic algorithms. This work has analyzed the effectiveness of these two approaches. Convolutional (convolutional) neural networks are considered, which, unlike conventional deep neural networks, use convolutions to find patterns in images, thereby improving prediction in image recognition problems. A model of a convolutional neural network is presented. Here are analyzed the parameters of training the input sample and its testing. As a result of using a simulation of the usual gradient descent, when the weight is updated only after viewing all the elements, the value of the loss function is considered. The analysis of the selected problem and model also justifies the results of its use.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>As a sub-branch of artificial intelligence, machine learning is actively developing in various areas of technology and human life. It is used in several computational problems in which the development and programming of explicit algorithms with good performance are difficult or impossible. Examples of its applications include email filtering, detecting network criminals or malicious insiders, Optical Character Recognition, ranking training, computer vision, etc.</p><p>For so-called" training", machine learning often uses statistical techniques and algorithms, one of which is the gradient descent method.</p><p>However, when training the model, a massive sample of data is most often used, as a result of which the standard deterministic gradient descent does not show the best time results. A modification of deterministic gradient descent is stochastic gradient descent, which provides accelerated learning without significant loss of efficiency.</p><p>This work aims to analyze and reveal the operation of stochastic gradient descent, explore the application of this algorithm in machine learning training, and compare it with a deterministic algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Review of literature sources</head><p>Almost all machine learning problems are optimization problems, which is to find the extremum of the objective function <ref type="bibr" target="#b0">[1]</ref>. Finding a model and the correct objective function is the first step in solving the machine learning problem. Then you need to define analytical and numerical methods that will help solve the optimization problem <ref type="bibr" target="#b1">[2]</ref>.</p><p>The paper discusses first-order methods, namely methods that use the gradient of a function. In the author's work <ref type="bibr" target="#b2">[3]</ref>, the general optimization method is the usual gradient descent. The author of <ref type="bibr" target="#b4">[5]</ref> describes a formal description and description of the algorithm for solving the gradient descent method. In his work <ref type="bibr" target="#b4">[5]</ref>, the parameters are updated iteratively in the direction of the gradients of the objective function. The mechanism of the gradient descent method is described in the paper <ref type="bibr" target="#b1">[2]</ref>. The gradient descent method is easy to implement. The solution is a global minimum or maximum when the objective function is convex <ref type="bibr" target="#b5">[6]</ref>. It often converges more slowly if the variable is closer to the optimal solution, which requires more thorough iterations <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b5">6]</ref>. Due to the high computational complexity in each iteration, stochastic gradient descent (SGD) <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b18">19]</ref> is proposed for a large amount of data. Therefore, the authors <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b9">10]</ref> present an analysis of the application of this algorithm in the process of machine learning training and its comparison with a deterministic algorithm.</p><p>The manually adjusted learning rate strongly affects the SGD effect. This is a complex problem of setting the appropriate learning rate value. It is described in the authors' works <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b4">[5]</ref><ref type="bibr" target="#b5">[6]</ref><ref type="bibr" target="#b6">[7]</ref><ref type="bibr" target="#b7">[8]</ref>. The study uses popular machine learning libraries for the Python language (such as PyTorch, TensorFlow, and SkLearn). The standard algorithm is not even implemented, but only its stochastic version exists <ref type="bibr" target="#b11">[12]</ref><ref type="bibr" target="#b12">[13]</ref><ref type="bibr" target="#b13">[14]</ref><ref type="bibr" target="#b14">[15]</ref>.</p><p>Using a simulation of the usual gradient descent described in <ref type="bibr" target="#b0">[1]</ref><ref type="bibr" target="#b1">[2]</ref><ref type="bibr" target="#b2">[3]</ref><ref type="bibr" target="#b3">[4]</ref><ref type="bibr" target="#b15">16]</ref>, the model's accuracy becomes very low when the weight is updated only after viewing all the elements. It is described that with this approach, gradients are updated more accurately. In this paper, the authors conduct a study with many epochs, where gradient descent shows the best accuracy of training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Materials and methods</head><p>Gradient descent is one of the most popular optimization algorithms and the most common way to optimize neural networks. At the same time, each modern deep learning library contains implementations of various algorithms for optimizing gradient descent. However, these algorithms are often used as black-box optimizers because it is difficult to find practical explanations for their strengths and weaknesses.</p><p>The study aims to provide an analysis of the behaviour of various algorithms for optimizing Gradient Descent, which will help to apply them. First, you should consider different options for gradient descent. The next step is to summarize the problems during training. Therefore, this work presents the most common optimization algorithms that demonstrate their motivation to solve these problems and lead to the conclusion of update rules.</p><p>Gradient descent is a way to minimize an objective function</p><formula xml:id="formula_0">) ( J that is parameterized through model parameters d R </formula><p> by updating parameters in the opposite direction of the objective function</p><formula xml:id="formula_1">) (  J </formula><p>gradient to parameters. The learning rate  determines the size of the steps that need to be taken to reach the (local) minimum. When analyzing, you need to monitor the direction of inclination of the surface down, which is created by the objective function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Gradient Descent</head><p>Gradient descent is an iterative first-order optimization algorithm in which steps are taken to find the local minimum of a function proportional to the opposite value of the process's gradient (or approximate gradient) at the current point.</p><p>In short, since the gradient shows us the direction and speed of movement of the function relative to all arguments, to find the local minimum, we must move opposite to the action of the process (if it increases, we move it to decrease, if it declines, we move even more to the minimum).</p><p>For an intuitive understanding of gradient descent, we can imagine the movement of the ball (arguments) along with a parabolic bucket (function) in the direction of the bottom (minimum). Still, our algorithm will perform its role (Figure <ref type="figure" target="#fig_0">1</ref>). When the ball is on the left side (i.e., the function decreases), the algorithm must "push" the ball forward to reach the bottom. When it is, on the right side (the function increases), as shown in Figure <ref type="figure" target="#fig_0">1</ref>, then the algorithm, on the contrary, "pulls" it down. This movement continues step by step until the ball reaches the bottom. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Standard deterministic gradient descent</head><p>Deterministic refers to an algorithm that returns the same result each time, regardless of the number of its runs.</p><p>Normal gradient descent can return the same values because it is performed on the total amount of data that comes to its input (that is, it works with the same input data and also serves the same steps on it without any accidents, which gives the same result at the end).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Stochastic gradient descent</head><p>Stochastic refers to an algorithm that returns different values each time, i.e. the algorithm contains a certain stochastic (random) relationship.</p><p>Different output values occur because stochastic gradient descent is not performed on the entire volume of input data but stochastically selects only a certain predefined number of records for each epoch. Such a set of individual records is called a batch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Application in machine learning</head><p>Gradient descent is widely used in machine learning because it allows the model to learn from its mistakes and inaccuracies.</p><p>When training the model, a loss or error function is introduced, which provides information about the difference between the model's expected results. Since we cannot change the input data from the dataset, we must optimize the coefficients of our model.</p><p>The gradient of this function shows us how each coefficient affects the final error function. Applying a gradient descent to the loss function after this allows us to achieve the minimum of this function by adjusting the coefficients (i.e., make the error minimal). In this way, the configured model will enable us to use it for our tasks.</p><p>Thus, it becomes clear that gradient descent itself (stochastic and standard) is not used in machine learning. Its implementation is only an auxiliary tool for minimizing loss or error functions in machine learning algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Formal description of gradient descent algorithms 4.1. Mathematics in gradient descent</head><p>The gradient descent algorithm looks like this: 1. The calculation of new values of variables is presented in equation 1: </p><formula xml:id="formula_2"> = − = − = n i i w f grad n w w f grad w w 1 )), (<label>(</label></formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Mathematics in Stochastic Gradient Descent</head><p>The stochastic gradient descent algorithm looks like this: 1. Randomly shuffle the dataset. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Difference of algorithms</head><p>As can be seen from points 4.1 and 4.2, there is no difference in the formula for calculating new values for both algorithms. Differences appear in the approach to updating variables. While a standard gradient descent updates variables only after all n training samples have been passed, the stochastic algorithm updates these variables after passing only randomly selected models, which increases the number of updates per epoch.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Comparison of both approaches</head><p>On the one hand, a deterministic algorithm behaves more precisely since it processes a complete dataset to update data, but such training takes a very long time. On the other hand, the stochastic algorithm is not as accurate after each update, but its speed allows you to achieve the required accuracy in much less time.</p><p>Since machine learning works with massive data sets, the stochastic algorithm is much more efficient for its tasks. The standard algorithm is not even implemented in most popular machine learning libraries for the Python language (such as PyTorch, TensorFlow, and SkLearn), but only its stochastic version exists.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Recognition by convolutional neural networks trained using stochastic gradient descent 5.1. Analysis and justification of the selected task and model</head><p>Digit recognition is still a relevant topic today, which can be applied in various areas of our lives, starting from recognising bank card numbers in bank applications and ending with recognising mathematical expressions.</p><p>As you know, convolutional neural networks are best suited for image recognition, which, unlike conventional deep neural networks, use convolutions to find patterns in images that allow for improved prediction. Convolutions also compress images, which significantly increases the speed of analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Description and analysis of the selected dataset</head><p>To recognize numbers, a well-known dataset is called MNIST. MNIST (short for Mixed National Institute of Standards and Technology)a dataset of samples of handwritten writing of numbers. It is a standard proposed by the US National Institute of standards and technology to calibrate and compare image recognition methods using machine learning, primarily based on artificial neural networks. The dataset contains 60,000 images for training and 10,000 images for testing (Figure <ref type="figure" target="#fig_3">2</ref>). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Convolutional neural network model</head><p>To implement a convolutional neural network, I chose the following architecture, a diagram of which can be seen in Figure <ref type="figure" target="#fig_5">3</ref>:  </p><formula xml:id="formula_3">• input 28x28x1 • convolutional layer 24x24x10 • Max-</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Training and testing parameters</head><p>The data is randomly divided into 60,000 training images and 10,000 test images to prevent the possibility of retraining, when the model shows good results in training, but then cannot distinguish samples that were not seen during training.</p><p>As an optimizer for training, a stochastic gradient descent was chosen with the experimentally selected best learning rate learningrate = 0.01.</p><p>The training will last for 3 epochs with a different number of images in each betcha for the step of the stochastic gradient algorithm.</p><p>The negative logarithmic likelihood function</p><formula xml:id="formula_4">) log( ) ( y y L − =</formula><p>is chosen as the loss function, which is simple, but at the same time powerful in classification problems.</p><p>Accuracy testing will reflect the ratio of the number of correctly predicted images to the total number of images in the dataset</p><formula xml:id="formula_5">      = % 100 * N N Accuracy true .</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Work results</head><p>The graph shown in Figure <ref type="figure" target="#fig_6">4</ref> shows the change in the value of the loss function relative to the number of elements seen during a pure stochastic gradient descent (1 element per batch). At the same time, the accuracy of the neural network for 3 epochs was only 8%, which means that only about 800 out of 10,000 images were correctly identified by the network. The final error was about 4.2315. The graph shown in Figure <ref type="figure" target="#fig_7">5</ref> shows the change in the value of the loss function relative to the number of elements seen with a batch size of 64 elements. At the same time, the accuracy of the neural network for 3 epochs was 97%. The final error was about 0.327. The graph shown in Figure <ref type="figure" target="#fig_8">6</ref> shows the change in the value of the loss function relative to the number of elements seen with a batch size of 128 elements. At the same time, the accuracy of the neural network for 3 epochs was 96%. The final error was about 0.421. The graph shown in Figure <ref type="figure" target="#fig_9">7</ref> shows the change in the value of the loss function relative to the number of elements seen with a batch size of 256 elements. At the same time, the accuracy of the neural network for 3 epochs was 93%. The final error was about 0.686. Since there is no ready-made implementation of normal gradient descent among known libraries, we can only stimulate it with stochastic gradient descent by setting the number of elements in the batch to 60,000 (i.e., to update the algorithm weights, we will need to view 60,000 images. The graph shown in Figure <ref type="figure" target="#fig_10">8</ref> shows the change in the value of the loss function relative to the number of elements seen during the simulated normal gradient descent. At the same time, the accuracy of the neural network for 3 epochs was only 9%. The final margin of error was about 2.284. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Discussion of results</head><p>When we consider only one random element to change the weights in pure stochastic gradient descent, the loss function has the highest values. One random component is not enough to describe the trend of movement of the loss function and the corresponding gradients for all other weights. It can be seen that the value of the loss function is constantly moving up and down. Obviously, with this movement of the process, the accuracy of neural network predictions was very low (only 8%).</p><p>When we use mini-batch stochastic gradient descent for a different number of elements in a batch (64/128/256), the value of the loss function increases and decreases alternately. However, in all cases, it still moves to a minimum. The only difference with a different number of batch elements is the frequency of these movements since the weights are updated with a different number of viewed features. At the same time, mini-batching showed the highest accuracy of predictions, which was 97% for a 64-element batch.</p><p>As a result of using the standard gradient descent simulation, when the weight update takes place only after viewing all the elements, we see that the value of the loss function also falls. Still, since the weight update is so rare, this movement is prolonged, so the model's accuracy also becomes very low. However, the gradients with this approach are updated more accurately. Presumably, with more epochs, gradient descent would show better training accuracy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusions</head><p>Looking at the results of stochastic gradient descent and comparing them with the simulated average gradient, we can see that SGD performs training much faster because only three epochs are enough to achieve 97% accuracy with the correct number of elements in the batch. At the same three epochs, gradient descent reduced the value of the loss function by several tenths.</p><p>At the same time, pure GHS also shows poor accuracy results since it cannot describe the change in weights using a single random element.</p><p>Considering the previous thesis, we can conclude that using a mini-party stochastic gradient is more effective and more in demand since it performs the fastest and best training of the model.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Intuitive image of the gradient descent algorithm</figDesc><graphic coords="3,177.00,319.40,255.00,159.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Example of the appearance of images from the MNIST dataset</figDesc><graphic coords="5,227.00,398.60,155.00,155.00" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: CNN diagram</figDesc><graphic coords="6,86.20,100.60,453.40,206.80" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 1 element</figDesc><graphic coords="7,173.20,72.00,262.60,197.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 64 elements</figDesc><graphic coords="7,146.40,371.80,316.20,236.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_8"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 128 elements</figDesc><graphic coords="8,150.80,72.00,307.60,230.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_9"><head>Figure 7 :</head><label>7</label><figDesc>Figure 7: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 256 elements</figDesc><graphic coords="8,155.60,405.00,297.80,221.20" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_10"><head>Figure 8 :</head><label>8</label><figDesc>Figure 8: Graph of the ratio of the amount of losses relative to the number of elements passed by the network with a batch size of 60,000 elements (simulation of a normal gradient descent)</figDesc><graphic coords="9,159.60,72.00,290.00,216.60" type="bitmap" /></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Another approach to polychotomous classification</title>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Machine Learning</title>
				<imprint>
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">The Elements of Statistical Learning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hastie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Tibshirani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Friedman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2001">2001</date>
			<publisher>SpringerVerlag</publisher>
			<pubPlace>New York</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">A look trough methods of intellectual data analysis and their applying in informational systems</title>
		<author>
			<persName><forename type="first">N</forename><surname>Boyko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">XIth International Scientific and Technical Conference Computer Sciences and Information Technologies (CSIT)</title>
				<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="183" to="185" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Prediction Games and Arcing Algorithms</title>
		<author>
			<persName><forename type="first">L</forename><surname>Breiman</surname></persName>
		</author>
		<idno type="DOI">10.1162/089976699300016106</idno>
	</analytic>
	<monogr>
		<title level="j">Neural Computation</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">7</biblScope>
			<biblScope unit="page" from="1493" to="1517" />
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">An ensemble learning method for classification of multiple-label data</title>
		<author>
			<persName><forename type="first">P</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Pan</surname></persName>
		</author>
		<idno type="DOI">10.12733/jcis14783</idno>
	</analytic>
	<monogr>
		<title level="j">Journal of Computational Information Systems</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="4539" to="4546" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Characteristics of the quality of the regression model</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">O</forename><surname>Pererodov</surname></persName>
		</author>
		<ptr target="https://www.ereading.club/chapter.php/1002275/18/" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">SGD-QN: Careful Quasi-Newton Stochastic Gradient Descent</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gallinari</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="page" from="1737" to="1754" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The Tradeoffs of Large Scale Learning</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Neural Information Processing Systems</title>
		<imprint>
			<biblScope unit="volume">20</biblScope>
			<biblScope unit="page" from="161" to="168" />
			<date type="published" when="2008">2008</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">On-line Learning for Very Large Datasets</title>
		<author>
			<persName><forename type="first">L</forename><surname>Bottou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lecun</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Stochastic Models in Business and Industry</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="137" to="151" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Application of Ensemble Methods of Strengthening in Search of Legal Information</title>
		<author>
			<persName><forename type="first">N</forename><surname>Boyko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kmetyk-Podubinska</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Andrusiak</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-82014-5_13</idno>
		<ptr target="https://doi.org/10.1007/978-3-030-82014-5_13" />
	</analytic>
	<monogr>
		<title level="j">Lecture Notes on Data Engineering and Communications Technologies</title>
		<imprint>
			<biblScope unit="volume">77</biblScope>
			<biblScope unit="page" from="188" to="200" />
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Concentration Inequalities and Empirical Processes Theory Applied to the Analysis of Learning Algorithms</title>
		<author>
			<persName><forename type="first">O</forename><surname>Bousquet</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2002">2002</date>
			<pubPlace>Palaiseau, France</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Ecole Polytechnique</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Thèse de doctorat</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Training Linear SVMs in Linear Time</title>
		<author>
			<persName><forename type="first">T</forename><surname>Joachims</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 12th ACM SIGKDD</title>
				<meeting>the 12th ACM SIGKDD</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Evaluation of the effectiveness of different image skeletonization methods in biometric security systems</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nazarkevych</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Dmytruk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Hrytsyk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vozna</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kuza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Shevchuk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sheketa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International Journal of Sensors Wireless Communications and Control</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="542" to="552" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Lafferty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mccallum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Pereira</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2001">2001</date>
			<biblScope unit="page" from="282" to="289" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">RCV1: A New Benchmark Collection for Text Categorization Research</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">D</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">G</forename><surname>Rose</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="page" from="361" to="397" />
			<date type="published" when="2004">2004</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Trust region Newton methods for large-scale logistic regression</title>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">J</forename><surname>Lin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">C</forename><surname>Weng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">S</forename><surname>Keerthi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ICML</title>
				<meeting>ICML</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="561" to="568" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Some applications of concentration inequalities to Statistics</title>
		<author>
			<persName><forename type="first">P</forename><surname>Massart</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Annales de la Faculté des Sciences de Toulouse series</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">9</biblScope>
			<biblScope unit="page" from="245" to="303" />
			<date type="published" when="2000">2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">SVM optimization: inverse dependence on training set size</title>
		<author>
			<persName><forename type="first">S</forename><surname>Shalev-Shwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Srebro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ICML</title>
				<meeting>the ICML</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="928" to="935" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent</title>
		<author>
			<persName><forename type="first">W</forename><surname>Xu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<date type="published" when="2010">2010</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
