<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title></journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Intelligent Analysis of Sports Data in the Tasks of Forming Effective Sports Teams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Oleh Zaritskyi</string-name>
          <email>oleh.zaritskyi@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Danyil Pylypovych</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ihor Miroshnychenko</string-name>
          <email>ihor.miroshnychenko@knu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Taras Shevchenko National University of Kyiv</institution>
          ,
          <addr-line>64/13, Volodymyrska Street, Kyiv, 01601</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <volume>000</volume>
      <fpage>9</fpage>
      <lpage>0009</lpage>
      <abstract>
        <p>The article discusses topical issues of applying data mining methods to the task of forming effective sports teams. Decision trees, linear regression, the SVM model, and an ensemble method are considered for analyzing how well player characteristics correspond to their positions, which makes it possible to identify the key factors that determine the best place for a player on the field. The possibilities of artificial intelligence in supporting team coaches' decisions when building a positional strategy are analyzed; such support can significantly increase the team's overall efficiency and inform its composition for a specific model of the opponent's game. The article provides recommendations for processing sports data arrays and for choosing suitable models to optimize their efficiency and productivity. A comparative analysis of the real and predicted ratings of specific players has confirmed the high accuracy of the developed models, with MAE and RMSE estimates within 1-3% of the actual values. The authors propose, for the first time, a scientific approach aimed specifically at forming effective team compositions on the basis of statistical data, rather than only at predicting game results, as in most existing studies. The authors also provide practical recommendations for tuning models for the analysis of sports data sets, which can be used as a starting point in research on statistical data of team sports. The analysis of the developed models, the identification of the best model for such tasks, and the interpretation of the results obtained can become the basis for building automated analysis and forecasting systems for the analytical departments of sports teams.
The authors also envisage additional research on the correlation between predicted effective teams and match results, which could provide additional indirect evidence of the effectiveness of the proposed forecasting methods.</p>
      </abstract>
      <kwd-group>
        <kwd>Machine learning</kwd>
        <kwd>sports teams</kwd>
        <kwd>decision trees</kwd>
        <kwd>artificial intelligence</kwd>
        <kwd>game strategy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Correct positioning of players is a fundamental basis for success in team sports. It allows for the
maximization of the individual strengths of each athlete, creating a balanced system where everyone
plays a clearly defined role. In football, for example, positioning determines the tactical scheme of the
game, and in basketball, it allows tasks to be distributed effectively between players of different builds
and skills. Incorrect placement of athletes can lead to an imbalance in the team, ineffective use of
individual players' talents, and, as a result, a deterioration in overall results. High-level coaches pay
special attention to positional strategy, often adapting it to a specific opponent to achieve maximum
efficiency.</p>
      <p>
        As an example, the authors consider football teams, due to the maximum variability of positions
in this sport. In football, each position on the field requires a unique set of characteristics that
determine the effectiveness of the player [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For example, attackers must have high speed and the
ability to finish attacks, while midfielders need good passing technique and field vision. Defenders,
in turn, need strong tackling ability and physical strength, and goalkeepers must have
quick reactions and good handling skills.
      </p>
      <p>
        Correct positioning of players is critical for successful team play. Incorrect use of a player can lead
to a decrease in his effectiveness and negatively affect the team's results. Traditionally, coaches make
decisions about player positions based on their own experience and observations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However,
modern data analysis methods make it possible to automate this process and make it more objective by using
mathematical models and machine learning algorithms.
      </p>
      <p>
        Classification algorithms for decision-making, such as decision trees, can help determine
which player characteristics are most important for each position. This makes it possible not only to improve
the distribution of players, but also to identify potential changes in their roles on the field and to build
a team for a specific opponent's playing style [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Machine learning opens up new possibilities for analyzing sports data, making it possible to find
patterns that are difficult to detect using traditional methods. One approach is to use classification
and regression algorithms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to help predict in which position a player will perform best. Machine
learning algorithms can process large amounts of data, including player statistics, physical
characteristics, technical skills, and even performance history, and use this data to build models that
automatically determine the suitability of a player for a particular position [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ].
      </p>
      <p>Decision trees are one of the most effective methods for this task, as they provide a clear
interpretation of the decisions made. Decision trees can reveal which characteristics are most
important for each position, and how they have changed over the years. This not only helps clubs in
choosing tactics and building their squads, but also individual players in understanding their own
strengths and weaknesses.</p>
      <p>
        Decision trees are one of the most popular methods in machine learning due to their simplicity,
interpretability, and efficiency in handling large data sets with a relatively small number of variables.
For football position analysis, this method is an ideal choice for several reasons [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ]:
      </p>
      <sec id="sec-1-1">
        <title>Ease of interpretation</title>
        <p>The decision tree builds a hierarchical structure of decisions, which makes it possible to clearly understand which characteristics of the player are most significant for determining his position.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Flexibility in working with different types of data</title>
        <p>The method works well with numerical and other types of data.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Determination of the most important characteristics</title>
        <p>The decision tree model automatically selects the most relevant variables, which helps to understand which factors most affect the distribution of players.</p>
        <p>Robustness to noise in the data: the method works well even in the presence of some incorrect or missing values.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Visualization</title>
        <p>The results of the model can easily be presented in the form of a tree, which makes them accessible even to non-specialists in the field of machine learning and, accordingly, to the team administrations making decisions.</p>
        <p>
          Linear regression is a simple model for predicting a numerical value (regression) that tries to find
the best line describing the relationship between independent variables (features) and the target
variable. It is used to predict a quantitative (continuous) target variable based on one or more
independent variables. The model has the following form [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]:
        </p>
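        <p>
          <disp-formula><tex-math>\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n</tex-math></disp-formula>
          where <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> is the predicted value, <inline-formula><tex-math>\beta_0</tex-math></inline-formula> is the free term (intercept), <inline-formula><tex-math>x_1, x_2, \dots, x_n</tex-math></inline-formula> are the features, and <inline-formula><tex-math>\beta_1, \beta_2, \dots, \beta_n</tex-math></inline-formula> are the coefficients
(weights) of the model.
        </p>
        <p>
          The model is trained by minimizing the mean square error (MSE) between the predictions and the
actual values:
          <disp-formula><tex-math>\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{y}_i \right)^2</tex-math></disp-formula>
          where <inline-formula><tex-math>m</tex-math></inline-formula> is the number of observations, and <inline-formula><tex-math>y_i</tex-math></inline-formula> and <inline-formula><tex-math>\hat{y}_i</tex-math></inline-formula> are the actual and predicted values for observation <inline-formula><tex-math>i</tex-math></inline-formula>.
        </p>
        <p>The main advantages of the model are simplicity and speed of training, easy interpretation - each
weight shows the influence of a feature, and it works well with linear dependencies. However, the
model also has disadvantages, such as sensitivity to multicollinearity and outliers, and is not suitable
for complex (non-linear) dependencies.</p>
        <p>
          SVM is a powerful algorithm for classification and regression tasks that searches for a hyperplane
(or boundary) that best separates classes with a maximum margin. For nonlinear problems, kernels
are used to transform the feature space into a higher dimension. It is used for classification (mostly)
and regression, especially when the boundaries between classes are complex [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>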
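        <p>SVM searches for the optimal hyperplane that separates classes as much as possible. In the case of
a nonlinear boundary, a kernel trick (e.g., RBF kernel) is used to transform the input features into a
higher dimension. For classification tasks, an optimization problem with constraints is solved:
          <disp-formula><tex-math>\min_{w, b} \frac{1}{2} \|w\|^2 \quad \text{with} \quad y_i \left( w \cdot x_i + b \right) \geq 1</tex-math></disp-formula>
          where <inline-formula><tex-math>w</tex-math></inline-formula> is the normal vector of the hyperplane, <inline-formula><tex-math>b</tex-math></inline-formula> is the bias, and <inline-formula><tex-math>y_i</tex-math></inline-formula> is the class label of observation <inline-formula><tex-math>x_i</tex-math></inline-formula>.</p>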
        <p>The main advantages are efficiency with a large number of features, the ability to work with
nonlinear dependencies, and resistance to overfitting (especially with the right choice of
regularization). However, the model has drawbacks: it is slow on large datasets,
requires fine-tuning of parameters (e.g., C, gamma), and is less interpretable, especially when
kernels are used.</p>
        <p>Ensemble methods combine the predictions of several weaker models (e.g., decision trees) to create
a more powerful model. They are designed to improve accuracy by combining multiple models (base models):
• Bagging (Bootstrap Aggregating), e.g., Random Forest - trains many trees on different
subsamples of data and combines the results (voting or average).
• Boosting, e.g., Gradient Boosting, XGBoost - trains models sequentially, with each subsequent
one focusing on the errors of the previous one.
• Voting/Stacking - combines the predictions of several different models (SVM, decision tree,
logistic regression, etc.).</p>
        <p>
          The advantages of such models are high accuracy, the ability to work well with large amounts of
data and complex dependencies, and lower susceptibility to overfitting than individual trees. However,
such models are difficult to interpret, slower to train (especially
boosting), and can be resource-intensive [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>This study uses decision trees, linear regression, a support vector machine, and a voting-type
ensemble method to analyze the correspondence between player characteristics and position, which allows
for the identification of the key factors that determine the best position of a player on the field.</p>
        <p>
          A review of previous studies shows a wide range of applications of machine learning in sports
analytics. One study [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] focused on predicting sports injuries using deep learning, which used the
permutation entropy method to detect hidden patterns in athletes' physiological data. The achieved
accuracy of 92% demonstrates significant potential for improving sports medicine.
        </p>
        <p>
          Another study [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] analyzes the ability to predict football match results using machine learning
algorithms such as LightGBM and AdaBoost. It found that the overall prediction accuracy is about
52.8%, and predicting draws remains a particularly challenging task.
        </p>
        <p>
          Authors of the study [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] focus on automated player performance assessment, which uses
classification methods to identify key characteristics of football players. It was found that the
algorithms can partially imitate expert assessments, but have limited accuracy in predicting the
outcome of a match at 63.4%.
        </p>
        <p>
          Paper [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] is devoted to machine learning approaches in sports analysis, in particular, their
strengths and weaknesses in the analysis of sports events. It is determined that one of the key factors
in improving the accuracy of forecasts is the availability of open data.
        </p>
        <p>
          In the field of physical education, the use of XGBoost in combination with reinforcement learning
to personalize training has been studied [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. A standardized data collection system has been
introduced, which allows increasing the efficiency of the educational process by 46.42%.
        </p>
        <p>As the analysis of research in the field of sports analytics has shown, most of the works are devoted
to the prediction of match results. Despite the significant contribution of previous research in the
field of sports analytics, none of these works focuses directly on determining the key characteristics of football
players depending on their position on the field. The present study puts this
aspect in the spotlight, examining the characteristics of football players over several years (FIFA 15-21).
This approach allows us to trace trends in the requirements for different positions.</p>
        <p>To analyze the importance of characteristics, the above methods are used to help identify the
factors that most strongly influence the rating of players at each position. The authors are the first to consider
and propose an approach to automatically determine the optimal positions of players based on their
attributes, which can be useful for coaches and analysts when forming team compositions and
developing game strategies.</p>
        <p>Thus, the authors' research further developed traditional sports data analysis, which was expanded
with modern machine learning methods and offers a mathematically based approach to assessing
player positions when forming balanced team lineups from the perspective of their maximum
effectiveness.</p>
        <p>The goal of the work is to improve the quality of team formation, taking into account the
characteristics of each player, in order to increase the effectiveness of matches by making maximum use
of the strengths of specific players through the application of data mining methods.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Exploratory data analysis of sports team datasets</title>
      <p>Working with sports team datasets has several important features:
• Multidimensionality of data - in sports, diverse information is collected: physical
indicators of players, tactical parameters, match statistics, player ratings, GPS tracker data,
etc.
• The need for real-time analysis - many decisions are made directly during the game or
training based on operational data.
• Seasonality and cyclicity - the data have a pronounced periodicity (pre-season training,
regular season, playoffs), which affects their interpretation.
• Contextuality - statistics should be considered in the context of the opponent, weather
conditions, injuries, and the team's playing style.
• Interdependence of indicators - individual player data is inextricably linked to team
results.
• Use of predictive analytics - to predict results, optimize the lineup, and prevent injuries.
• The need for visualization - complex datasets often require a visual representation for
coaches and players.</p>
      <p>Modern sports teams often have separate analytics departments that work with specialized tools
to collect, process, and analyze this data to make strategic decisions.</p>
      <sec id="sec-2-1">
        <title>2.1. Dataset characteristics</title>
        <p>
          The analysis uses the FIFA 21 Complete Player Dataset, which contains detailed information about
football players in the FIFA 21 game. This dataset was created based on official statistics and player
ratings provided by the game developers EA Sports [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
        </p>
        <p>The dataset presents a large number of parameters, including the player's overall rating, his
individual characteristics, the positions he can play in, and the ratings for each of these positions.
This dataset is popular in football analytics, as it contains structured information about real-world
player attributes and could be used for various studies in the field of sports data analysis.</p>
        <p>This dataset allows for in-depth analysis of the relationship between physical, technical, and
tactical characteristics (independent variables, Table 1) of players and their performance in different
positions (dependent or target variables, Table 2). In our study, the dataset has been used to build
machine-learning models that inform the
choice of a player's position on the field.</p>
        <p>These characteristics (Table 1) have been used to build a decision tree model that allows us to
predict the target variables (Table 2).</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Data preprocessing</title>
        <p>
          Before building the model, data preprocessing was performed, taking into account the specifics of
sports team datasets to ensure their correctness and quality [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. The main steps of this stage
included:
1. Removing missing values - all records containing missing values in key characteristics
(speed, shooting, passing, dribbling, defense, fitness) were removed, as they could affect
the accuracy of the model.
2. Converting ratings to numeric format - the ratings of players at each position in the
original data were stored as text values (e.g., "90+3"), which could create problems when
using them in the analysis. To eliminate this drawback, the "+" symbol and all additional
values were removed, leaving only the numeric part.
3. Selecting relevant variables - from all available player characteristics, only those used in
the model were selected: speed, shooting, passing, dribbling, defense, fitness, as well as
player ratings at different positions.
        </p>
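        <p>The preprocessing steps listed above can be illustrated with a minimal Python sketch. It is not the authors' code (the study itself was carried out in R); the column names (pace, shooting, passing, dribbling, defending, physic, cam, cb) and the sample rows are hypothetical, and only the standard library is used.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical column names for the six key characteristics.
KEY_FEATURES = ["pace", "shooting", "passing", "dribbling", "defending", "physic"]


def parse_rating(text):
    """Return the leading numeric part of a rating string such as '90+3'."""
    match = re.match(r"\s*(\d+)", str(text))
    return int(match.group(1)) if match else None


def preprocess(rows, rating_columns):
    """Drop rows with missing key features, keep only the relevant columns,
    and convert position ratings to integers."""
    cleaned = []
    for row in rows:
        if any(row.get(name) is None for name in KEY_FEATURES):
            continue  # step 1: remove records with missing key characteristics
        record = {name: row[name] for name in KEY_FEATURES}  # step 3
        for column in rating_columns:
            record[column] = parse_rating(row.get(column))  # step 2
        cleaned.append(record)
    return cleaned


# Made-up sample rows, for illustration only.
players = [
    {"pace": 85, "shooting": 92, "passing": 91, "dribbling": 95,
     "defending": 38, "physic": 65, "cam": "90+3", "cb": "52+3"},
    {"pace": None, "shooting": 60, "passing": 62, "dribbling": 64,
     "defending": 70, "physic": 75, "cam": "63+2", "cb": "71+2"},
]
cleaned = preprocess(players, ["cam", "cb"])
```
]]></preformat>
        <p>In this sketch, the second sample row is dropped because one key characteristic is missing, and the remaining ratings are stored as plain integers.</p>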
        <p>In the original dataset, player positions are represented as text values, which can contain multiple
position options for a single player (e.g., "CM, CDM"). To correctly use this data in the model, the
following steps were performed:
1. Primary position allocation - if a player had multiple positions, the first one listed was
used, as it is the primary one in the game.
2. Position conversion into categories - positions were grouped into more generalized
categories:
• Forwards: ST, CF, LW, RW.
• Midfielders: CAM, CM, CDM, LM, RM.
• Defenders: CB, LB, RB.
3. Conversion into factor variables - for use in machine learning algorithms, positions were
coded as factor variables.</p>
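        <p>The position handling described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the authors' R code: the function names are hypothetical, and the integer coding only approximates what an R factor does.</p>
        <preformat><![CDATA[
```python
# Position groups exactly as listed in the text.
POSITION_GROUPS = {
    "Forwards": {"ST", "CF", "LW", "RW"},
    "Midfielders": {"CAM", "CM", "CDM", "LM", "RM"},
    "Defenders": {"CB", "LB", "RB"},
}


def primary_position(positions):
    """Return the first position listed in a string such as 'CM, CDM'."""
    return positions.split(",")[0].strip()


def position_group(positions):
    """Map the primary position to its generalized category."""
    primary = primary_position(positions)
    for group, members in POSITION_GROUPS.items():
        if primary in members:
            return group
    return None  # positions that the listed groups do not cover


def encode_groups(labels):
    """Map each distinct label to an integer code, similar in spirit to an R factor."""
    levels = sorted(set(labels))
    codes = {level: index for index, level in enumerate(levels)}
    return [codes[label] for label in labels], levels


groups = [position_group(p) for p in ["CM, CDM", "ST", "CB, RB"]]
codes, levels = encode_groups(groups)
```
]]></preformat>
        <p>Positions outside the three listed groups return None here; how such players are handled in the study is not stated, so the sketch leaves that decision open.</p>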
        <p>
          This conversion made the model more generalized and convenient for analyzing player
characteristics in terms of primary positions [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
        <p>Since the chosen decision tree method is not sensitive to the scale of the data, normalization was
not a mandatory step. Decision trees work by comparing and splitting data using thresholds. Key
reasons for decision trees being scale-insensitive:
• Decision trees make decisions based on comparing values (greater than/less than), not
their absolute values.
• When building a tree, the algorithm looks for optimal split points for each feature,
regardless of its scale.
• The metrics used to choose the best split (e.g., entropy, Gini index) are independent of the
scale of the data.</p>
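        <p>The scale-insensitivity argument can be checked with a small Python sketch. It searches for the best single split of one feature and repeats the search after dividing the feature by 100. The split criterion used here is the sum of squared errors, an assumption suited to rating prediction rather than the entropy or Gini index mentioned above, and the data values are made up.</p>
        <preformat><![CDATA[
```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(feature, target):
    """Return (threshold, left_indices) for the split with the lowest total SSE."""
    order = sorted(range(len(feature)), key=lambda i: feature[i])
    best = None
    for cut in range(1, len(order)):
        low, high = feature[order[cut - 1]], feature[order[cut]]
        if low == high:
            continue  # no threshold can separate equal feature values
        left = [target[i] for i in order[:cut]]
        right = [target[i] for i in order[cut:]]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, (low + high) / 2, frozenset(order[:cut]))
    return best[1], best[2]


# Made-up values: a defending score (0 to 100) and a position rating.
defending = [30, 45, 60, 75, 90]
rating = [50, 52, 70, 74, 80]

threshold, left = best_split(defending, rating)
scaled_threshold, scaled_left = best_split([x / 100 for x in defending], rating)
```
]]></preformat>
        <p>Both searches place the same players in the left branch; only the threshold changes, by the same factor as the feature. The same holds for any rescaling that preserves the order of the values.</p>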
        <p>This property makes decision trees particularly useful when working with heterogeneous data in
sports analytics, where metrics can have different units of measurement and value ranges. In this
study, all numerical characteristics such as speed, shooting, passing, dribbling, defense, and fitness
were kept in a comparable range (0 to 100), which made additional normalization unnecessary.
Therefore, all numerical values were used in their original form without scaling.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Building the models</title>
      <sec id="sec-3-1">
        <title>3.1. Model training and validation</title>
        <p>After defining the target and independent variables, the decision tree model was trained to predict
the player's rating at a certain position. The rpart() function from the rpart package in R was used to build the decision trees. It
handles classification and regression problems using the CART (Classification and Regression
Trees) algorithm.</p>
        <p>
          The model parameters (Fig. 1) were adjusted to provide an optimal
balance between accuracy and complexity: cp = 0.001 (complexity parameter) allows for deeper
trees while still pruning splits that add little, maxdepth = 10 defines the maximum depth, which limits overfitting, and minsplit = 5
sets the minimum number of observations required for node splitting [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
        <p>A linear regression model was also trained using the lm() function, and a support vector model was trained
using the svm() function. For these methods, a formula was used to define the dependent
and independent variables: the dependent variable is the position rating, and the
independent variables are the characteristics under consideration. For the support vector method,
normalization was also performed to speed up model execution, using the scale parameter of the
function. After running all three models, the ensemble voting method was applied. The soft
voting method, which averages the models' predictions, was used because hard voting does not work with regression problems.</p>
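        <p>For regression, soft voting amounts to averaging the predictions of the individual models. The following Python sketch illustrates the idea; the model names, the equal default weights, and the numbers are assumptions made for the example, not values from the study.</p>
        <preformat><![CDATA[
```python
def soft_vote(predictions_by_model, weights=None):
    """Average per-player predictions across models (optionally weighted)."""
    names = list(predictions_by_model)
    if weights is None:
        weights = {name: 1.0 for name in names}  # equal weights by default
    total = sum(weights[name] for name in names)
    n_players = len(predictions_by_model[names[0]])
    return [
        sum(weights[name] * predictions_by_model[name][i] for name in names) / total
        for i in range(n_players)
    ]


# Hypothetical predicted ratings of two players from the three base models.
predictions = {
    "tree": [78.0, 64.0],
    "linear": [80.0, 66.0],
    "svm": [82.0, 65.0],
}
ensemble = soft_vote(predictions)
```
]]></preformat>
        <p>Hard voting would pick the most frequent predicted class, which has no direct counterpart for continuous ratings; averaging is the natural substitute.</p>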
        <p>Information was obtained about the linear regression models and the support vector method
(Tab. 2, 3). The output of the linear regression model was demonstrated on the example of the fullback
position. The model explains the target variable very well (R² &gt; 0.98), and all features are significant.
The main contribution to the forecast is made by defending (0.52) and passing (0.197), and shooting
has a negative, but very small effect - this may indicate a correlation with other variables.</p>
        <p>
          The output of the support vector method model showed that the SVM model for regression has an
RBF kernel, i.e., a nonlinear relationship between the features and the target variable is expected. The
model parameters can perhaps be
optimized through cross-validation. The number of support vectors, 4071, is quite large, but for a
large dataset, as in our case, this is normal [
          <xref ref-type="bibr" rid="ref23 ref24">23, 24</xref>
          ].
        </p>
        <p>
          The model was trained on a training dataset that was 80% of the total dataset, using independent
variables (player characteristics) to predict player ratings at a given position. Accuracy was assessed
using mean absolute error (MAE) and root mean square error (RMSE), which quantified the difference
between predicted and actual player ratings. In addition, a feature importance analysis was performed
to determine which factors had the greatest impact on player performance at each position [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ].
        </p>
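        <p>The 80/20 split and the two error metrics can be sketched in Python as follows. The random seed, the function names, and the sample ratings are assumptions made for the example; the study itself used R.</p>
        <preformat><![CDATA[
```python
import math
import random


def train_test_split(rows, train_share=0.8, seed=42):
    """Shuffle a copy of the rows and split it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


train, test = train_test_split(list(range(10)))

# Made-up actual and predicted ratings for four players.
actual = [80, 75, 90, 68]
predicted = [82, 74, 87, 68]
errors = (mae(actual, predicted), rmse(actual, predicted))
```
]]></preformat>
        <p>For these made-up values the MAE is 1.5 and the RMSE is the square root of 3.5, about 1.87: the single 3-point miss weighs more heavily in the RMSE.</p>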
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Model accuracy assessment</title>
        <p>After building the above models, an analysis of the importance of the characteristics that affect the
player's rating was conducted for each position. For the decision tree, the variable importance
automatically calculated by the rpart() function and stored in the variable.importance
component was used. For linear regression, the model coefficients were extracted for each
player position, and their absolute values were taken. For the support vector
method, the model_parts() function from the DALEX library was used, which implements the permutation
feature importance method.</p>
        <p>For each built model, it was checked whether it contained important variables
(model$variable.importance). If the decision tree did not have significant branches, such a case was
ignored. Data on the importance of the characteristics were converted into a tabular format
(data.frame), where the name of the characteristic and the value of its influence were stored.</p>
        <p>In linear regression, the weight (coefficient) shows how much the target variable changes when
the characteristic changes by 1 (other things being equal). The larger the absolute value of the
coefficient, the stronger the influence of the characteristic on the forecast. This method is
straightforward and interpretable in a linear model: importance = |coefficient|.</p>
        <p>The permutation feature importance method means that each feature is shuffled randomly in turn, and the root mean
square error (RMSE) of the model output is estimated. If shuffling a feature greatly increases the
error, then the feature is important. This is a black-box method that works even for complex/nonlinear models
(like SVM with an RBF kernel).</p>
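        <p>The shuffling procedure described above can be sketched in Python. This is a simplified illustration, not the DALEX implementation: it uses a single shuffle per feature (library implementations typically repeat the shuffle and average), and the toy model, its coefficients, and the data are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
import random


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def permutation_importance(predict, rows, target, feature_names, seed=0):
    """Increase in RMSE after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = rmse(target, [predict(row) for row in rows])
    importance = {}
    for name in feature_names:
        column = [row[name] for row in rows]
        rng.shuffle(column)  # break the link between this feature and the target
        shuffled_rows = [dict(row, **{name: value}) for row, value in zip(rows, column)]
        shuffled_error = rmse(target, [predict(row) for row in shuffled_rows])
        importance[name] = shuffled_error - baseline
    return importance


def toy_model(row):
    """Hypothetical black-box model that ignores the shooting feature."""
    return 0.5 * row["defending"] + 0.2 * row["passing"]


rows = [
    {"defending": d, "passing": p, "shooting": s}
    for d, p, s in [(30, 40, 80), (50, 60, 70), (70, 55, 40), (90, 75, 30), (60, 65, 50)]
]
target = [toy_model(row) for row in rows]
importance = permutation_importance(toy_model, rows, target, ["defending", "passing", "shooting"])
```
]]></preformat>
        <p>Because the toy model never reads the shooting feature, shuffling it leaves the error unchanged and its importance is zero, whereas features the model relies on can only increase the error.</p>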
        <p>
          Each entry was accompanied by information about the player position for which the models were
built. All the results obtained were combined into a common dataset for further analysis. The
importance of the features was presented in the form of a bar chart (Fig. 4), which allows you to easily
compare the importance of each feature for different positions and each model. For this, the ggplot2
library [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] was used, which displayed the variables and their importance, grouping the data by player
positions.
        </p>
        <p>To provide a detailed overview of model performance across different field positions, a comparison
was conducted between predicted and actual ratings using the MAE and RMSE metrics for each
method. The results for all positions and models are summarized in the following tables. This
breakdown allows a side-by-side evaluation of how each model performs under different positional
requirements and highlights the stability and strengths of each approach.</p>
        <p>The accuracy of the model for different positions (Tab. 5) indicates relatively small deviations (up
to 3%) of the predicted values from the test values, which confirms good generalization.</p>
        <p>MAE indicates the average absolute difference between predicted and actual ratings. For example,
an MAE of 1.94 means that the predicted rating differs from
the actual one by an average of 1.94 points (on a 100-point scale). RMSE gives more weight to large errors
because it uses squared deviations, so RMSE is never smaller than MAE. The support vector model coped best with this task, even better than the
ensemble method. This suggests that with well-tuned parameters this model could outperform
the others by an even larger margin; even as is, it shows the best accuracy. An MAE of
up to 1.11 means that the error in the rating prediction is only about 1
point. The best prediction is for central defenders (CB, MAE = 0.73): the model most accurately
determines the rating for this position. The worst prediction is for central midfielders (CM, MAE =
1.11). Possible reasons are a greater influence of specific characteristics that the model does not take into
account, or more significant differences between players in this position. The overall error is uniform
(around 1.0 MAE), indicating the stability of the model in prediction.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Algorithm of Intelligent Team Formation and Feedback Loop</title>
        <p>To implement an effective system for intelligent team formation using machine learning methods, it
is necessary to structure the process into clear, iterative stages. This enables not only the initial
construction of the team based on player attributes, but also the refinement of the system based on
match results and tactical feedback. The proposed algorithm consists of the following steps (Fig. 2):</p>
        <p>Gathering structured and relevant information about players, including their physical, technical,
and tactical characteristics, as well as historical performance data. This step forms the foundation for
any further analytical modeling.</p>
        <p>Models are built to predict player
effectiveness at different positions based on collected characteristics. Each model is trained on
historical data and tuned for accuracy.</p>
        <p>Models are validated using metrics such as Mean Absolute Error (MAE) and Root Mean Square
Error (RMSE). The best-performing model is selected based on generalization ability and accuracy
across positions.</p>
        <p>Using the predicted ratings and selected model, a team composition is formed that best fits the
intended tactical scheme (e.g., 4-2-3-1, 4-1-4-1). The model selects players with the highest predicted
performance for each role.</p>
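        <p>The team formation step can be illustrated with a simple greedy assignment in Python. This is only one possible reading of "selects players with the highest predicted performance for each role": the greedy rule, the function name, the player names, and the ratings are assumptions made for the example, and a greedy choice is not guaranteed to maximize the total predicted rating.</p>
        <preformat><![CDATA[
```python
def form_team(predicted, formation):
    """Greedily assign the highest remaining predicted rating to each open slot.

    predicted: {player: {position: predicted rating}}
    formation: {position: number of slots}
    """
    candidates = sorted(
        (
            (rating, player, position)
            for player, ratings in predicted.items()
            for position, rating in ratings.items()
            if position in formation
        ),
        reverse=True,
    )
    open_slots = dict(formation)
    lineup = {}
    for rating, player, position in candidates:
        if player in lineup or open_slots[position] == 0:
            continue  # player already placed, or no slot left for this position
        lineup[player] = position
        open_slots[position] -= 1
    return lineup


# Hypothetical predicted ratings and a partial formation (two CB, one CDM).
predicted = {
    "Player A": {"CB": 84, "CDM": 78},
    "Player B": {"CB": 80, "CDM": 82},
    "Player C": {"CB": 79, "CDM": 70},
}
lineup = form_team(predicted, {"CB": 2, "CDM": 1})
```
]]></preformat>
        <p>Here Player B is placed at CDM, where the predicted rating is higher than at CB, which leaves the second CB slot to Player C.</p>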
        <p>The formed team participates in a game or simulation. The effectiveness of the team is observed
in real conditions, allowing the assessment of how well the predicted strengths align with actual
outcomes.</p>
        <p>After the match, the actual performance of the team and individual players is analyzed.
Discrepancies between predicted and real effectiveness are noted, and misalignments in role
suitability are identified.</p>
        <p>Based on
the post-match analysis, the models are refined by revisiting feature importance, tuning hyperparameters, and possibly updating the dataset with new
observations, thus creating a feedback loop that increases model precision over time.</p>
        <p>Such an iterative cycle ensures not only high initial accuracy but also the ability to adapt to
changing team dynamics, individual player development, and evolving strategies. The feedback loop
(from Step 6 to Step 2) is a key mechanism for continuous improvement.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Analysis of results</title>
      <p>Only decision trees offer a direct visualization, which is an advantage of this model. However,
this visualization is of limited value, because the model is less accurate than the others and yields different results.
For this reason, the visualization output is demonstrated only for the CB position, and the importance of characteristics is best seen
on the general graph for the three models.</p>
      <p>The results of the study of positions and their influence on the quality of the game are shown in Fig. 4.</p>
      <p>The analysis follows the order of the positions in the figure. The attacking midfielder position
has three important characteristics: dribbling, passing, and shooting. According to the support vector
method, every characteristic affects the rating, whereas the decision tree assigns no influence to two
characteristics, defense and physics, and linear regression assigns none to defense. That is, the more
accurate model found an influence of these two factors on the quality of the game in this position:
according to the support vector method, a player in this position must be versatile and have all of
these qualities.</p>
      <p>The central defender position has a single most important characteristic, defense, although
physics also has a notable influence. Physics carries more weight in linear regression and the decision
tree than in the support vector method, but the latter, unlike the other models, revealed at least some
need for passing skills.</p>
      <p>For the defensive midfielder position, all models rank defense as the most important indicator.
The decision tree assigns approximately equal, notable importance to physics, shooting, and passing.
Linear regression treats speed and shooting as unnecessary, passing as very influential but not the
most important, and physics as having a notable influence. The support vector method also gives
passing and physics a notable influence; it considers all other characteristics less influential, though
still desirable. Therefore, the defensive midfielder is essentially a defender who must be able to start
attacks with passes and be more versatile than the central defender.</p>
      <p>The center forward position is notable because the models disagree on which characteristic is
most important. The decision tree and the linear regression model rank dribbling first, whereas the
support vector method ranks shooting first. However, the support vector method also gives dribbling
a very strong influence, and the other two models give shooting a strong influence, so the models
differ mainly in the order of these two characteristics. All models also treat passing as important. The
remaining characteristics matter as well, although linear regression treats defense in this position as
completely unnecessary.</p>
      <p>For the central midfielder position, all three models ranked passing as the most important
characteristic and dribbling as the second most important. The support vector method again
emphasizes the versatility of players, with defense as the least important characteristic. The linear
regression model ranked the characteristics in the same order as the support vector method but gave
them lower scores, especially speed, which it considers unimportant. Among the remaining
characteristics, the decision tree gave importance only to shooting, so it differs considerably from the
other models in how it weights secondary characteristics for this position.</p>
      <p>For the full-back position, all three models agree that the most important indicator is defense.
The support vector method and the linear regression model highlight passing; they consider the other
characteristics less important but still desirable, with one exception: the linear regression model
treats shooting as unnecessary. The decision tree highlights physics and shooting, which differs
markedly from the linear regression model, and treats passing and speed as not particularly
important, which also differs from the other two models. Although the models diverge on most
indicators, they agree completely on dribbling.</p>
      <p>At the winger position, all models ranked dribbling as the most important attribute, followed by
passing and shooting, to which the support vector method gave more weight. The models thus largely
agree on the top attributes. Unlike the support vector method, however, the linear regression model
and the decision tree treated physics and defense as completely unnecessary, especially the decision
tree.</p>
      <p>The last position is striker, where all three models rank shooting as the main characteristic and
also treat dribbling, the ability to beat opponents, as important. The support vector method gives the
other characteristics approximately equal, fairly high importance. The linear regression model treats
defense and passing as completely unnecessary and speed as of little use; it highlights only physics,
more so than the others. The decision tree considers physics unimportant and speed not particularly
important, and treats the other characteristics as important.</p>
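      <p>The text does not state how the importance of characteristics was obtained for each model. One
model-agnostic option, shown here purely as an assumption, is permutation importance: one
characteristic is shuffled and the resulting growth in MAE is measured. The sketch below is in Python;
toy_cb_model, its weights, and the sample rows are hypothetical stand-ins for a fitted model and real
data.</p>
      <preformat>
```python
import random


def mean_absolute_error(actual, predicted):
    """Mean Absolute Error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, targets, feature_names, repeats=20, seed=0):
    """Average increase in MAE when one feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_absolute_error(targets, [predict(r) for r in rows])
    importance = {}
    for name in feature_names:
        increases = []
        for _ in range(repeats):
            # Shuffle a copy of one column and leave the other columns intact.
            column = [r[name] for r in rows]
            rng.shuffle(column)
            shuffled_rows = [dict(r, **{name: v}) for r, v in zip(rows, column)]
            error = mean_absolute_error(targets, [predict(r) for r in shuffled_rows])
            increases.append(error - baseline)
        importance[name] = sum(increases) / repeats
    return importance


# Hypothetical stand-in for a fitted model: a centre-back rating that depends
# on defending and physic only (weights are illustrative, not from the paper).
def toy_cb_model(row):
    return 0.7 * row["defending"] + 0.3 * row["physic"]


rows = [
    {"defending": 85, "physic": 80, "pace": 60},
    {"defending": 70, "physic": 75, "pace": 82},
    {"defending": 60, "physic": 68, "pace": 74},
    {"defending": 78, "physic": 83, "pace": 55},
    {"defending": 90, "physic": 72, "pace": 66},
]
targets = [toy_cb_model(r) for r in rows]
scores = permutation_importance(
    toy_cb_model, rows, targets, ["defending", "physic", "pace"]
)
print(scores)
```
      </preformat>
      <p>A characteristic that the model ignores, such as pace in this toy example, receives zero importance,
which corresponds to the "unnecessary" characteristics discussed above.</p>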
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>The results of modeling player ratings by position largely coincide with expert assessments. Notably,
speed appeared auxiliary rather than key in decision trees. For attackers, shooting and dribbling are
most important; for midfielders, versatility and passing; for defenders, defensive skills.</p>
      <p>Decision trees proved interpretable but less accurate (MAE 1.76–2.07), suitable mainly for
explanatory analysis. Linear regression performed better (MAE 0.82–1.3) and is fast and simple,
though less effective for nonlinear patterns. Support Vector Machine (SVM) achieved the highest
accuracy (MAE 0.73–1.11), showing strong generalization and ability to model complex relations, but
requires more computational resources and is less interpretable. Ensemble methods yielded moderate
results and were outperformed by SVM, except for the striker position.</p>
      <p>SVM appears most promising for assisting in role selection, especially at early stages of a player's
career or when reassigning roles. For example, a player with high speed and shooting may be
well suited for a striker role, while strong passing and dribbling may indicate a midfielder.</p>
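      <p>The role-assignment example above can be illustrated with a simple scoring sketch. It does not use
a trained SVM; instead it assumes hypothetical linear role profiles (ROLE_WEIGHTS) and hypothetical
player characteristics, only to show how a comparison across roles might be organized.</p>
      <preformat>
```python
# Hypothetical role profiles: weights per characteristic, each summing to 1.
# These values are illustrative and are not taken from the paper's models.
ROLE_WEIGHTS = {
    "striker": {"pace": 0.3, "shooting": 0.5, "dribbling": 0.2},
    "midfielder": {"passing": 0.5, "dribbling": 0.3, "shooting": 0.2},
    "defender": {"defending": 0.6, "physic": 0.3, "passing": 0.1},
}


def role_scores(player, role_weights=ROLE_WEIGHTS):
    """Weighted sum of a player's characteristics for every role profile."""
    return {
        role: sum(weight * player.get(feature, 0) for feature, weight in weights.items())
        for role, weights in role_weights.items()
    }


def suggest_role(player, role_weights=ROLE_WEIGHTS):
    """Return the role with the highest weighted score."""
    scores = role_scores(player, role_weights)
    return max(scores, key=scores.get)


# Hypothetical players described by the six characteristics used in the text.
fast_finisher = {
    "pace": 90, "shooting": 85, "passing": 60,
    "dribbling": 70, "defending": 35, "physic": 65,
}
playmaker = {
    "pace": 68, "shooting": 66, "passing": 88,
    "dribbling": 84, "defending": 55, "physic": 62,
}

print(suggest_role(fast_finisher))
print(suggest_role(playmaker))
```
      </preformat>
      <p>In practice, the per-role scores would come from the trained position models rather than from fixed
weights.</p>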
      <p>It is important to note that these models rely on predefined quantitative characteristics and do not
account for tactical context, individual play style, or team cohesion. Model accuracy also varies by
position: predictions for central defenders were more precise than for midfielders.</p>
      <p>Therefore, machine learning should be seen as a complementary tool, best used in tandem with
expert tactical analysis. Continued refinement of models based on match data can support more
accurate lineup decisions and personalized strategies.</p>
      <p>Future directions include improving the integration of playing style into model tuning and
identifying statistical correlations between model-generated lineups and actual match outcomes. This
would contribute to developing automated, intelligent systems for sports analytics. Further
advancements could also reduce player acquisition costs, support talent scouting, and optimize
resource allocation across infrastructure and social initiatives.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <sec id="sec-6-1">
        <title>The authors have not employed any Generative AI tools.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Osadets</surname>
          </string-name>
          ,
          <article-title>Features of tactical training of football players</article-title>
          ,
          <source>Youth and the market</source>
          <volume>7</volume>
          (
          <year>2015</year>
          )
          <fpage>126</fpage>
          –
          <lpage>130</lpage>
          . URL: http://nbuv.gov.ua/UJRN/Mir_2015_7_26.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Khomenko</surname>
          </string-name>
          ,
          <article-title>Modern tactical constructions of the game of leading European football clubs in 2018</article-title>
          ,
          <source>SI</source>
          <volume>2</volume>
          (
          <issue>12</issue>
          ) (
          <year>2024</year>
          )
          <fpage>59</fpage>
          –
          <lpage>70</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. G.</given-names>
            <surname>Pryimak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. V.</given-names>
            <surname>Zavorotynskyi</surname>
          </string-name>
          ,
          <article-title>Decision trees and their application for classification of students of different groups of sports and pedagogical improvement</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Fedetskyi</surname>
          </string-name>
          ,
          <article-title>Method of sigmoid deviations and regression scale in modeling the technical fitness of football players</article-title>
          ,
          <source>Physical education, sport and health culture in modern society</source>
          <volume>3</volume>
          (
          <issue>35</issue>
          ) (
          <year>2016</year>
          )
          <fpage>104</fpage>
          –
          <lpage>109</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Martsenyuk</surname>
          </string-name>
          , et al.,
          <article-title>Analysis of methods for detecting disinformation in social networks using machine learning</article-title>
          ,
          <source>Cybersecurity: Education, Science, Technology</source>
          <volume>2</volume>
          (
          <issue>22</issue>
          ) (
          <year>2023</year>
          )
          <fpage>148</fpage>
          –
          <lpage>155</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Astrakhantsev</surname>
          </string-name>
          , et al.,
          <article-title>Investigation of the effectiveness of machine learning algorithms for traffic classification in mobile networks</article-title>
          ,
          <source>Problems of Telecommunications</source>
          <volume>1</volume>
          (
          <issue>30</issue>
          ) (
          <year>2022</year>
          )
          <fpage>3</fpage>
          –
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Guliyev</surname>
          </string-name>
          ,
          <article-title>Research of methods for constructing decision trees for the implementation of the random forest algorithm in the medical field</article-title>
          ,
          <source>Measuring and Computing Devices in Technological Processes</source>
          <volume>1</volume>
          (
          <year>2025</year>
          )
          <fpage>36</fpage>
          –
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. O.</given-names>
            <surname>Otroshchenko</surname>
          </string-name>
          ,
          <article-title>Business analysis and modeling of decision-making processes in project management</article-title>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolomiiets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Miroshnychenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ziuziun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Datsenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kmytiuk</surname>
          </string-name>
          ,
          <article-title>Development of Project Management Models for Information Systems to Improve Website SEO Metrics</article-title>
          , in:
          <source>XI International Scientific Conference "Information Technology and Implementation" (IT&amp;I 2024)</source>
          , Vol.
          <volume>3909</volume>
          ,
          <year>2024</year>
          , pp.
          <fpage>334</fpage>
          –
          <lpage>345</lpage>
          . URL: https://ceur-ws.org/Vol-3909/Paper_27.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Berezky</surname>
          </string-name>
          , et al.,
          <article-title>Application of the linear regression method for the analysis of quantitative characteristics of cytological images</article-title>
          ,
          <source>Ukrainian Journal of Information Technology</source>
          <volume>3</volume>
          (
          <issue>1</issue>
          ) (
          <year>2021</year>
          )
          <fpage>73</fpage>
          –
          <lpage>77</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O. I.</given-names>
            <surname>Sheremet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O. V.</given-names>
            <surname>Sadovoi</surname>
          </string-name>
          ,
          <article-title>Support Vector Method (SVM)</article-title>
          ,
          <source>Mathematical Modeling</source>
          <volume>1</volume>
          (
          <year>2013</year>
          )
          <fpage>13</fpage>
          –
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>N.</given-names>
            <surname>Koshkina</surname>
          </string-name>
          ,
          <article-title>On increasing the accuracy of JPHIDE steganogram detection</article-title>
          ,
          <source>Physical and Mathematical Modeling and Information Technologies</source>
          <volume>32</volume>
          (
          <year>2021</year>
          )
          <fpage>170</fpage>
          –
          <lpage>174</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Predicting sports injuries using machine learning: Risk factors and early warning systems</article-title>
          ,
          <source>Molecular &amp; Cellular Biomechanics</source>
          <volume>22</volume>
          (
          <year>2025</year>
          )
          <fpage>335</fpage>
          . doi: 10.62617/mcb335.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>E.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Sports Betting: An Application of Machine Learning to the Game Prediction</article-title>
          ,
          <source>Applied and Computational Engineering</source>
          <volume>132</volume>
          (
          <year>2025</year>
          )
          <fpage>104</fpage>
          –
          <lpage>118</lpage>
          . doi: 10.54254/2755-2721/2024.20626.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>An</surname>
          </string-name>
          ,
          <article-title>Enhancing Sports Team Management Through Machine Learning</article-title>
          ,
          <source>IEEE Access</source>
          (
          <year>2025</year>
          ). doi: 10.1109/ACCESS.2025.3551889.
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Obradović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kečo</surname>
          </string-name>
          ,
          <article-title>Sports Results Prediction Model Using Machine Learning</article-title>
          ,
          <source>SAR Journal - Science and Research</source>
          (
          <year>2024</year>
          )
          <fpage>184</fpage>
          –
          <lpage>189</lpage>
          . doi: 10.18421/SAR73-03.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <article-title>Exploration of machine learning based on big data in sports models and physical education teaching</article-title>
          ,
          <source>Molecular &amp; Cellular Biomechanics</source>
          <volume>22</volume>
          (
          <year>2025</year>
          )
          <fpage>940</fpage>
          . doi: 10.62617/mcb940.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <article-title>FIFA 21 complete player dataset</article-title>
          . URL: https://www.kaggle.com/datasets/stefanoleone992/fifa21-complete-player-dataset?resource=download&amp;select=players_15.csv.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Ilnytskyi</surname>
          </string-name>
          ,
          <article-title>Development of a recommender system for audio content using machine learning methods without a teacher</article-title>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Tristan</surname>
          </string-name>
          ,
          <article-title>Research of methods for analyzing customer feedback on employees for the IS of a product company</article-title>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>V. V.</given-names>
            <surname>Aksenov</surname>
          </string-name>
          ,
          <article-title>Research of methods and means of automated team formation for work in IT projects</article-title>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <article-title>rpart function Rdocumentation</article-title>
          . URL: https://www.rdocumentation.org/packages/rpart/versions/4.1.24/topics/rpart.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <article-title>lm function Rdocumentation</article-title>
          . URL: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/lm.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <article-title>svm function RDocumentation</article-title>
          . URL: https://www.rdocumentation.org/packages/e1071/versions/1.7-16/topics/svm.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fu</surname>
          </string-name>
          , et al.,
          <article-title>Predictive modeling for durability characteristics of blended cement concrete utilizing machine learning algorithms</article-title>
          ,
          <source>Case Studies in Construction Materials</source>
          <volume>22</volume>
          (
          <year>2025</year>
          )
          <fpage>e04209</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          , et al.,
          <article-title>A robust framework for accurate land surface temperature retrieval: Integrating split-window into knowledge-guided machine learning approach</article-title>
          ,
          <source>Remote Sensing of Environment</source>
          <volume>318</volume>
          (
          <year>2025</year>
          )
          <fpage>114609</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <article-title>Create Elegant Data Visualisations Using the Grammar of Graphics ggplot2</article-title>
          . URL: https://ggplot2.tidyverse.org/.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>