Nataliia V. Kuznietsova natalia-kpi@ukr.net National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

ave. Beresteiskyi 37 03056 Kyiv

Claude Bernard Lyon 1 University

43 boulevard du 11 Novembre 1918 69622 Villeurbanne cedex

Illia O. Kvashuk illiakvashuk@gmail.com National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

ave. Beresteiskyi 37 03056 Kyiv

Anna O. Chemanova ankachemanova@gmail.com National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"

ave. Beresteiskyi 37 03056 Kyiv

Information Technologies and Security

November 30 2023 Kyiv Ukraine

ISSN 1613-0073

Keywords: car insurance, generalized linear models, scorecard, survival models, claims forecasting

In this paper, several car insurance claims problems are analyzed and solved by applying existing statistical models to real-world datasets. The first problem is measuring the probability of a claim for a specific policy. It is solved using several families of generalized linear models, with survival models as an additional approach to analyzing the data. The best generalized linear model is then chosen according to statistical criteria. The second problem considers distinct classes of policies: the number of claims and the prices are forecasted for the different groups. As with the first problem, generalized linear models are used and the best model is chosen according to a statistical criterion. The third problem is scorecard generation. A brief interpretation of the resulting scorecard is also provided.

Introduction

Insurance activity is usually aimed at protecting the property interests of individuals and legal entities when insured events occur, at the expense of monetary funds formed from the insurance premiums paid by policyholders. One of the main conditions for the effective functioning of the insurance market is the reliability of its participants, the insurance companies. This means supporting the ability of insurance companies operating in the market to fulfill their obligations promptly and in full. That ability is their financial stability, which is the starting point for the actual implementation of the insurance function. The current financial state of insurance companies requires a search for new forms and methods of increasing their competitiveness and financial stability. They need special decision-support systems to assess policies more effectively, forecast the probability of claims more precisely, evaluate possible losses, and develop more flexible conditions for insurance policy evaluation.

The variety of forms in which risk manifests, together with the frequency and complexity of the consequences, creates the need for an in-depth analysis of possible risks and an economic-mathematical justification of the financial policy of insurance companies. For every car insurance company, the importance of proper policy selection for a given client cannot be overestimated. The insurance premium is formed according to how prone the client is expected to be to raise claims and the expected size of those claims [1][2][3].

Information for determining the terms and conditions of policies can be separated into two parts: data concerning the driver and data concerning the car. Age, driving, and length of the insurance policy are the values that define the driver part of the information. However, some aspects, such as the driver's habits, are hard to collect, describe, and analyze. On the other hand, information about cars can be specified and collected according to technical criteria [2]. It can range from the car type to the number of cylinders or airbags. A practical task is to model and forecast claims when information about cars is available in abundance, which requires selection and filtering to find its most relevant parts.

A completely separate issue is creating models that can predict some aspects of a claim based on selected data. One of the most important tasks is to predict the probability of a claim for a specific case. Insurance firms need a proper model for predicting and forecasting claims for different clients. At the same time, those models should be easy to interpret, so that the key factors affecting the terms and conditions of policies can be explained to clients or regulators.

Problem statement

This work concentrates on solving the main problems that appear in the insurance field. The first and foremost task is forecasting the claim expectation for a given client, measured by the probability of a claim. Companies need a way to approximate the chances of claims in order to form policy terms properly for a given client. This task requires taking the client's data into account and forming a decision based on it.

The second task is forecasting the number of claims for each group. This task is important because it is part of a company's policy selection. Groups are created by grouping clients on aggregated values. For these groups, the number of claims can be estimated and forecasting models can be built. The approach can follow two possible scenarios: modeling only the number of claims, or modeling the total spending on a group.

The third task is model creation, which is usually paired with interpretation. Interpretation can provide valuable insight into which values increase the probability of a claim. This allows us to build scorecards, which provide an easy tool for making decisions directly from the data provided by a client. The main objective of this study is to define not only the probability and cost (value) of each claim but also the subset of the most damaged cases.

Methods

The appropriate approach usually depends on the task, but most importantly it is determined by the flow of data extraction and preparation. The same method can be applied to the same data, yet different approaches and pre-processing techniques may affect the results. For example, [3] presents a data flow, data handling, and objectives very similar to those of this work. Data is collected on an open platform. Claims are analyzed and their number is predicted, which matches our task. However, due to the dataset restrictions, preprocessing was added, which yielded comparatively sufficient results but lacked interpretability due to PCA usage.

Generalized linear models

Generalized linear models (GLM) were the main tools used in our research. They provide a unified framework for modeling and forecasting the target variables [3]. Because of the nature of the variables and the different tasks tackled, a number of distribution families were used to approach the problems from different sides and to select the best-fitting one.

The generalized linear model is an extension of the simple linear regression model. A linear relationship between variables is the simplest case for studying links between factors. However, it does not hold for most real-world processes, where the relationship is more complicated than linear. In this case, a link function is introduced. Several different families were used in the research.

The general way of writing down the generalized linear model is as follows:

$$X\beta = g(\mu),$$

where $X$ denotes the independent variables, $\beta$ is the parameter vector, and $g$ is a link function that transforms the scale of the mean $\mu$ of the dependent variable to suit a linear relationship. Generalized linear models can be used for discrete or continuous variables, which gives them a significant advantage.

Logistic regression (LR) is a statistical method used for classifying values into different categories. In this research, logistic regression was used for modeling the claim probability for a single policy:

$$X\beta = \operatorname{logit}(\mu) = \ln\frac{\mu}{1-\mu}.$$

The normal (Gaussian) generalized linear model uses the identity link function, which makes it the same as simple linear regression: $X\beta = \mu$. Poisson regression is a statistical model used when the dependent variable is a count of occurrences. Its link function is the logarithm: $X\beta = \ln(\mu)$.
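As a minimal sketch of how the three link functions above relate the linear predictor to the mean, the snippet below applies the inverse link to a linear predictor. The feature vector and coefficients are hypothetical values chosen for illustration and are not taken from the fitted models in this paper.

```python
import math

def logit(mu):
    """Logit link: maps a probability in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-eta))

def identity(value):
    """Identity link (its own inverse)."""
    return value

# family name -> (link g, inverse link g^{-1})
LINKS = {
    "binomial": (logit, inv_logit),
    "gaussian": (identity, identity),
    "poisson": (math.log, math.exp),
}

def predict_mean(family, x, beta):
    """Return mu = g^{-1}(x . beta) for the given family."""
    eta = sum(xi * bi for xi, bi in zip(x, beta))
    _, inverse_link = LINKS[family]
    return inverse_link(eta)

x = [1.0, 0.5]      # hypothetical: intercept term plus one feature
beta = [-2.0, 1.0]  # hypothetical coefficients

for family in LINKS:
    print(family, round(predict_mean(family, x, beta), 4))
```

The same linear predictor yields a probability under the binomial family, an unrestricted real value under the Gaussian family, and a positive expected count under the Poisson family.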

Survival models

To predict claims or similar events such as deaths or accidents, survival models can be used. They are applicable when the outcome can be traced over some period of time [4].

The simplest form of a survival model is a table in which every event is recorded with the time of its occurrence. It may give significant insight into the time periods when most events occur.
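A table of this kind can be sketched as follows. The example assumes a Kaplan-Meier style estimate of the survival function, which is one common way to build such a table and is not necessarily the procedure used in this study; the durations and event flags are toy values.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Return rows (time, at_risk, n_events, survival) for each event time.

    durations: observed time for each policy
    events:    1 if a claim occurred at that time, 0 if the record is censored
    """
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # every record leaves the risk set at its duration
    at_risk = len(durations)
    survival = 1.0
    table = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            survival *= 1.0 - d / at_risk
            table.append((t, at_risk, d, survival))
        at_risk -= exit_counts[t]
    return table

# Toy data (hypothetical): policy tenure and claim indicator
durations = [0.2, 0.5, 0.5, 0.8, 1.0, 1.0, 1.2]
events = [1, 0, 1, 0, 1, 1, 0]

table = kaplan_meier(durations, events)
for time, at_risk, n_events, survival in table:
    print(time, at_risk, n_events, round(survival, 4))
```

Sharp drops in the last column point to the time periods in which events concentrate.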

Scorecards

Scorecards are special tables constructed to provide a score for every feature value. Summing the scores for a record gives its total points. By assigning levels to the total score, each record can be placed into one of the preselected categories.
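The lookup-and-sum mechanics can be sketched as below. The bins and scores are a few rows copied from the scorecard reported in Table 7; the cutoffs and band labels are hypothetical, since the paper does not state how total scores are mapped to categories.

```python
import math
from bisect import bisect_right

# Numeric variables: (lower bound inclusive, upper bound exclusive, score)
NUMERIC_BINS = {
    "policy_tenure": [
        (-math.inf, 0.211309751692924, 5.3),
        (0.211309751692924, 0.813392835491761, 1.49),
        (0.813392835491761, math.inf, -3.86),
    ],
}

# Categorical variables: category -> score
CATEGORY_SCORES = {
    "rear_brakes_type": {"Drum": 0.05, "Disc": 0.26},
    "is_tpms": {"No": 0.05, "Yes": 0.26},
}

def feature_score(variable, value):
    """Look up the score of one feature value."""
    if variable in NUMERIC_BINS:
        for low, high, score in NUMERIC_BINS[variable]:
            if low <= value < high:
                return score
        raise ValueError(f"no bin for {variable}={value}")
    return CATEGORY_SCORES[variable][value]

def total_score(record):
    """Sum the scores of all features in a record."""
    return sum(feature_score(var, val) for var, val in record.items())

def assign_level(score, cutoffs, labels):
    """Map a total score to a band; cutoffs must be sorted ascending."""
    return labels[bisect_right(cutoffs, score)]

record = {"policy_tenure": 0.5, "rear_brakes_type": "Disc", "is_tpms": "No"}
score = total_score(record)
# Hypothetical cutoffs and labels, for illustration only
level = assign_level(score, cutoffs=[0.0, 3.0], labels=["band 1", "band 2", "band 3"])
print(round(score, 2), level)
```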

Scorecards are powerful practical tools for quickly identifying high-risk policies [4]. They are built using the weight of evidence (WoE), the information value (IV), and the population stability index (PSI):

$$\mathrm{WoE} = \ln\frac{\%\ \text{of non-events}}{\%\ \text{of events}}.$$
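A minimal sketch of the WoE and IV calculation for one binned variable is given below. It assumes the common definitions WoE = ln(% of non-events / % of events) and IV = sum over bins of (% of non-events - % of events) * WoE; the bin names and counts are hypothetical.

```python
import math

def woe_iv(bins):
    """Compute WoE per bin and the total IV.

    bins: {bin_name: (n_non_events, n_events)}
    Bins with zero events or zero non-events are not handled in this sketch.
    """
    total_non_events = sum(n for n, _ in bins.values())
    total_events = sum(e for _, e in bins.values())
    woe = {}
    iv = 0.0
    for name, (n, e) in bins.items():
        share_non_events = n / total_non_events
        share_events = e / total_events
        woe[name] = math.log(share_non_events / share_events)
        iv += (share_non_events - share_events) * woe[name]
    return woe, iv

# Hypothetical counts of policies without and with a claim per bin
bins = {"new": (400, 20), "mid": (400, 40), "old": (200, 40)}
woe, iv = woe_iv(bins)
for name, value in woe.items():
    print(name, round(value, 4))
print("IV", round(iv, 4))
```

A bin whose share of events equals its share of non-events gets a WoE of zero and contributes nothing to the IV.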

Other methods and models

The field of insurance prediction is large and rich in approaches and methods that are effective for forecasting the probability of claims [5][6][7][8][9]. Some methods not only cover the same objective as the current study but are also applied to handling more financially oriented data, handling missing data, and combining the results of several models [5][6][7][8][9]. A brief overview of these methods is given below, and the results are summarized in Table 1.

Decision Tree

A decision tree (DT) and its variations form a family of classification methods built on a tree structure that handles the decision-making process through a binary decision at each step. This allows the method to be applied to data with non-linear relations between features and target variables.

There are several extensions of the basic model: random forest and CART models, as well as multivariable trees. An example of such research is the work [22]. The basic tree is used for classification tasks, so comparing it directly with the regression family of methods is not straightforward. Some tasks, such as determining whether a claim will happen at all, can be approached by both methods, but for predicting a continuous variable only one of them can be used.

The simplest model is straightforward: each node checks a feature and directs the record to one of two possible branches until a final node is reached. However, this model is not suitable for complex data, since it tends to overfit and variable selection can be biased.

One very popular extension, also covered in [22], is CART (Classification and Regression Trees). It overcomes the limitation of the original model by making it possible to model and predict continuous (regression) variables without restricting the original capabilities for categorical targets.

Another method is Random Forest. It combines several decision trees, which can themselves be regression trees, and arrives at a single decision by weighting their outputs. It can be seen as a statistical machine learning algorithm.

A further development, perhaps less widespread in insurance but noteworthy, is multivariable trees, which handle multivariable response variables.

Machine Learning

Support Vector Machines (SVM) is a supervised learning method that provides solutions for both classification and regression problems. Its idea is based on a hyperplane: a subspace of one dimension fewer than the feature space that serves as the decision boundary separating the target points.

One of the key features of SVM is that, when no separating hyperplane can be found in the current domain, the inputs can be mapped to a higher-dimensional space in which a hyperplane can be found. This allows us to overcome the obstacles posed by the original space. SVM can also be modified to solve regression tasks, as in [23].
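The idea of moving to a higher dimension can be illustrated with a toy example. This is not an SVM training procedure: the points, the feature map x -> (x, x^2), and the hyperplane parameters are all hand-picked assumptions that only show why a mapping can make classes separable.

```python
# Toy 1-D data: the negative class sits between two groups of positives,
# so no single threshold on x separates the classes.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

def separable_by_threshold(values, labels):
    """True if one threshold on `values` splits the two classes perfectly."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == -1]
    return max(positives) < min(negatives) or max(negatives) < min(positives)

def feature_map(x):
    """Hypothetical explicit map from 1-D to 2-D."""
    return (x, x * x)

def decision(x, w=(0.0, 1.0), b=-2.0):
    """Sign of w . phi(x) + b for a hand-picked (not trained) hyperplane."""
    phi = feature_map(x)
    value = w[0] * phi[0] + w[1] * phi[1] + b
    return 1 if value > 0 else -1

second_coordinate = [feature_map(x)[1] for x in points]
print(separable_by_threshold(points, labels))             # original 1-D input
print(separable_by_threshold(second_coordinate, labels))  # after the mapping
print([decision(x) for x in points] == labels)
```

In practice the mapping is usually applied implicitly through a kernel function, and the hyperplane is found by optimization rather than chosen by hand.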

Dataset

The data needed for model creation were obtained from a car insurance dataset provided on the Kaggle website [24]. The dataset is oriented toward the technical aspects of the car, with most variables describing physical parts of the vehicle for which the policy is issued.

The dataset consists of two parts, used for training and testing, which contain 58592 and 39063 records respectively. Each record is a unique policy with information about the policy owner and the car. The dataset indicates whether there was a claim during the upcoming 6 months of the insurance. This was the target value during the first stage of modeling. In addition, the dataset contains a range of different features, with the total number of variables equal to 44.

For further research, records were grouped by several variables in order to form groups of clients and policies for which modeling was performed.

It is important to note that the dataset contains no financial data: there is no information about car prices or the insurance premiums for a policy.

Modeling Results

Modeling and forecasting were done using generalized linear models of the binomial, Gaussian, and Poisson types. A scorecard was also generated to assist in decision-making and interpretation of the results.

Modeling of the claim probability was done by building two models, binomial and Gaussian. The comparison presented in Table 2 has shown that the Gaussian model performs significantly better. From the 44 variables, several were selected based on the correlogram and common sense:

- age_of_car: how old the car is;
- policy_tenure: the length of the policy to date;
- area_cluster: the area where the policyholder does most of the driving;
- make: the car's manufacturer;
- atr: a synthesized variable based on the car's features (extra airbags, lamps, etc.);
- ncap_rating: the car's safety rating given by the agency.

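The target relationship is then represented by the following formula:

$$\text{is\_claim} = g^{-1}(k_0 + k_1 \cdot \text{age\_of\_car} + k_2 \cdot \text{policy\_tenure} + k_3 \cdot \text{area\_cluster} + k_4 \cdot \text{make} + k_5 \cdot \text{atr} + k_6 \cdot \text{ncap\_rating}),$$

where $k_0, \dots, k_6$ are the model coefficients and $g^{-1}$ is the inverse of the link function $g$.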

Judging by the distribution of the predicted values, there are no significant outliers in the data. The maximum claim probability over the whole dataset according to the model does not exceed 0.2. This can be interpreted as uncertainty in the provided data: there are records that match on all key features, some with claims and some without. Taken together, this undermines the value of concentrating on a single record.

The confusion matrices further highlight the problem with such an approach. With a threshold of 0.1, it was apparent that the models underperform, as shown in Table 3 for the binomial model and in Table 4 for the normal distribution. It was decided that high-quality forecasts cannot be derived from the existing data when claims prediction is done at the level of a single policy.

In the next stage, modeling was based on survival theory. It is possible to construct a survival model in which each claim is treated as the death of a member of the population. The length of the policy serves as the measure of time. Thus, a policy "survives" over the policy length interval until a claim occurs.

A Cox proportional hazards model was then built:

coxph(formula = Surv(policy_tenure, is_claim) ~ F(age + area + population_density), data = car_insurance_tibble),

where n = 58592 and the number of events is 3748.

The survival curve shows that the length of the policy indeed affects the number of claims, but this observation is rather trivial and cannot be used for decision-making, since it would only suggest preferring short-term policies (Figure 1). Nevertheless, a relationship between the length of the policy and the frequency of claims was revealed: at durations of 1, 1.6, and 1.7 years there is a sharp increase in claims. It is possible to segment policies by these thresholds and to focus on them in the future. The survival model also makes it easier to determine the duration of the riskiest policies and to define possible rules for new policies. The calculation for individual cases (a claim for each policy separately) showed the absence of parameters and characteristics that would accurately indicate the onset of a claim. All probabilities for individual policies lie between 0.001 and 0.12. Therefore, a decision was made to proceed to the consideration of individual segments.

Data were grouped by segment, manufacturer, and car brand. Thus, we moved from looking at an individual car to the segment as a whole, where individual characteristics are of little importance.

Two values can be calculated for segments:

1. The number of claims in the segment.

2. The amount of payments in the segment.

It was decided to continue working with groups. The Poisson generalized linear model was chosen to forecast the number of cases. It showed a high level of accuracy.

The equation for modeling the relationship was as follows:

$$\text{evaluated\_N} = g^{-1}(\text{total} + \text{segment} + \text{area\_cluster} + \text{airbags} + \text{make} + \text{policy\_tenure} + \text{age\_of\_car}),$$

where $g^{-1}$ is the inverse of the link function. As can be seen in Figures 3 and 4, the prediction of the number of claims across groups is of higher quality. This also shows that, despite the low ability to predict each individual case, predicting for a group is a much easier task.

Another approach was also applied to the groups: forecasting the total price of all cars for which claims were issued. The Gaussian model was used as the most appropriate. It also showed significant accuracy (Table 6).

In addition, car price data were missing from the dataset. For this model, the following approach was used:

1. Find the average price for every class.

2. Adjust it according to the attribute feature.

3. Group the price per category to create a new feature, total (price).

The equation for modeling is as follows:

$$\text{paid\_price} = g^{-1}(\text{total} + \text{group\_price} + \text{segment} + \text{policy\_tenure} + \text{age\_of\_car}),$$

where $g^{-1}$ is the inverse of the link function.

We need to understand which variables, and which intervals of these variables, are the most significant for our insurance task. Information value (IV) is one of the most useful techniques for selecting important variables in a predictive model, as it helps to rank the variables by importance. Figure 5 shows how many claims there were and how they relate to different values of the car's age.

Finally, the scorecard was built (Table 7). It provides information about the values associated with a high risk of a claim for this dataset. Non-significant values were filtered out. The remaining continuous variables, the age of the policyholder and the policy tenure, were binned. Categorical variables are also present: the area cluster, named in the initial dataset and ranging from C1 to C22, and variables related to technical aspects, such as rear-view mirror availability and functionality, brake type, and transmission type.

Conclusion

Today, car insurance companies require a lot of information to decide on policies and conditions [5].

Even though a vast amount of information can be collected, this does not guarantee the ability to create a model that predicts a claim for a specific policy with a significant level of accuracy, because of the random nature of claims. Some special cases that are more or less prone to claims can be selected, but according to the results this does not allow a robust prediction. Among the models built for probability prediction, the Gaussian generalized linear model was chosen. It shows that the occurrence of a claim cannot be determined from specific features or their combinations, since for the same key variables there are examples of policies both with and without claims. The obtained values are highly concentrated, which does not allow intervals to be selected for confident claim prediction and hence undermines the usefulness of such an approach. The problem of single-claim prediction is the hardest one. For claims risk management, we need to forecast the probability of each claim and of each type of claim, and to develop a scorecard that presents the key features automatically in an understandable and easily interpretable manner.

More promising are the results for groups of claims, where policies with similar features are combined into the same group. Such groups can be modeled and forecasted with a higher degree of accuracy with respect to the number of claims or the total price of the cars for which claims have been made. Overall, the results show a low ability to predict specific cases but relatively high confidence in forecasting for big groups.

It is worth noting that other methods, such as Random Forest, could perform better at predicting claims per observation, which can be examined in subsequent research.

Finally, the scorecard is a high-quality tool for making decisions about clients directly. It is easy not only to interpret but also to use. We used the scorecard to determine the key features in an understandable and easily interpretable manner. It yields great results on the grouped data and provides valuable insights into the tendencies. Scorecards are also a useful tool for big data tasks in telecom and finance [25,26], where scores and the influence of characteristics need to be evaluated as well.

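$$\mathrm{IV} = \sum (\%\ \text{of non-events} - \%\ \text{of events}) \cdot \mathrm{WoE}.$$

$$\mathrm{PSI} = \sum (A - B) \cdot \ln(A / B),$$

where $A$ is the % of records based on the scoring variable in the scoring sample and $B$ is the % of records based on the scoring variable in the training sample.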
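The PSI calculation can be sketched as follows, assuming the usual form in which the term (A - B) * ln(A / B) is summed over the bins of the scoring variable. The bin shares are hypothetical.

```python
import math

def psi(scoring_shares, training_shares):
    """Population stability index: sum over bins of (A - B) * ln(A / B).

    scoring_shares:  share of records per bin in the scoring sample (A)
    training_shares: share of records per bin in the training sample (B)
    Shares must be strictly positive in this sketch.
    """
    return sum(
        (a - b) * math.log(a / b)
        for a, b in zip(scoring_shares, training_shares)
    )

# Hypothetical bin shares of one scoring variable
training = [0.25, 0.25, 0.25, 0.25]
scoring = [0.20, 0.30, 0.25, 0.25]

print(round(psi(scoring, training), 4))
print(psi(training, training))
```

Identical distributions give a PSI of zero, and the value grows as the scoring sample drifts away from the training sample.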

Figure 1: Survival model for the insurance policies

Figure 2: Real (black) and estimated (green) values plotted together

Figure 3: Value of claimed cars (black) and estimated (green)

Figure 4: Against segments to which cars belong A-Utility

Figure 5: Informational value of the variable age_of_car

Table 1: Comparison of different methods used for the insurance field

| Article & year | Purpose | Algorithms | Performance metrics | The best model |
|---|---|---|---|---|
| (Smith et al. 2000) [9] | Classification to predict customer retention patterns | Decision tree (DT), Neural Networks (NN) | Accuracy, ROC | Neural Networks |
| (Günther et al. 2014) [10] | Classification to predict the risk of leaving | Logistic regression and GAMS | ROC | Logistic regression |
| (Weerasinghe and Wijegunasekara 2016) [11] | Classification to predict the number of claims (low, fair, or high) | LR, DT, NN | Precision, Recall, Specificity | Neural networks |
| (Fang et al. 2016) [12] | Regression to forecast insurance customer profitability | Random Forest (RF), LR, DT, Support Vector Machines (SVM), Gradient Boosting (GB) | R-squares, RMSE | Random Forest |
| (Subudhi and Panigrahi 2017) [13] | Classification to predict insurance fraud | Decision trees, SVM, Multilayer Perceptron (MLP) | Sensitivity, Specificity, Accuracy | SVM |
| (Mau et al. 2018) [14] | Classification to predict churn, retention, and cross-selling | Random Forest | Accuracy, AUC, ROC, F-score | RF |
| (Jing et al. 2018) [15] | Classification to predict claims occurrence | Naive Bayes, Bayesian Network | Accuracy | Both have the same accuracy |
| (Kowshalya and Nandhini 2018) [16] | Classification to predict insurance fraud and percentage of premium amount | J48, RF, Naive Bayes | Accuracy, Precision, Recall | Random Forest |
| (Sabbeh 2018) [17] | Classification to predict churn problem | RF, AdaBoost, MLP, Stochastic GB, SVM, K-nearest Neighbor (KNN), DT, Naive Bayes, LR, Linear Discriminant Analysis (LDA) | Accuracy | AdaBoost |
| (Stucki 2019) [18] | Classification to predict churn and retention | LR, RF, KNN, Ada Boosting Trees, NN | Accuracy, F-Score, AUC | Random Forest |
| (Dewi et al. 2019) [19] | Regression to predict claims severity | Random forest | MSE | Random Forest |
| (Pesantez-Narvaez et al. 2019) [20] | Classification to predict claims occurrence | XGBoost, Logistic regression | Sensitivity, Specificity, Accuracy, RMSE, ROC | XGBoost |
| (Abdelhadi et al. 2020) [21] | Classification to predict claims occurrence | J48, NN, XGBoost, Naive Bayes | Accuracy, ROC | XGBoost |

Table 2: Models' comparison

| Model | Residuals | AIC |
|---|---|---|
| Binomial | 3475.5 | 828.34 |
| Gaussian | 27300 | 27364 |

Table 3: Confusion matrix (binomial)

| Actual \ predictions | 0 | 1 |
|---|---|---|
| 0 | 50039 | 4805 |
| 1 | 3161 | 587 |

Table 4: Confusion matrix (normal)

| Actual \ predictions | 0 | 1 |
|---|---|---|
| 0 | 51791 | 3053 |
| 1 | 3371 | 377 |
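The counts in Tables 3 and 4 can be summarized with standard classification metrics, as in the sketch below. It assumes that rows are actual classes and columns are predicted classes, following the "Actual \ predictions" header, and that class 1 means a claim; both matrices then sum to the 58592 records of the training part and contain 3748 actual claims.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts read from Table 3 (binomial) and Table 4 (normal)
binomial = metrics(tn=50039, fp=4805, fn=3161, tp=587)
normal = metrics(tn=51791, fp=3053, fn=3371, tp=377)

for name, result in (("binomial", binomial), ("normal", normal)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Under these assumptions accuracy is dominated by the large no-claim class, while precision and recall for the claim class stay low for both models, which is consistent with the conclusion that single-policy prediction underperforms.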

Table 5

| Model | Residuals | AIC |
|---|---|---|
| Poisson | 629.27 | 1425.6 |
| Normal | 1425.6 | 5952.5 |

Table 6: Result

| Model | Residuals | AIC |
|---|---|---|
| Normal | 8.4797e+11 | 5886.3 |

Table 7: Scorecard for a claim's prediction

| Number of interval | Variable | Binning | Score |
|---|---|---|---|
| 0 | age_of_policyholder | [-inf ~ 0.384615384615385) | 0.74 |
| 1 | age_of_policyholder | [0.384615384615385 ~ 0.442307692307692) | 0.06 |
| 2 | age_of_policyholder | [0.442307692307692 ~ 0.490384615384615) | -0.51 |
| 3 | age_of_policyholder | [0.490384615384615 ~ 0.634615384615385) | 0.25 |
| 4 | age_of_policyholder | [0.634615384615385 ~ inf) | -0.84 |
| 0 | area_cluster | C17,C20,C9,C7,C1,C10,C15 | 2.61 |
| 1 | area_cluster | C16,C13,C5,C12,C6 | 1.13 |
| 2 | area_cluster | C11,C3,C2,C8 | -0.79 |
| 3 | area_cluster | C4,C19,C14,C22,C21,C18 | -2.13 |
| 0 | policy_tenure | [-inf ~ 0.211309751692924) | 5.3 |
| 1 | policy_tenure | [0.211309751692924 ~ 0.813392835491761) | 1.49 |
| 2 | policy_tenure | [0.813392835491761 ~ inf) | -3.86 |
| 0 | is_day_night_rear_view_mirror | No | 0 |
| 1 | is_day_night_rear_view_mirror | Yes | 0.26 |
| 0 | steering_type | Manual,Power | 0.05 |
| 1 | steering_type | Electric | 0.17 |
| 0 | rear_brakes_type | Drum | 0.05 |
| 1 | rear_brakes_type | Disc | 0.26 |
| 0 | is_tpms | No | 0.05 |
| 1 | is_tpms | Yes | 0.26 |
| 0 | make | [-inf ~ 2) | 0.05 |
| 1 | make | [2 ~ inf) | 0.19 |
| 0 | transmission_type | Manual | 0.05 |

References

[1] M. Denuit, X. Marechal, S. Pitrebois, J.-F. Walhin, Actuarial modelling of claim counts: risk classification, credibility and bonus-malus systems, Wiley, 2007.

[2] R. Verbelen, K. Antonio, G. Claeskens, Unravelling the predictive power of telematics data in car insurance pricing, Journal of the Royal Statistical Society, Series C (Applied Statistics) 67(5) (2018). doi:10.1111/rssc.12283.

[3] J. A. Nelder, R. W. M. Wedderburn, Generalized linear models, Journal of the Royal Statistical Society: Series A (General) 135(3) (1972). doi:10.2307/2344614.

[4] N. V. Kuznietsova, P. I. Bidyuk, Theory and practice of financial risk analysis: systemic approach, Lira-K, Kyiv, 2020.

[5] V. Selvakumar, D. K. Satpathi, P. T. V. Praveen Kumar, V. V. Haragopal, Predictive Modeling of Insurance Claims Using Machine Learning Approach for Different Types of Motor Vehicles, Universal Journal of Accounting and Finance 9(1) (2021). doi:10.13189/ujaf.2021.090101.

[6] C. Ye, L. Zhang, M. Han, Y. Yu, B. Zhao, Y. Yang, Combining Predictions of Auto Insurance Claims, Econometrics 10(2) (2022) 19. doi:10.3390/econometrics10020019.

[7] W. Yu, G. Guan, J. Li, Q. Wang, X. Xie, Y. Zhang, Y. Huang, X. Yu, C. Cui, Claim Amount Forecasting and Pricing of Automobile Insurance Based on the BP Neural Network, Complexity (Hindawi) (2021). doi:10.1155/2021/6616121.

[8] M. Hanafy, R. Ming, Improving Imbalanced Data Classification in Auto Insurance by the Data Level Approaches, (IJACSA) International Journal of Advanced Computer Science and Applications 12(6) (2021). doi:10.14569/IJACSA.2021.0120656.

[9] K. A. Smith, R. J. Willis, M. Brooks, An analysis of customer retention and insurance claim patterns using data mining: a case study, Journal of the Operational Research Society 51 (2000). doi:10.1057/palgrave.jors.2600941.

[10] C.-C. Günther, I. F. Tvete, K. Aas, G. I. Sandnes, Ø. Borgan, Modelling and predicting customer churn from an insurance company, Scandinavian Actuarial Journal 2014(1) (2014). doi:10.1080/03461238.2011.636502.

[11] K. P. M. L. P. Weerasinghe, M. C. Wijegunasekara, A Comparative Study of Data Mining Algorithms in the Prediction of Auto Insurance Claims, European International Journal of Science and Technology 5(1) (January 2016).

[12] K. Fang, Y. Jiang, M. Song, Customer profitability forecasting using Big Data Analytics: A case study of the insurance industry, Computers & Industrial Engineering (2016). doi:10.1016/j.cie.2016.09.011.

[13] S. Subudhi, S. Panigrahi, Use of Optimized Fuzzy C-Means Clustering and Supervised Classifiers for Automobile Insurance Fraud Detection, Journal of King Saud University - Computer and Information Sciences (2017). doi:10.1016/j.jksuci.2017.09.010.

[14] S. Mau, I. Pletikosa, J. Wagner, Forecasting the next likely purchase events of insurance customers: A case study on the value of data-rich multichannel environments, International Journal of Bank Marketing 36(6) (2018). doi:10.1108/IJBM-11-2016-0180.

[15] L. Jing, W. Zhao, K. Sharma, R. Feng, Research on Probability-based Learning Application on Car Insurance Data, in: Proceedings of the 2017 4th International Conference on Machinery, Materials and Computer (MACMC 2017), Atlantis Press, Amsterdam, 2018. doi:10.2991/macmc-17.2018.14.

[16] G. Kowshalya, M. Nandhini, Predicting fraudulent claims in automobile insurance, in: Proceedings of the 2018 Second International Conference on Inventive Communication and Computational Technologies (ICICCT), Coimbatore, India, April 20-21, 2018. doi:10.1109/ICICCT.2018.8473034.

[17] S. F. Sabbeh, Machine-learning techniques for customer retention: A comparative study, International Journal of Advanced Computer Science and Applications 9(2) (2018).

[18] O. Stucki, Predicting the Customer Churn with Machine Learning Methods: Case: Private Insurance Customer Data, Master's dissertation, LUT University, Lappeenranta, Finland, 2019.

[19] K. C. Dewi, H. Murfi, S. Abdullah, Analysis Accuracy of Random Forest Model for Big Data - A Case Study of Claim Severity Prediction in Car Insurance, paper presented at 2019 5th International Conference on Science in Information Technology (ICSITech), Yogyakarta, Indonesia, October 23-24, 2019. doi:10.1109/ICSITech46713.2019.8987520.

[20] J. Pesantez-Narvaez, M. Guillen, M. Alcañiz, Predicting Motor Insurance Claims Using Telematics Data - XGBoost versus Logistic Regression, Risks 7(2) (2019) 70. doi:10.3390/risks7020070.

[21] S. Abdelhadi, K. Elbahnasy, M. Abdelsalam, A proposed model to predict auto insurance claims using machine learning techniques, Journal of Theoretical and Applied Information Technology 98(22) (30th November 2020).

[22] Z. Quan, E. A. Valdez, Predictive analytics of insurance claims using multivariate decision trees, Depend. Model. 6 (2018). doi:10.1515/demo-2018-0022.

[23] E. Alamir, T. Urgessa, A. Hunegnaw, T. Gopikrishna, Motor Insurance Claim Status Prediction using Machine Learning Techniques, (IJACSA) International Journal of Advanced Computer Science and Applications 12(3) (2021). doi:10.14569/IJACSA.2021.0120354.

[24] Kaggle. URL: https://www.kaggle.com/datasets/ifteshanajnin/carinsuranceclaimprediction-classification?resource=download

[25] N. Kuznietsova, P. Bidyuk, M. Kuznietsova, Data Mining Methods, Models and Solutions for Big Data Cases in Telecommunication Industry, Lecture Notes on Data Engineering and Communications Technologies 77 (2022). doi:10.1007/978-3-030-82014-5_8.

[26] N. Kuznietsova, E. Bateiko, Analysis and Development of Mathematical Models for Assessing Investment Risks in Financial Markets, CEUR Workshop Proceedings 3503.