
Determinants of social trust: analysis using machine learning methods

Tamara Merkulova

tamara.merkulova@karazin.ua

Hanna Bohdanova

hanna.bohdanova@gmail.com

2015


This paper presents the results of testing individual-based and society-based hypotheses of interpersonal trust and of clarifying the relationship between institutional trust and individual and societal characteristics, using the latest data of the World Values Survey (2017-2021) and machine learning methods. The initial sample consisted of 70,867 respondents. These data were used to develop models of interpersonal and institutional trust. Factors that can be considered determinants of social trust were studied using classification models (for both interpersonal and institutional trust) and cluster analysis (for trust in government). Classification makes it possible to recognize the class (a level of trust) to which a respondent belongs according to a range of factors (predictors). We defined 2 classes in accordance with the responses: people who trust strangers (or the government) and people who do not. Classification models were developed with various sets of predictors (determinants of trust): individual characteristics, societal indicators, and a mixed composition of determinants. The best results for interpersonal trust as well as for trust in government were obtained in classification models with mixed sets of predictors. Cluster analysis clarified which individual and societal characteristics are associated with a high or low level of trust in government. The results of this research can, to a certain extent, serve as arguments in favor of the multilevel approach to the determinants of social trust, which takes into account the essential role of individual and societal factors for both interpersonal and institutional trust.

Interpersonal trust; Institutional trust; Machine Learning; Clustering; Classification models

I Introduction

The study of trust, its origins, and its relationship with the development of society and the economy is a broad area of interdisciplinary research carried out within various scientific schools. Social trust is often referred to as the keystone of social capital (Newton, 2004), (Rothstein & Stolle, 2008) and is considered a powerful resource for socio-economic development that increases stability, fairness, and harmony in society (Roth, 2006), (Bjornskov, 2012), (Algan & Cahuc, 2013).

Social trust has 2 types: interpersonal trust and institutional trust. Interpersonal trust comprises in-group trust (trust between members of a group, for instance family members, friends, or colleagues) and trust in strangers, which is considered generalized trust (Kwon, 2019).

Studies of the factors that determine interpersonal trust are based on the ideas of the individual-based theory and the society-based theory (Algan & Cahuc, 2013), (Delhey & Newton, 2005), (Kwon, 2019). The first considers interpersonal trust an individual property that is determined by individual characteristics such as education, gender, age, and income. The society-based theory assumes that interpersonal trust is a property of society and depends on social, economic, cultural, national, historical, and other factors that characterize society as a whole.

Both theories have arguments for and against them, obtained in numerous studies. As noted in (Newton, 2004), although many investigations focus on social trust at the individual level, social trust is not closely associated with individual characteristics such as sex, income, or education. Using data from the third wave of the World Values Survey, that study showed that social trust is closely related to a range of societal indicators associated with the development of democracy and sustainability.

Studies of the factors that underpin generalized trust in society have revealed 4 macro-level indicators that influence trust: economic inequality, civic participation, ethnic homogeneity, and institutional quality (Delhey & Newton, 2005), (Charron & Rothstein, 2014), (Rothstein & Uslaner, 2005), (Roth, 2006). In (Rothstein & Uslaner, 2005), (Roth, 2006) income inequality is considered an essential determinant of a low level of interpersonal trust. High-trust countries are at the same time high-income countries with good governance, a low level of income inequality, and ethnic homogeneity (Delhey & Newton, 2005). This combination of factors is presented most impressively in the Nordic countries. The analysis of regions in Europe (Charron & Rothstein, 2014) showed that the quality of institutions is the most essential factor determining the regional dispersion of trust within a country, whereas economic inequality, civic participation, and ethnic homogeneity are not very important for explaining the variation in trust.

Data presented in (Tsai, Laczko, & Bjørnskov, 2011) do not support the hypothesis that social diversity (ethnic, linguistic, and religious) leads to a decrease in the level of trust, at least in the short term. The results highlight the complex interaction of many factors that determine generalized trust in society. Arguments in favor of a positive influence of the state on trust are discussed in (Robbins & Blaine, 2011). The state creates an environment that can enhance social trust; in particular, the public allocation of resources and property rights institutions have a positive effect on generalized trust.

Studies that test individual-based hypotheses of social trust provide evidence that interpersonal trust is associated with individual characteristics. In (Adwere-Boamah & Hufstedler, 2015) the authors present regression models in favor of the assumption that education and sex are essential factors of interpersonal trust. The study (Almakaeva, Welzel, & Ponarin, 2018) revealed that human empowerment could be considered a moderator of individual-level determinants of trust.

Research on trust also covers cultural, religious, and moral factors that can be essential determinants of social trust. The influence of the Protestant tradition is discussed in (Delhey & Newton, 2005). Results of statistical analysis in favor of the assumption that religion is a significant factor are presented in (Uslaner, 2002).

Institutional trust shows whether citizens have confidence in institutions. Citizens evaluate institutions according to their expectations of the effectiveness and fairness that institutions should demonstrate.

“Citizens expect institutions to perform efficiently, effectively, fairly, and ethically in accordance with the roles assigned to them by law or with social norms in the eyes of citizens” (Kwon, 2019, p. 28).

Thus, citizens' trust in institutions is based a) on the ability of institutions to perform the functions assigned to them in accordance with law and social norms (competence of institutions), and b) on the acceptability of institutional operations in terms of moral criteria.

Therefore, institutional trust has 2 dimensions: the competence dimension is associated with the efficiency and effectiveness of institutions (this can be represented by macroeconomic indicators), and the value dimension includes fairness, transparency, absence of corruption, and other moral values (Kwon, 2019). Trust in government is one of the most important types of institutional trust from the perspective of the legitimacy of government and other political institutions (Knah, 2016).

Thus, the influence of various individual and group (social) characteristics on social trust has not been completely clarified and requires further research. Machine learning provides tools to study this problem on the large data set of the World Values Survey, which includes direct questions on trust in strangers and trust in government.

Our tasks include testing individual-based and society-based hypotheses of interpersonal trust and clarifying the relationship between institutional trust and individual and societal characteristics on the latest data of the World Values Survey.

II Methodology and Data

This study uses data from the World Values Survey (the World Values Survey, 2017-2021). The World Values Survey (WVS) is an international research program that analyzes a wide range of indicators across social, political, economic, religious, and cultural groups. This project evaluates the impact of values on the social, political, and economic development of countries. Waves of research are repeated every 5 years. In this study, we used the data of the 7th wave, which took place in 80 countries of the world in 2017-2021.

The WVS data were used to build models of interpersonal and institutional trust. The initial sample consisted of 70,867 respondents. Hypotheses about the determinants of these types of trust were tested using machine learning methods.

Interpersonal trust

At the first stage, when constructing models of interpersonal trust (measured by the question “Generally speaking, would you say that most people can be trusted or that you need to be very careful in dealing with people?”), only individual characteristics were used as predictors: "Sex", "Age", "Education", "Satisfaction_with_life" (on a scale from 1, “completely dissatisfied”, to 10, “completely satisfied”), "Employment_status" (binarized: 1 if the respondent works (full-time, part-time, or self-employed) and 0 if they do not work (a retiree, a student, a housewife, etc.)), "Satisfaction_with_financial_situation_of_household" (on a scale from 1, “completely dissatisfied”, to 10, “completely satisfied”), "Marriage", and "Religion" (“How important is God in your life?”, on a scale from 1, “not at all important”, to 10, “very important”).
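The binarization of "Employment_status" described above can be sketched in Python as follows. This is only an illustration: the category labels in `WORKING` and the helper name `binarize_employment` are assumptions made for the example and do not reproduce the actual WVS answer codes.

```python
# Hypothetical category labels; the real WVS answer codes may differ.
WORKING = {"full-time", "part-time", "self-employed"}


def binarize_employment(status: str) -> int:
    """Return 1 if the respondent works, 0 otherwise."""
    return 1 if status.strip().lower() in WORKING else 0


statuses = ["Full-time", "Retired", "Student", "Self-employed", "Housewife"]
encoded = [binarize_employment(s) for s in statuses]
print(encoded)
```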

Classification models were used to identify the presence of a relationship between individual characteristics and interpersonal trust.

Then, we expanded the range of predictors by adding factors that can be considered characteristics of society and institutions: "Corruption" (How would you place your views on corruption in your country on a 10-point scale where 1 means “there is no corruption in my country” and 10 means “there is abundant corruption in my country”), "Migration" (How would you evaluate the impact of the people from other countries who come to live in [your country] - the immigrants on the development of [your country]?), "Security" (Could you tell me how secure do you feel these days?), "Democratically" (How important is it for you to live in a country that is governed democratically?).

After that, we compared the quality of classification models constructed for two sets of predictors.

Institutional trust

The study used an indicator of trust in government (“How much confidence do you have in the government: a great deal of confidence, quite a lot of confidence, not very much confidence, or none at all?”). The following hypotheses were tested: 1) Institutional trust is dependent on individual characteristics; 2) Institutional trust is dependent on the characteristics of society and the quality of institutions; 3) Institutional trust is dependent on a mixed composition of predictors.

The same set of individual characteristics was used for the models of interpersonal trust and of institutional trust. The following institution-related features were used: "Corruption", "Security", and "Democracy". These features reflect citizens' opinions on the degree to which each feature is realized in their country.

We also added another indicator to the characteristics of society and the quality of institutions: "Ethnic_group". This feature indicates the ethnic group of the respondent, with the answer options: 1. White, 2. Black, 3. South Asian Indian, Pakistani, etc., 4. East Asian Chinese, Japanese, etc., 5. Arabic, Central Asian, 6. Other.
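A categorical feature such as "Ethnic_group" is usually converted into 0/1 indicator (dummy) columns before it enters a classifier. The following sketch shows one way to do this in plain Python; the helper `one_hot`, the column prefix, and the sample values are assumptions for illustration, not the authors' code.

```python
def one_hot(values, prefix):
    """Build one 0/1 indicator column per category found in `values`."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{cat}": [1 if v == cat else 0 for v in values]
        for cat in categories
    }


# Hypothetical sample of answers.
ethnic = ["White", "Black", "White", "Other"]
dummies = one_hot(ethnic, "Ethnic_group")

for column, flags in dummies.items():
    print(column, flags)
```

Each respondent has exactly one indicator set to 1, so the indicator columns of a row always sum to 1.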

For institutional trust, classification and clustering models were built, in order to identify the relationship between said trust and the identified predictors.

Data processing and analysis were performed using Python.

III Results and analysis

1. Interpersonal trust. Classification problem.

Data classification is the process of analyzing structured or unstructured data and organizing it into categories based on file type, contents, and other metadata (Bowles, 2015).

The most common machine learning methods for classification are logistic regression, the naive Bayes classifier, support vector machines, k-nearest neighbors, and neural networks (Horwood, 1994), (MacKay, 2005).

1.1. Classification problem for interpersonal trust and individual characteristics.

To solve the classification problem for interpersonal trust and individual characteristics, we built a machine learning model. In this model, eight individual characteristics were used as predictors: "Sex", "Age", "Education", "Satisfaction_with_life", "Employment_status", "Satisfaction_with_financial_situation_of_household", "Marriage", "Religion".

In the original data set, some respondents declined to answer some questions, which resulted in missing data. After excluding such cases, 65,039 responses remained in the data set.

For each classification problem, the data set was divided into training (80%) and test (20%) sets. In this model, the training sample contains data on 52,031 respondents and the test sample on 13,008 respondents.
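The 80/20 split can be illustrated with a short sketch. The function name `train_test_split_indices`, the random seed, and the rounding rule for the test size are assumptions; the paper does not state how the split was implemented.

```python
import random


def train_test_split_indices(n_rows, test_share=0.2, seed=42):
    """Shuffle row indices and split them into train and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # reproducible shuffle
    n_test = round(n_rows * test_share)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(65039)
print(len(train_idx), len(test_idx))  # 52031 13008
```

With 65,039 rows this reproduces the sample sizes reported above.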

We used 5 machine learning methods for modeling and the resulting models were compared in terms of accuracy.

Accuracy is one of the metrics for evaluating classification models: it is the share of observations for which the predicted class matches the actual class. The accuracy of the model is calculated as follows:

$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$

For binary classification, accuracy can also be calculated in terms of positives and negatives as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here TP – True Positive (true positive is an outcome where the model correctly predicts the positive class)

TN – True Negative (true negative is an outcome where the model correctly predicts the negative class)

FP – False Positive (false positive is an outcome where the model incorrectly predicts the positive class)

FN – False Negative (false negative is an outcome where the model incorrectly predicts the negative class.)
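As a worked illustration of these definitions, the sketch below counts TP, TN, FP, and FN for a small set of hypothetical labels and computes accuracy from them. The function names and the sample data are assumptions made for the example.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical labels: 1 = trusts, 0 = does not trust.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn, accuracy(tp, tn, fp, fn))  # 2 4 1 1 0.75
```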

Since the type of error (false positive or false negative) is not important for our purposes, we restrict ourselves to accuracy as the performance metric.

Note that all the methods used to build the classification model gave close estimates of accuracy (77%-78.5%). The neural network classifier showed the best accuracy (78.5%) on the test set (see Table 1).

When applying the logistic regression model, the significance of the coefficients of the variables was tested (Table 2). We used the p-values of the regression coefficients to test the null hypothesis that each coefficient is zero. All p-values were lower than the threshold of 0.1, so the null hypothesis is rejected for every coefficient: all exogenous variables are statistically significant predictors of the endogenous variable, which here is interpersonal trust.

Table 2. P-values (partial): 0.0670, 0.000, 0.000
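For reference, the p-value of a single logistic regression coefficient is commonly obtained from a Wald test, in which the estimate divided by its standard error is compared with the standard normal distribution. The sketch below assumes this test; the paper does not state which test was used, and the feature names, coefficients, and standard errors are hypothetical.

```python
import math


def wald_p_value(coef: float, std_err: float) -> float:
    """Two-sided p-value for H0: coefficient = 0 (normal approximation)."""
    z = coef / std_err
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


ALPHA = 0.1  # significance threshold used in the text

# Hypothetical (coefficient, standard error) pairs, not taken from the paper.
estimates = {"feature_a": (0.012, 0.002), "feature_b": (0.030, 0.025)}

for name, (coef, se) in estimates.items():
    p = wald_p_value(coef, se)
    print(f"{name}: p = {p:.4f}, significant at 0.1: {p < ALPHA}")
```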

1.2. Classification problem for interpersonal trust and mixed composition of predictors

In this task, individual characteristics and characteristics of society and institutions were used as predictors of interpersonal trust. These are: "Sex", "Age", "Education", "Satisfaction_with_life", "Employment_status", "Satisfaction_with_financial_situation_of_household", "Marriage", "Religion" and "Corruption", "Migration", "Security", "Democracy". After excluding missing data points, 13,608 responses remained to build this model.

Calculations have shown that all exogenous variables are significant in terms of influence on the endogenous variable, since their p-values are close to zero (see Table 3). The best accuracy estimate (80%) was shown by the Support Vector Machines classifier (see Table 4).

The ROC curve plots the share of correctly classified positive examples (true positive rate) against the share of negative examples incorrectly classified as positive (false positive rate) as the decision threshold of the model varies. A quantitative summary of the ROC curve is the Area Under the Curve (AUC). This estimate can be obtained directly by calculating the area of the polygon bounded on the right and at the bottom by the coordinate axes and on the top left by the experimentally obtained points. One can calculate the AUC, for example, using the numerical trapezoidal method, where $x_i$ and $y_i$ denote the false positive rate and the true positive rate at the $i$-th point of the curve:

$$\mathrm{AUC} = \int_0^1 f(x)\,dx \approx \sum_{i} \frac{(x_{i+1} - x_i)(y_i + y_{i+1})}{2}$$
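The trapezoidal estimate of the AUC can be sketched as follows. The functions `roc_points` and `auc_trapezoid` and the sample scores are assumptions for illustration; tied scores are processed together so that each distinct threshold contributes one point.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low; return (fpr, tpr) lists."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all examples that share the same score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    return sum(
        (fpr[i + 1] - fpr[i]) * (tpr[i] + tpr[i + 1]) / 2
        for i in range(len(fpr) - 1)
    )


# Hypothetical true classes and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

fpr, tpr = roc_points(y_true, scores)
auc = auc_trapezoid(fpr, tpr)
print(round(auc, 4))  # 0.8889
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, which matches the computed area of 8/9.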

The ROC curve of the binary logistic regression model we obtained is shown in the figure; the AUC is 0.72.

Note that all methods show a higher accuracy for models with a mixed composition of predictors than for models with only individual features.

2. Institutional trust. Classification problem for Government Trust and Personality

2.1. Classification task with a set of individual characteristics of the respondents.

Trust in government is one of the most important indicators of institutional trust. In this section, we used the same data set of individual characteristics as for the interpersonal trust models. The sample includes 63,360 respondents.

We built several machine learning models with this feature set. Let’s discuss the first one, namely a Logistic Regression model.

The p-value of the "Sex" variable turned out to be higher than 0.1 (Table 5), so we excluded it from the predictors of institutional trust, since it shows no statistically significant effect on trust in the government.

Table 5. P-values (partial): Sex 0.2224; Age 0.000

The accuracy of the models built by different methods is very low (Table 6), which casts doubt on the suitability of these models (Idris, 2016).

For binary logistic regression, the default threshold is 0.5. In many problems, a much better result may be obtained by adjusting the threshold. We conducted such an analysis and found that the logistic regression model shows the best accuracy at a threshold of 0.47 (Fig. 2).
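The threshold search can be sketched as a simple grid search over candidate cut-offs. The probabilities, labels, and candidate grid below are hypothetical, and the function names are assumptions. In practice the threshold is usually chosen on a validation set rather than on the test set, so that the reported accuracy is not overstated.

```python
def accuracy_at(y_true, probs, threshold):
    """Accuracy when predicting class 1 for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def best_threshold(y_true, probs, candidates):
    """Return the first candidate threshold with the highest accuracy."""
    return max(candidates, key=lambda t: accuracy_at(y_true, probs, t))


# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 1, 0]
probs = [0.20, 0.45, 0.48, 0.60, 0.49, 0.30]

candidates = [i / 100 for i in range(30, 71)]  # 0.30, 0.31, ..., 0.70
best = best_threshold(y_true, probs, candidates)

print(accuracy_at(y_true, probs, 0.5))   # accuracy at the default threshold
print(best, accuracy_at(y_true, probs, best))
```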

With the characteristics of society and institutions as predictors, all methods give higher accuracy than the models of the previous section (Table 8), although this level of accuracy is still insufficient.

2.3. Classification Problem for Institutional Trust and Mixed Composition of Predictors

In this section, we examined the relationship of institutional trust with individual characteristics and with indicators associated with society and institutions. The data set consists of 13,556 responses.

We excluded the following variables: “Sex”, “Employment_status”, and “Ethnic_group”, since the p-values of these indicators turned out to be higher than 0.1 (Table 9).

Table 9 (continued). P-values of the coefficients in the logistic regression model for institutional trust and mixed composition of predictors (characteristics of society and institutions).

Variable                                             P-value
Corruption                                           0.0011
Migration                                            0.000
Security                                             0.000
Democratically                                       0.0036
Ethnic group_Arabic, Central Asian                   1.000
Ethnic group_Black                                   1.000
Ethnic group_East Asian Chinese, Japanese, etc       1.000
Ethnic group_South Asian Indian, Pakistani, etc      1.000
Ethnic group_White                                   1.000

This set of predictors provides a significant increase in the accuracy of the models for all methods (Table 10).

The ROC curve of the binary logistic regression model is shown in Figure 3. The AUC is 0.772.

Fig. 3. ROC curve for a logistic regression model for interpersonal trust and mixed composition of predictors

3. Institutional trust. Clustering problem.

Cluster analysis partitions objects into groups (clusters) such that the objects within a cluster are similar to each other but differ from the objects in other clusters. In our study, we applied this method to identify differences in the characteristics of respondents who belong to different clusters according to the criterion of trust in the government.

Methods such as the "Elbow Method" or the "Silhouette Method" can be used to determine the number of clusters (Rousseeuw, 1987). The Elbow method consists of plotting the relationship between the number of clusters and the within-cluster sum of squares (WCSS) and then selecting the number of clusters at which the change in WCSS begins to level out (Figure 4).
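The WCSS values behind an elbow plot can be computed as in the sketch below, which implements a basic K-means (Lloyd's algorithm) in plain Python. The deterministic farthest-point initialization and the toy data are assumptions chosen to keep the example reproducible; the paper does not describe the initialization it used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the farthest point."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=100):
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(c) / len(members) for c in zip(*members))
                if members else centroids[j]
            )
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids, assign(points, centroids)


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data with two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
wcss_curve = {k: wcss(points, *kmeans(points, k)) for k in (1, 2, 3)}
print(wcss_curve)
```

For this toy data the WCSS drops sharply from one cluster to two and only slightly afterwards, which is the "elbow" the method looks for.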

As can be seen in Figure 4, the candidate points are 2, 4, and 7. To refine the result, we apply the silhouette method.

The silhouette value represents a measure of how similar a data point is to its own cluster when compared to all other clusters (Figure 5).
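For a single point, the silhouette value is $s = (b - a) / \max(a, b)$, where $a$ is the mean distance to the other points of its own cluster and $b$ is the smallest mean distance to the points of any other cluster. The sketch below computes it directly from this definition; the function name, the toy data, and the convention of assigning 0 to points in single-element clusters are assumptions, and at least two clusters are required.

```python
import math


def silhouette_values(points, labels):
    """Silhouette value of every point (requires at least two clusters)."""
    clusters = sorted(set(labels))
    values = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # convention for single-element clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / len(own)
        # b: smallest mean distance to the members of another cluster
        b = min(
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        values.append((b - a) / max(a, b))
    return values


def mean(xs):
    return sum(xs) / len(xs)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_values(points, [0, 0, 1, 1])  # natural grouping
bad = silhouette_values(points, [0, 1, 0, 1])   # deliberately poor grouping
print(round(mean(good), 3), round(mean(bad), 3))
```

The natural grouping gives a mean silhouette close to 1, while the poor grouping gives a negative value, which is why the number of clusters with the highest mean silhouette is preferred.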

Fig 5. Graphic implementation of the Silhouette method.

To select the optimal number of clusters using this method, one chooses the number of clusters with the maximum value of this indicator. As the figure shows, the optimal number of clusters is 2.

We used the K-means clustering algorithm to partition the data into clusters. Initially, we included the full set of factors as features: the individual characteristics and the characteristics of society and the quality of institutions. Then we excluded factors with weak variability, and only eight factors remained: “Trust_the_government”, “Employment_status”, “Marriage”, “Corruption”, “Religion”, “Migration”, “Ethnic group_East Asian Chinese, Japanese, etc”, “Ethnic group_White”.

Cluster centroids are presented in Table 11.

It is important to take a closer look at the differences in the average values of the factors. The first cluster includes respondents with low trust in the government. They are of the White ethnic group and are more religious. The respondents with higher confidence in the government belong to the “East Asian Chinese, Japanese, etc” ethnic group and are less religious than the respondents in the first cluster. For the rest of the indicators, no differences of comparable scale between the cluster means are observed.

IV Conclusions

The results of modeling can be summarized in the following conclusions.

Interpersonal trust. Classification models make it possible to recognize the class (a level of trust) to which an object belongs according to a range of factors (predictors). We defined 2 classes in accordance with the responses: people who trust strangers and people who do not.

The classification problem was solved using 2 sets of predictors: individual characteristics ("Sex", "Age", "Education", "Satisfaction_with_life", "Employment_status", "Satisfaction_with_financial_situation_of_household", "Marriage", "Religion") and the mixed composition, which includes societal characteristics ("Corruption", "Migration", "Security", "Democracy") in addition to the individual ones. For both sets of predictors, all 5 machine learning methods gave close and sufficiently high estimates of model accuracy. However, the mixed composition increased the accuracy of classification from 77%-78.5% (individual set models) to 78.3%-80% (mixed composition models).

Trust in government. Trust in government is one of the most important indicators of institutional trust. The classification problem was solved using 3 sets of predictors: individual characteristics, societal indicators, and the mixed composition. All the predictors in these sets were the same as in interpersonal trust models.

As expected, the first set did not lead to satisfactory models: all the machine learning methods gave very low accuracy (about 60%). Therefore, the assumption that institutional trust cannot be explained at the individual level alone was confirmed for the case of trust in government.

However, classification models with only societal characteristics did not provide satisfactory results either. Although this set of predictors increased the accuracy of the models (to 68%), it did not reach an acceptable value.

Finally, only the mixed composition models showed a higher estimate of accuracy. The best results (76.7%) were provided by the Support Vector Machines and K-Nearest Neighbors methods. This value is high enough to regard the modeling results as quite satisfactory.

We used the K-means clustering algorithm to partition the data into clusters. Cluster analysis of eight factors (“Trust_the_government”, “Employment_status”, “Marriage”, “Corruption”, “Religion”, “Migration”, “Ethnic group_East Asian Chinese, Japanese, etc.”, “Ethnic group_White”) divided the set of respondents into 2 clusters. It is important to emphasize the differences between the clusters in the average values of the factors. First of all, there is a significant gap between the clusters in the factor “Trust in government”.

The first cluster includes respondents with low trust in the government. They are of the White ethnic group and are more religious. The second cluster includes respondents with high confidence in the government; they belong to the “East Asian Chinese, Japanese, etc.” ethnic group and are less religious than the respondents in the first cluster. For the rest of the indicators, no differences of comparable scale between the cluster means are observed.

The results of this research can to a certain extent serve as arguments in favor of the multilevel approach to social trust determinants, taking into account the essential role of individual and societal factors for both interpersonal and institutional trust.