Application of SAS Text Miner for the analysis of citizens' appeals in the system of social protection and social security⋆

Józef Korbicz1, Oleksii Sholokhov2,∗, Roman Koval3, Oleksii Zarudnyi3

1 University of Zielona Góra, 9 Licealna Street, Zielona Góra, 65-417, Republic of Poland
2 Taras Shevchenko National University of Kyiv, 64/13 Volodymyrska Street, Kyiv, 01601, Ukraine
3 Institute of Telecommunications and Global Information Space of the National Academy of Sciences of Ukraine, 13 Chokolovsky Blvd., Kyiv, 03186, Ukraine

Abstract
Issues of social protection and social security have always been among the most urgent for all social strata without exception. In wartime, this sphere has acquired special importance: not only the well-being of citizens and the balanced development of society, but also national security depends on the effectiveness of state policy on social protection and social security. During the war, spending on social protection and social security has increased significantly and will continue to increase despite limited budgetary funding. Therefore, special attention must be paid to the targeting of funds for social protection and social security, as well as to control over the targeted use of state assistance. In wartime, conducting sociological research, surveys, and in-person reception of citizens becomes much more difficult. Because a significant share of the population uses social networks, the digital platforms of state institutions and organizations, and similar services, research on the online environment is a promising direction for working with citizens' appeals. With information from Internet sources, it is possible to investigate problems that are significant for different social groups and to analyze the moods and expectations of the population.
At present, however, there are practically no software products in the social security system designed to analyze the textual information presented in citizens' appeals. The work proposes a method of building an analytical model for the study of social protection and social security problems that require special attention from the state, using tools for analyzing textual information from Internet sources and building classification models.

Keywords
Text clustering, linguistic rules, intelligent data analysis, social protection and social security, information technology

8th International Scientific and Practical Conference Applied Information Systems and Technologies in the Digital Society AISTDS'2024, October 1, 2024, Kyiv, Ukraine
* Corresponding author.
J.Korbicz@issi.uz.zgora.pl (J. Korbicz); gyroalex@knu.ua (O. Sholokhov); roman.koval.science@gmail.com (R. Koval); oleksii.zarudnyi@gmail.com (O. Zarudnyi)
2338-9598-800 (J. Korbicz); 0000-0002-8676-3724 (O. Sholokhov); 0009-0003-3821-3378 (R. Koval); 0009-0008-7462-3899 (O. Zarudnyi)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

1. Problems of automation and processing of citizens' appeals in the social sphere
Information and analytical activity, amid the deepening digitalization of society, is becoming an increasingly important component of the system of social protection and social security, which in turn, as noted by domestic and foreign experts [14-16], requires constant modernization and the introduction of modern models, methods and information technologies. The introduction of the "Unified Information System of the Social Sphere" [17] was a new step towards the end-to-end digitalization of the pension system and the social protection of the population. The purpose of introducing the System is to "ensure integral automation of processes in the social sphere by optimizing and developing electronic information interaction of the subjects of the Unified System aimed at ensuring transparency of the social sphere, digitalization of the social support market and increasing the level of its availability for persons who need it" [17]. The development of the Unified Information System of the Social Sphere [1] involves the creation of a unified information and reference environment for recipients of social support. An important place is occupied by the subsystem for working with citizens' appeals: in January-September 2024 alone, the Pension Fund of Ukraine registered 504,856 appeals from citizens, of which 229,537 (or 45.5 percent) were electronic [2]. Therefore, developing methods, models and information technologies for analyzing textual information from citizens' electronic appeals to institutions of social protection and social security and from Internet sources, and for identifying the issues that are most important to those who need state support, is urgent and of practical importance [18-20].

2. Statement of the research problem
The paper proposes a method of using text analytics tools to build an analytical model for the classification of text information in the task of analyzing citizens' appeals to the Pension Fund of Ukraine.

3. Methods and results
The study considered the practical task of determining the social protection and social security needs of residents of different regions of Ukraine and of refugees. SAS Text Miner tools [21-23] were used to analyze text information. The input data are electronic appeals from citizens received through the web portal of electronic services of the Pension Fund of Ukraine and the state institution "Government Contact Center" [2].
Internet publications that differ in subject matter and audience, both state and non-state, were also examined; 162 texts were selected from them (the names of the sources and links to them are presented in Table 1).

Table 1
List of Internet sources from which information was used for analysis
N | Name of the source | Resource address | Number of texts
1 | UkrInform | https://www.ukrinform.ua/rubric-society | 50
2 | Public. News | https://suspilne.media | 25
3 | Website of the international scientific publication "Financial and credit activity: problems of theory and practice" | https://fkd.net.ua | 7
4 | The newspaper "Government Courier", the official printed publication of the Cabinet of Ministers of Ukraine | https://ukurier.gov.ua/uk/articles | 30
5 | The official website of the Kyiv Regional Council of Professional Unions | http://korps.com.ua | 5
6 | Official website of the National Bank of Ukraine | https://knpf.bank.gov.ua | 10
7 | The official site of the magazine "Forbes Ukraine" | https://forbes.ua | 15
8 | Website of the electronic publication "Sudovo-yuridychna Gazeta" | https://sud.ua | 20

Based on the analysis of texts related to issues of social protection and social security posted on the specified Internet resources and in electronic appeals, six clusters were obtained.

The first cluster includes texts on issues related to the pension reform. The most characteristic words and phrases for this cluster were: "reform", "insurance payments", "insurance experience", "mandatory pension savings".

The second cluster includes words and phrases describing the accrual and payment of pensions and social benefits by the Pension Fund of Ukraine: "timely payment of pensions", "voluntary contributions to pension insurance", "minimum pension", "indexation of pensions", "increase of pensions", "housing subsidy", "financing of current payments", "recalculation of pensions for working pensioners".
The third cluster summarizes the problems of social protection of internally displaced persons. The most characteristic words and phrases are "IDPs", "identification", "liberated territories", "payments to displaced persons", "inhabitants of the occupied Crimea", "UN World Food Program", "temporarily uncontrolled territories".

The fourth cluster includes words and phrases describing problems related to losses due to military conflict: "military serviceman", "policeman", "combat zone", "missing person", "loss of breadwinner", "family members of the deceased".

For the fifth cluster, the relevant issues are those of social protection and social security of refugees, in particular "pension abroad", "work outside Ukraine", "proportional calculation of insurance experience", "insurance experience received in other countries".

The sixth cluster summarizes issues related to the victims of the accident at the Chernobyl NPP: "accident", "ChNPP", "Chernobyl".

Based on the preliminary analysis of the texts of the appeals, a corpus of texts was formed, a fragment of which is given in Table 2.

Table 2
Term frequency matrix for the corpus of texts formed from citizens' electronic appeals (columns d1-d10: number of mentions of the term in the document)
Marking | Term | d1 | d2 | d3 | d4 | d5 | d6 | d7 | d8 | d9 | d10
t1 | court | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0
t2 | allowances | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0
t3 | military | 0 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0
t4 | monetary support | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0
t5 | pension | 0 | 1 | 0 | 1 | 2 | 2 | 1 | 0 | 1 | 1
t6 | law enforcement officers | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
t7 | the former | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
t8 | accident | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1
t9 | Chernobyl Nuclear Power Plant | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1
t10 | Ukraine | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0
t11 | received | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0
t12 | service | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0

To reduce the dimensionality and sparsity of the frequency matrix of the corpus of texts, the singular value decomposition (SVD) method was used [3-5].
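The sparsity that motivates the decomposition can be checked directly on the Table 2 fragment. The sketch below is illustrative only and is not part of the SAS Text Miner workflow: it copies the counts from Table 2 into a plain Python list (the names `MATRIX`, `sparsity` and `document_frequency` are assumptions introduced here) and reports the share of zero cells and the number of documents that mention each term.

```python
# Term-document counts copied from Table 2 (rows t1..t12, columns d1..d10).
TERMS = [
    "court", "allowances", "military", "monetary support", "pension",
    "law enforcement officers", "the former", "accident",
    "Chernobyl Nuclear Power Plant", "Ukraine", "received", "service",
]

MATRIX = [
    [1, 0, 0, 0, 0, 0, 1, 2, 0, 0],  # t1
    [1, 0, 1, 1, 0, 0, 1, 0, 2, 0],  # t2
    [0, 1, 0, 0, 2, 1, 0, 0, 0, 0],  # t3
    [0, 1, 0, 0, 1, 0, 0, 0, 2, 0],  # t4
    [0, 1, 0, 1, 2, 2, 1, 0, 1, 1],  # t5
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # t6
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0],  # t7
    [0, 0, 1, 1, 0, 0, 0, 0, 0, 1],  # t8
    [0, 0, 1, 2, 0, 0, 0, 0, 0, 1],  # t9
    [0, 0, 1, 0, 0, 1, 0, 0, 0, 0],  # t10
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],  # t11
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],  # t12
]


def sparsity(matrix):
    """Return the fraction of cells that are zero."""
    cells = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / cells


def document_frequency(matrix):
    """Return, for each term (row), the number of documents that mention it."""
    return [sum(1 for value in row if value > 0) for row in matrix]


if __name__ == "__main__":
    print(f"share of zero cells: {sparsity(MATRIX):.2%}")
    for term, df in zip(TERMS, document_frequency(MATRIX)):
        print(f"{term}: mentioned in {df} of {len(MATRIX[0])} documents")
```

Even in this 12 x 10 fragment more than two thirds of the cells are zero, which is the situation a low-rank approximation is meant to address.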
Documents usually use a fairly small set of terms that describe a certain subject area. Therefore, if exactly the first k diagonal elements of the diagonal matrix of singular values S are kept and the rest are set to zero, the SVD method gives an optimal rank-k approximation of the frequency matrix. The values in S are ordered, namely $s_1 \ge s_2 \ge \dots \ge s_k$, so keeping the first two values means setting all the others to zero. On the basis of the obtained matrix S, it is possible to calculate the percentage that each dimension, described by the corresponding singular value, contributes to the explanation of the data (Table 3). The value in the column "Percentage contribution to the explanation of data variability" is calculated as the square of the singular value divided by the sum of the squares of all singular values, multiplied by 100%. As can be seen from Table 3, if only the first two dimensions are kept, a total of 66.16% of the data variability is explained.

Table 3
Analysis of the obtained singular values
Dimension number | Singular value | Square of the singular value | Percentage contribution to the explanation of data variability | Cumulative percentage
1 | 5.1435 | 26.45 | 45.61 | 45.61
2 | 3.4526 | 11.92 | 20.55 | 66.16
3 | 2.7696 | 7.67 | 13.23 | 79.38
4 | 2.3736 | 5.63 | 9.71 | 89.11
5 | 1.7711 | 3.13 | 5.41 | 94.51
6 | 1.2251 | 1.5008 | 2.58 | 97.09
7 | 1.029 | 1.0588 | 1.82 | 98.92
8 | 0.684 | 0.4678 | 0.81 | 99.73
9 | 0.371 | 0.1376 | 0.23 | 99.96
10 | 0.1352 | 0.0182 | 0.03 | 100

In this case, all documents can be placed in two-dimensional space, and the clusters they form can be determined according to the degree of similarity and belonging to a certain topic (Fig. 1).
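The percentages in the table of singular values can be re-derived from the singular values alone. The following sketch, a minimal illustration rather than the SAS Text Miner computation itself, applies the rule stated above (squared singular value divided by the sum of squares, times 100) to the ten reported values; the function names `explained_variability` and `smallest_k` are assumptions introduced for this example.

```python
# Singular values as reported in the table of singular values.
SINGULAR_VALUES = [5.1435, 3.4526, 2.7696, 2.3736, 1.7711,
                   1.2251, 1.029, 0.684, 0.371, 0.1352]


def explained_variability(singular_values):
    """Return (shares, cumulative) in percent for each dimension.

    share_j = 100 * s_j**2 / sum_i(s_i**2)
    """
    squares = [s * s for s in singular_values]
    total = sum(squares)
    shares = [100.0 * sq / total for sq in squares]
    cumulative = []
    running = 0.0
    for share in shares:
        running += share
        cumulative.append(running)
    return shares, cumulative


def smallest_k(cumulative, threshold):
    """Return the smallest number of dimensions whose cumulative share
    reaches the threshold (in percent)."""
    for k, value in enumerate(cumulative, start=1):
        if value >= threshold:
            return k
    return len(cumulative)


if __name__ == "__main__":
    shares, cumulative = explained_variability(SINGULAR_VALUES)
    for j, (share, cum) in enumerate(zip(shares, cumulative), start=1):
        print(f"dimension {j}: {share:6.2f}%  cumulative {cum:6.2f}%")
    print("dimensions needed for 60%:", smallest_k(cumulative, 60.0))
```

With these inputs the first two dimensions account for about 66.16% of the variability, in line with the text. The squares of the singular values sum to approximately 58, which equals the sum of squared counts in the Table 2 fragment, as expected when the singular values come from a decomposition of that matrix.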
Figure 1: Location of terms in two-dimensional space.

As can be seen from Fig. 1, the first dimension explains 45.61% of the data variability and the second dimension explains 20.55%. As a result, three thematic clusters were formed, grouping documents by the similarity of their use of terms [6-9].

The SAS Text Miner system was used in this study. When using the SAS Text Miner software, a technological project is built in which the following steps are performed:
1. Loading data.
2. Text parsing.
3. Text filtering.
4. Text clustering.

The technological process of analyzing the corpus of texts for the purpose of clustering is presented in Fig. 2.

Figure 2: Technological process of text corpus analysis in the SAS Text Miner system.

The constructed rules for the corresponding clusters are generated in the form of the following program code:

F_TextCluster_cluster_ =1 ::
  (OR , "reform" , "insurance" , (AND, (OR, "payments", "seniority")) , "accumulation" , (AND, (OR, "pensionable", "mandatory")))
F_TextCluster_cluster_ =2 ::
  (OR , "voluntary" , (AND, (OR, "payments" , "pension")) , "timely" , (AND, (OR, "contributions" , "pension" , "insurance", "recalculation")) , "pension" , (AND, (OR, "minimum" , "index" , "increment")) , "subsidy" , (AND, (OR, "residential")) , "current" , (AND, (OR, "payment" , "funding")))
F_TextCluster_cluster_ =3 ::
  (OR , "identification" , (AND, (OR, "refugee" , "displaced person" , "payments")) , "resident" , (AND, (OR, "Crimea" , "uncontrolled" , "territory" , "temporary")) , "UN" , (AND, (OR, "global" , "food" , "program")))
F_TextCluster_cluster_ =4 ::
  (OR , (AND, (OR, "serviceman" , "military", "policeman")) , "zone" , (AND, (OR, "combat" , "actions")) , (AND, (OR, "missing" , "missing")) , "deceased" , (AND, (OR, "loss" , "breadwinner" , "members" , "family")))
F_TextCluster_cluster_ =5 ::
  (OR , "pension" , (AND, (OR, "border", "borders", "others", "countries")) , "experience" , (AND, (OR, "calculation" , "insurance" , "proportional")))
F_TextCluster_cluster_ =6 ::
  (OR , "accident" , (AND, (OR, "CHAES" , "nuclear" , "power plant")) , "Chernobyl")

The statistical characteristics of the built classification model based on linguistic rules were calculated separately for the training and test data sets, split 70% for training and 30% for testing, i.e. 114 and 48 texts, respectively. The results are summarized in Table 3.

Table 3
Statistical characteristics of the classification model of the studied texts
Statistics | Training data set | Test data set
TP (true positive) | 30 | 11
TN (true negative) | 67 | 26
FP (false positive) | 10 | 6
FN (false negative) | 7 | 5
MISC, % (proportion of incorrectly classified values) | 15 | 23
Gini | 0.82 | 0.71
ROC | 0.79 | 0.67

The ROC curve for the text information classification model based on linguistic rules is presented in Fig. 3.

Figure 3: ROC curve for the built classification model based on linguistic rules (curves: model on the training set; model on the test set; 50/50 reference line).

The constructed linguistic rules were used to cluster news texts that were published on the Internet from September 2023 to September 2024. In total, about 10,000 texts on social protection and social security of Ukrainians were downloaded and processed.
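The MISC values in the table of statistical characteristics follow directly from the reported confusion counts. The sketch below recomputes them as (FP + FN) divided by the total number of texts; the function name `confusion_metrics` is an assumption, and the precision and recall it also returns are standard derived quantities that the paper does not report, shown here only for orientation.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive summary statistics from binary confusion counts."""
    total = tp + tn + fp + fn
    return {
        "total": total,
        # Share of incorrectly classified texts, in percent (MISC, %).
        "misclassification_pct": 100.0 * (fp + fn) / total,
        # Not reported in the paper; standard definitions.
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Counts taken from the table of statistical characteristics.
TRAINING = confusion_metrics(tp=30, tn=67, fp=10, fn=7)
TEST = confusion_metrics(tp=11, tn=26, fp=6, fn=5)

if __name__ == "__main__":
    for name, stats in (("training", TRAINING), ("test", TEST)):
        print(
            f"{name}: n={stats['total']}, "
            f"MISC={stats['misclassification_pct']:.1f}%, "
            f"precision={stats['precision']:.2f}, "
            f"recall={stats['recall']:.2f}"
        )
```

The counts sum to 114 and 48 texts, matching the 70/30 split, and the misclassification shares round to the reported 15% and 23%.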
After clustering the texts, the number of texts belonging to contributors from a certain region was calculated for each cluster. The obtained values were normalized according to formula (1) and expressed on a scale from 0 to 100:

$$\mathit{popularity}_i = \frac{n_i}{\max(n_i \mid \forall i)}, \quad (1)$$

where $\mathit{popularity}_i$ is the popularity of the texts of the corresponding cluster for the i-th region, $n_i$ is the number of texts for the region, and $\max(n_i \mid \forall i)$ is the maximum number of texts over all regions. The results of the calculations are presented in Table 4.

Table 4
Results of cluster analysis of textual information on issues of social protection and social security by regions of Ukraine
Popularity of the texts of the corresponding cluster. Columns: Name of the region; Cluster 1 (pension reform); Cluster 2 (accrual and payment of pensions and social benefits by the Pension Fund of Ukraine); Cluster 3 (problems of social protection of internally displaced persons); Cluster 4 (issues related to losses due to military conflict); Cluster 5 (issues of social protection and social security of refugees); Cluster 6 (issues related to victims of the accident at the Chernobyl nuclear power plant)
Vinnytsia region  94  65  24  72  79
Volyn region  87  57  20  100  63
the city of Kyiv  82  49  32  37  26  33
the city of Sevastopol  -  -  -  -  -  -
Dnipropetrovsk region  58  39  43  33  14
Donetsk region  27  32  59  37
Zhytomyr region  94  73  19  34  62  88
Transcarpathian region  67  45  29  40  75
Zaporizhzhia region  58  39  90  30
Ivano-Frankivsk region  87  66  24  63  72
Kyiv region  84  42  28  37  37  100
Kirovohrad region  92  88  32  73  46
Autonomous Republic of Crimea  -  1  1  -  -  -
Luhansk region  22  8
Lviv region  73  45  20  60  57
Mykolayiv region  76  70  64  47  18
Odesa region  32  24  27  13  13
Poltava region  75  63  32  73  42  77
Rivne region  100  64  17  81  100
Sumy region  92  100  52  43  30
Ternopil region  50  56  24  63
Kharkiv region  47  35  100  15  9
Kherson region  71  62  89
Khmelnytskyi region  87  55  28  78  73
Cherkasy region  87  47  30  50  55  74
Chernihiv region  81  58  24  50  29
Chernivtsi region  83  31  25  1  61

The results of the analysis presented in the table can be visualized using SAS Enterprise Guide 7.1 tools (Figs. 4-9).

Figure 4: Cluster 1 - popularity of texts on "Pension reform" by regions of Ukraine.
Figure 5: Cluster 2 - popularity of texts on the topic "Questions related to the pension fund in general" by regions of Ukraine.
Figure 6: Cluster 3 - popularity of texts on the topic "Problems related to IDPs" by regions of Ukraine.
Figure 7: Cluster 4 - popularity of texts on the topic "Issues related to the military and police" by regions of Ukraine.
Figure 8: Cluster 5 - popularity of texts on the topic "Questions regarding the payment of pensions abroad" by regions of Ukraine.
Figure 9: Cluster 6 - popularity of texts on the topic "Issues related to pensions for victims of the accident at the ChAES" by regions of Ukraine.

4. Declaration on Generative AI
The authors have not employed any Generative AI tools.

5. Conclusion
The proposed method of textual information analysis using text mining tools is designed for automated processing of large volumes of texts on a certain topic. Text analytics makes it possible to deepen knowledge of the subject area by using unstructured data. In this study, the problem of the dimensionality and sparsity of the frequency matrix of the corpus of texts is solved using a key result of linear algebra, the singular value decomposition (SVD). A frequency weighting operation was performed beforehand, which partially solved the problem of uneven term frequencies by making high-frequency terms less influential. This made it possible to obtain high-quality classification of textual information. Therefore, intelligent analysis of large volumes of textual data makes it possible to identify the most important problems that require a priority solution and to find out for which categories of the population they are most relevant.
The obtained results can be further used in planning the social expenditures of budgets at different levels and in actuarial calculation models. The proposed approach can improve the quality of forecasts in modern conditions, when complete information about the investigated process or phenomenon is unavailable or the information is distorted.

References
[1] Shapovalova T. The concept and content of social protection and social security of the population in modern Ukraine. Economic analysis. 2022. Volume 32. No. 3. P. 123-130. https://doi.org/10.35774/econa2022.03.123 (ukr)
[2] Gren T. I. Peculiarities of implementation of the policy of social protection of territories in war conditions. Academic notes of TNU named after V.I. Vernadskyi. Series: Public management and administration. 2022. Volume 33 (72). No. 6. P. 81-84. https://doi.org/10.32782/TNU-2663-6468/2022.6/13 (ukr)
[3] Expenditures on social assistance. URL: https://mof.gov.ua/uk/expenditures_on_social_assistance (ukr)
[4] Smush-Kulesha M., Fedorova A., Moysa B. Social rights in Ukraine during the war. Report on needs assessment. Council of Europe. 2022. 64 p. URL: https://rm.coe.int/needs-assessment-ua-2/1680a9b408 (ukr)
[5] On the approval of the Regulation on the Unified Information System of the Social Sphere. Resolution of the Cabinet of Ministers of Ukraine dated April 14, 2021 No. 404. URL: https://zakon.rada.gov.ua/laws/show/404-2021-п#Text (ukr)
[6] Report on appeals of citizens for 9 months of 2024. URL: https://www.pfu.gov.ua/2167929-zvit-pro-zvernennya-gromadyan-za-9-misyatsiv-2024-roku/ (ukr)
[7] Sharma S., Jain A. Role of sentiment analysis in social media security and analytics. WIREs Data Mining and Knowledge Discovery. Vol. 10, Issue 5. https://doi.org/10.1002/widm.1366
[8] Shkurko O. V. Types of linguistic text analysis: textbook. Dnipro: Alfred Nobel University, 2018. 119 p. (ukr)
[9] Perebijnis V. I. Statistical methods for linguists: textbook. Vinnytsia: Nova Kniga, 2013. 176 p. (ukr)
[10] Lande D. V. Elements of computer linguistics in legal informatics. Kyiv: NDIIP National Academy of Sciences of Ukraine, 2014. 168 p. (ukr)
[11] Find the information that matters using natural language processing (NLP). URL: https://www.sas.com/ru_ua/software/visual-text-analytics.html
[12] Survey of Text Mining I: Clustering, Classification, and Retrieval / Ed. by M. W. Berry. Springer, 2003. 261 p.
[13] Aggarwal C. C., Zhai C. Mining Text Data. Springer, 2012. 527 p.
[14] Text Cluster Node Results. URL: https://documentation.sas.com/?docsetId=tmref&docsetTarget=n1d7r58qug6sefn162cu6cqx0nq4.htm&docsetVersion=14.3&locale=en
[15] Emerging Technologies of Text Mining: Techniques and Applications / Ed. by H. A. Do Prado, E. Ferneda. Idea Group Reference, 2007. 358 p.
[16] Valls Martínez M. d. C., Santos-Jaén J. M., Amin F.-u., Martín-Cervantes P. A. Pensions, Aging and Social Security Research: Literature Review and Global Trends. Mathematics 2021, No. 9, 3258. https://doi.org/10.3390/math9243258
[17] Social Protection Systems. Ed. E. Schüring, M. Loewe. Elgar Publishing. 2021. 776 p. https://doi.org/10.4337/9781839109119
[18] Official website of the Ministry of Digital Transformation of Ukraine. URL: https://thedigital.gov.ua (ukr)
[19] On the approval of the Regulation on the Unified Information System of the Social Sphere. Resolution of the Cabinet of Ministers of Ukraine dated April 14, 2021 No. 404. URL: https://zakon.rada.gov.ua/laws/show/404-2021-п#Text (ukr)
[20] Gladun A. Ya., Rogushina Yu. V. Data mining: searching for knowledge in data: a tutorial. Kyiv: ADEF-Ukraine, 2016. 451 p. (ukr)
[21] Lytvyn V. V., Pasichnyk V. V., Nikolskyi Yu. V. Analysis of data and knowledge: textbook. Lviv: Magnolia 2006, 2017. 276 p. (ukr)
[22] Analysis and processing of data flows by means of computational intelligence: monograph / Ye. V. Bodyanskyi et al. Lviv: Lviv Polytechnic Publishing House, 2016. 235 p. (ukr)
[23] Text analytics using SAS Text Miner: course notes. NC: SAS Institute, 2014. 218 p.
[24] Getting Started with SAS® Text Miner 12.1. URL: https://support.sas.com/documentation/onlinedoc/txtminer/12.1/tmgs.pdf
[25] Matignon R. Data Mining Using SAS Enterprise Miner. URL: https://www.amazon.com/Data-Mining-Using-Enterprise-Miner/dp/0470149019
[26] Sharma S., Jain A. Role of sentiment analysis in social media security and analytics. WIREs Data Mining and Knowledge Discovery. Vol. 10, Issue 5. https://doi.org/10.1002/widm.1366
[27] Find the information that matters using natural language processing (NLP). URL: https://www.sas.com/ru_ua/software/visual-text-analytics.html