=Paper=
{{Paper
|id=Vol-2392/paper13
|storemode=property
|title=Experimental Study of Information Technology for Detecting the Electronic Mass Media PR-effect based on Statistical Analysis
|pdfUrl=https://ceur-ws.org/Vol-2392/paper13.pdf
|volume=Vol-2392
|authors=Oleg Hatyan,Myroslav Ryabyy,Andriy Fesenko,Vitaliy Kyschenko,Madina Bauyrzhan,Anton Petrov
|dblpUrl=https://dblp.org/rec/conf/coapsn/HatyanRFKBP19
}}
==Experimental Study of Information Technology for Detecting the Electronic Mass Media PR-effect based on Statistical Analysis==
Oleg Hatyan 1 [0000-0003-2754-6938], Myroslav Ryabyy 1 [0000-0002-9651-9135], Andriy Fesenko 2 [0000-0001-5154-5324], Vitaliy Kyschenko 3 [0000-0003-4281-7812], Madina Bauyrzhan 4 [0000-0002-8287-4283] and Anton Petrov 3 [0000-0003-3731-4276]

1 Interregional Academy of Personal Management, Kyiv, Ukraine; 2 Taras Shevchenko National University of Kyiv, Kyiv, Ukraine; 3 National Aviation University, Kyiv, Ukraine; 4 Satbayev University, Almaty, Kazakhstan

oleg.hatyan@gmail.com, m.o.ryabyy@gmail.com, aafesenko@gmail.com, vitaliy.kyschenko@ukr.net, madina890218@gmail.com, anton.a.petrov@gmail.com

Abstract. Today the dynamics and volume of the information flows created by Internet media content place a significant workload on an expert preparing an analytical conclusion on a certain subject, industry or general trend. In this paper, the limit value of the integrated index for detecting the PR-effect of electronic mass media is experimentally substantiated by comparative and statistical analysis of numerical sequences of linguistic statistics for two sets of messages: those intended a priori for purely "unbiased" reporting, and documents with vocabulary characteristic of information influence (direct action commercial texts).

Keywords: Numerical Sequence, Statistical Analysis, Limit Value, Information Influence, PR-effect Index.

===1 Introduction===

The dynamics and volume of the information flows created by Internet media content (news feeds, blogs, microblogs, audio and video streams, etc.) place a significant workload on an expert preparing an analytical conclusion on a certain subject, industry or general trend.
In addition, as shown by studies of the information space [1] (including our own [2]), information today is often presented in a way that goes beyond mere reporting. A significant share of it has features of emotivity and manipulation. In copywriting, articles with such properties are called "direct action commercial texts". Thus, biased information sources that use modern text construction and sequencing techniques (targeting the level of influence) can carry hidden PR-effect objectives aimed at the target audience. This undoubtedly increases the error rate in preparing and making decisions.

Automated detection of the "PR-effect", that is, an algorithm for the initial analysis of a given text and/or set of texts that pre-evaluates the message flow and decomposes it into documents with vocabulary characteristic of the "PR-effect" and all others, is therefore, in our opinion, an essential tool for analytical work and information support of decision-making. We proposed such a method in [2]. The idea is to calculate the "PR-effect" index of the thematic information flow (TIF) within a certain time span as the mathematical expectation of the daily TIF "PR-effect" index estimates:

<math>PRI(M_{TIP\Delta t}) = M\left[PRI(M_{TIP\Delta t})_i\right], \quad (1)</math>

where the TIF "PR-effect" index for the i-th day is one third of the ratio of the number of messages with the features of "non-eventivity", "emotionality" and "manipulativity" to the total number of messages of the thematic vector MTIP:

<math>PRI(M_{TIP\Delta t})_i = \frac{\left(|M_{TIP\Delta t_i}| - P_{EV}(M_{TIP\Delta t_i})\right) + P_{EM}(M_{TIP\Delta t_i}) + P_{MA}(M_{TIP\Delta t_i})}{3\,|M_{TIP\Delta t_i}|}, \quad (2)</math>

<math>P_r(M_{TIP\Delta t_i}) = \sum_{j=1}^{|M_{TIP\Delta t_i}|} P_r(m_j), \quad (3)</math>

where r = {EV, EM, MA} is the index of the characteristic with linguistic features, respectively {"eventivity", "emotionality", "manipulativity"}, and |MTIP∆ti| is the cardinality of the set of TIF messages for the i-th day.
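The index defined by expressions (1)-(3) can be illustrated with a minimal Python sketch. It assumes that each P_r(m) is a binary indicator (1 if the message has the feature, 0 otherwise) and that the mathematical expectation in (1) is estimated by the arithmetic mean over days; the names `Message`, `daily_pri` and `flow_pri` are hypothetical and do not come from the SISA tools.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Message:
    """One message with its three linguistic features (assumed binary)."""
    eventive: bool       # P_EV(m)
    emotional: bool      # P_EM(m)
    manipulative: bool   # P_MA(m)


def daily_pri(messages: Sequence[Message]) -> float:
    """Daily PR-effect index, expression (2): the counts come from expression (3)."""
    n = len(messages)
    if n == 0:
        raise ValueError("a daily flow must contain at least one message")
    p_ev = sum(m.eventive for m in messages)
    p_em = sum(m.emotional for m in messages)
    p_ma = sum(m.manipulative for m in messages)
    # (n - p_ev) is the number of "non-eventive" messages.
    return ((n - p_ev) + p_em + p_ma) / (3 * n)


def flow_pri(days: Sequence[Sequence[Message]]) -> float:
    """Flow index, expression (1), estimated as the mean of the daily indices."""
    return mean(daily_pri(day) for day in days)
```

For example, a day with one purely eventive message and one non-eventive, emotional and manipulative message yields a daily index of 0.5, since each of the three features contributes one message out of two.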
Based on the construction, the value of PRI(MTIP∆t) is always non-negative and falls within the range between 0 and 1. The threshold value α is set as the criterion for deciding whether the "PR-effect" is present in the TIF. Thus, if

<math>PRI(M_{TIP\Delta t}) \geq \alpha, \quad (4)</math>

then the TIF represented by the thematic vector MTIP∆t possesses features of the "PR-effect". Consequently, the given value α is the basis for automated decision-making on detecting the PR-effect, and the experimental substantiation of its estimate is highly relevant.

===2 Analysis of recent research and publications===

Nowadays, research and evaluation of information flows (IFs), including those created by electronic media, is mostly carried out within content-analytical research. Criteria for assessing the quality of quantitative content analyses, based on the requirements of national and foreign methodologists, are suggested by Ivanov O.V. [3]. Fedorenko R.M. considers the main objects, threats and negative factors of ensuring information security in the military domain, characterizes the existing level of automation of information space monitoring, and suggests the use of content monitoring methods to increase the efficiency of providing information security of Ukraine [4].

Among the approaches to assessing IF texts created by electronic media, the methodology for assessing the quality of messages' resonance should be noted. It is the original development of one of the leading national consulting companies, NOKS FISHES [5], and can be divided into the following 4 steps.

Step 1.
Encoding the text array and assigning resonance characteristics to the subject of research.

Text characteristics:
- S: size
- E: genre
- R: distribution (advertising or not)

Reference characteristics:
- I: subject/object
- Tmc: indirect reference
- Tmk: the nature of the speaker's reference
- Tme: the nature of independent experts' reference
- Ta: author's reference tone value
- Ts: event reference tone value
- To: total reference tone value
- C: IF relevance

Mass media characteristics:
- O: printing
- V: type
- P: periodicity
- G: geography
- M: marginality

Step 2. Intermediate characteristics of reference quality.

Saturation reflects the proportion of influence on the quality of the company's information drive (infodrive) of its subjective representation in the media and of the representation of the company's speakers in the infodrive:

Saturation = (I + Tmk + Tme) * Tmc = (subjectivity + total speakers' and independent experts' reference characteristics) * indirect reference.

K(tl), the logical presence coefficient, reflects the proportion (%) of the company's infodrive logical presence in the text array. It includes the tone value and subjective saturation of reference for both the given company and the other companies of the given trend:

K(tl) = (logical presence coeff. + emotional presence coeff.) / 2 = [K(l) + K(t)] / 2, where
- K(l): subjective representation coeff. = D of the given object / Σ (D of all the objects in the publication)
- K(t): emotional representation coeff. = |To| of the given object / Σ (|To| of all the objects in the publication)
- Pg: potential influence of mass media rate

Step 3. Integrated characteristics of the reference quality.

W: media's loyalty to IF and the demand for media's IF. It includes the qualitative characteristics of the media's perception of the company's infodrive, or shows the editorial policy vector for the company's infodrive:

W = image * abstract * distribution type * author's tone value * (text array size + publication type) * reference type * saturation * logical presence coeff.
Y: event saturation of the client's media area. It includes qualitative characteristics of the text array's event filling by the company's infodrive:

Y = image * abstract * eventivity * event tone value * (text array size + publication type) * reference type * saturation * logical presence coeff.

Z: probable influence of the informational drive (ID) on the media's audience. It determines the probability of reporting the company's information to the general audience:

Z = marginality * eventivity * potential influence of mass media coeff. * (7 - publication size) * reference type * subjectivity * logical presence coeff.

Step 4. Rating.

R: integrated resonance quality index.
- If W < 0, Y > 0, |W| > |Y|: <math>\beta = -\sqrt{W^2 - Y^2 + Z^2}</math>
- If W < 0, Y > 0, |W| < |Y|: <math>\beta = \sqrt{Y^2 - W^2 + Z^2}</math>
- If W < 0, Y < 0: <math>\beta = -\sqrt{W^2 + Y^2 + Z^2}</math>
- If Y < 0, W > 0, |W| < |Y|: <math>\beta = -\sqrt{Y^2 - W^2 + Z^2}</math>
- If Y < 0, W > 0, |W| > |Y|: <math>\beta = \sqrt{W^2 - Y^2 + Z^2}</math>
- If Y > 0, W > 0: <math>\beta = \sqrt{W^2 + Y^2 + Z^2}</math>

However, these tools, as well as the previously mentioned areas of research, are based on methods of expert evaluation, which in the general case does not exclude the systemic error problem mentioned above.
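The case analysis of Step 4 in the resonance methodology can be sketched as a small Python function. This is an illustration only: it assumes that the radical covers all three terms of each expression (the extracted formulas do not make the extent of the root explicit), and the function name `resonance_rating` is hypothetical. Combinations that the listed rules do not cover (for example W = 0, or |W| = |Y| with opposite signs) raise an error rather than guessing a value.

```python
import math


def resonance_rating(w: float, y: float, z: float) -> float:
    """Piecewise rating beta from W, Y, Z, following the six listed cases.

    Assumption: the square root spans the whole expression, e.g. sqrt(W^2 - Y^2 + Z^2).
    """
    w2, y2, z2 = w * w, y * y, z * z
    if w < 0 and y > 0 and abs(w) > abs(y):
        return -math.sqrt(w2 - y2 + z2)
    if w < 0 and y > 0 and abs(w) < abs(y):
        return math.sqrt(y2 - w2 + z2)
    if w < 0 and y < 0:
        return -math.sqrt(w2 + y2 + z2)
    if y < 0 and w > 0 and abs(w) < abs(y):
        return -math.sqrt(y2 - w2 + z2)
    if y < 0 and w > 0 and abs(w) > abs(y):
        return math.sqrt(w2 - y2 + z2)
    if y > 0 and w > 0:
        return math.sqrt(w2 + y2 + z2)
    # The source rules are silent on zeros and on |W| == |Y| with opposite signs.
    raise ValueError("combination of W and Y is not covered by the listed cases")
```

In every covered case the sign of the result follows the component with the larger magnitude, so the radicand is non-negative under the stated assumption.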
In previous works [2], by analyzing the trends of the TIFs created while categorizing the general IF, and by synthesizing linguistic statistics (both for standalone messages and calculated for a specific TIF load that had the characteristic features of the PR-effect), we:
- formulated and experimentally proved the hypothesis about the difference between the trends of thematic information flows of unbiased reporting and of the "PR-effect" in the characteristic space of linguistic features;
- developed a method for decomposing the general information flow into its constituent thematic streams;
- indicated a way of distinguishing thematic streams with linguistic features of the "PR-effect";
- empirically established a value for the "PR-effect" index of the thematic flow;
- presented empirical data that make it possible to assess the limit conditions for decision-making on the "PR-effect" index of the thematic flow.

In this paper, we focus on the detailed substantiation of the "PR-effect" index and the estimation of the criterion α for decision-making on the "PR-effect" in the TIF (4).

Thus, the purpose of this paper is to formally present the technology for detecting the "PR-effect" of electronic media by:
- conducting a comparative analysis of sets of messages that are a priori intended for "unbiased" reporting and of documents with vocabulary characteristic of the "PR-effect";
- conducting statistical analysis of the numerical sequences of estimates PRI(MTIP∆t)i for the experimental sets of data (documents) of both types;
- setting (substantiating) the limit conditions for estimating the value α of the "PR-effect" detection method, which is the task of the present work.

The subject of the study is two groups of information messages: one focused on "unbiased" reporting, the other containing vocabulary characteristic of the "PR-effect". The object of the study is the boundary criterion α of the "PR-effect" index.
===3 The main material of research study===

To determine the limit value of the criterion α, we performed the following.

1. In the system of information space analysis (SISA) [6], a data bank was created (conventional name SLOVAR). For this bank we thus obtained all the tools needed to calculate the estimates of the presence of the "PR-effect", namely the algorithms, trained by expert evaluation, for calculating the linguistic features "eventivity", "emotionality" and "manipulativity" of each document. The algorithm is represented by a block diagram (Figure 1). It is a formalized basis for detecting the "PR-effect" of the IF of electronic mass media and is designed to evaluate texts in both Ukrainian and Russian.

Fig. 1. Algorithm for the detection of the "PR-effect" based on determining the linguistic statistics of a given message

2. Two groups of messages were formed: one corresponding to "unbiased" reporting and one containing vocabulary characteristic of the "PR-effect". Hypothetically, texts that correspond to "unbiased" reporting should be articles of encyclopedic dictionaries. Therefore, we created the first group of documents (further denoted as C1) from the electronic version of the Great Encyclopedic Dictionary [7] (source: https://www.vslovar.ru/). The dictionary contains more than 80,000 articles, including about 20,000 biographical ones. Of these, 58283 articles from the specified electronic resource, each with a text size of more than 77 characters, were collected as the entry to the data bank.
The other group of documents, with vocabulary characteristic of the "PR-effect" (denoted as C2), was selected from the professional copywriter sites ranked as most popular by the Google search engine (examples of direct action texts from the portfolio that are currently in use as SEO texts on customer sites) in the Ukrainian, Belarusian and Russian segments of the Internet, 279 documents in total:
- "Copywriter Dmitry Kot" - http://www.mastertext.spb.ru/index.html;
- "Protect. Professional copywriting." Company. http://protext.by/;
- "Marmore." Copywriter Marina Greben. http://marmore-text.ru/;
- "PromoText. Copywriting." Company. https://promotext.com.ua/

3. Both sets of documents were entered into the SLOVAR data bank. For each document, the linguistic features "eventivity", "emotionality" and "manipulativity", as well as the estimate of the presence of the "PR-effect" PRI(C1,2) (that is, PRI(MTIP∆t)i), were calculated using the SISA tools and formula (2). The individual numerical sequences thus obtained for C1 and C2 were arranged as CSV tables suitable for further statistical analysis with the relevant software (MS Excel and Statistica 6.0).

Notes:

1. Regarding our choice of professional copywriting products as texts that contain vocabulary characteristic of the "PR-effect": we believe that a quote describing the purpose of copywriting, taken from the "Professional look" section of the site of one of the Ukrainian copywriting companies (source: http://cookiezz.com.ua/copyright), is quite revealing: "While writing texts we try to make them as simple for the client to read as possible. At the same time, they should be informative and aimed at SEO site promotion. By ordering a single text item or a full site content development from our company, you can be certain of receiving content of high quality that will raise the site's distribution and make already existing clients buy from you."

2.
When constructing the numerical sequence of estimates PRI(C1) (by expression (2), PRI(MTIP∆t)i) for C1, we found that a significant portion of the pre-selected 58283 articles of the Great Encyclopedic Dictionary (GED) consists of short texts. Only 11.7% of the GED documents (6821 articles) are texts of more than 512 characters. This number of documents is nevertheless indicative for our study, and they formed the basis for constructing the numerical sequence of estimates PRI(C1) (by expression (2), PRI(MTIP∆t)i) for C1.

Taking the above into account, the algorithm for the experimental determination of the limit value α for "PR-effect" detection can be represented by a block diagram (Figure 2).

Fig. 2. Algorithm for determining the limit value of α for "PR-effect" detection

For visual comparison and estimation of the type and nature of the probability distributions FC1 of the random values obtained for C1 and FC2 for C2, frequency distribution histograms of the estimates PRI(C1,2)i for each document in the corresponding group are constructed. By construction, the set of values PRI(C1,2)i, i = 1, ..., N, is approximated by a continuous function in the range between 0 and 1, and the number of class intervals n is calculated according to the Sturges rule [8]: n = 1 + log2 N. For the sequence C1 this gives n = 1 + log2(6821) = 13,736 intervals (scale step 0,07), and for C2, n = 1 + log2(277) = 9,114 intervals (scale step 0,11). The resulting histograms with a scale step of 0,1, constructed for sequences C1 and C2 using the Statistica 6.0 software, are presented in Figure 3. However, to improve the detail of the qualitative estimation of the experimental data distribution, we set a scale step of 0,025 (40 intervals) for both sequences.
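The interval counts quoted above can be reproduced with a short Python sketch of the Sturges rule. The function names are hypothetical; the only inputs are the sample sizes 6821 and 277 stated in the text and the [0, 1] range of the index.

```python
import math


def sturges_intervals(n_samples: int) -> float:
    """Number of class intervals by the Sturges rule, n = 1 + log2(N)."""
    if n_samples < 1:
        raise ValueError("sample size must be positive")
    return 1 + math.log2(n_samples)


def scale_step(n_samples: int, lo: float = 0.0, hi: float = 1.0) -> float:
    """Width of one class interval when [lo, hi] is split into Sturges intervals."""
    return (hi - lo) / sturges_intervals(n_samples)


for name, size in (("C1", 6821), ("C2", 277)):
    print(name, round(sturges_intervals(size), 3), round(scale_step(size), 2))
```

With the finer step chosen for the qualitative analysis, the same range is simply divided into 1 / 0.025 = 40 equal intervals, independent of the sample size.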
Grouping the values of the numerical sequences C1 and C2 into the established class intervals made it possible to determine the frequencies of the corresponding estimates PRI(C1,2)i (Table 1).

Fig. 3. Histograms of distribution functions for the sequences C1 and C2 using Statistica 6.0

As the next step, the histograms of the frequency distribution of the values PRI(C1,2) are built. To do so, the resulting numerical series are normalized by the cardinality of the corresponding sets of documents (samples C1 and C2).

Table 1. Frequency values of the PR-effect index for numerical sequences C1 and C2

The resulting histograms are presented in Figures 4 and 5. The light color represents the integral distribution function of the coefficient PRI(C1,2)i.

Fig. 4. Table of the normalized frequencies of the coeff. PRI(C1)i for GED articles

Fig. 5. Table of the normalized frequencies of the coeff. PRI(C2)i for copywriting articles and distribution histogram.

As can be seen from the figures (qualitative analysis of the probability distributions FC1 and FC2):

1. The probability distributions of both FC1 for C1 (with three clearly expressed modes) and FC2 for C2 are not Gaussian (that is, they have a non-parametric form). Therefore, using standard statistical estimates to calculate numerical characteristics of the distribution function (e.g. mean value or deviation), or using Student's criterion to confirm the hypothesis of homogeneity of the samples, would not be correct.

2. 83,03% of C2 (the set of texts by copywriters, Fig. 5) received a PR-effect index estimate of PRI(C2) > 0,975, while the same level of evaluation corresponded to only 3,01% of the documents in C1 (texts from the GED, Fig. 4).

3. A certain number of documents (18,33%) from C1 (GED texts, Fig. 4) have an estimate of PRI(C1) > 0,75, which is due to the literary nature of the information presentation in some of the encyclopedic articles.
These texts include:
- biographies of outstanding artists, historians and politicians, for example:

Fig. 6. Example of biography

- descriptions of historical events and of countries that directly participated in them or had geopolitical influence [11], for example:

Fig. 7. Example of historical event

- some definitions, for example:

Fig. 8. Example of definition

The descriptive statistics given in Table 2 also indicate a non-parametric distribution of FC1 and FC2.

{| class="wikitable"
|+ Table 2. Descriptive statistics of the distribution of PR-effect index for numerical sequences C1 (FC1) and C2 (FC2)
! Statistics !! C1 !! C2
|-
| Mean value (µ) || 0,433882 || 0,958077
|-
| Standard error || 0,003577 || 0,007551
|-
| Median (Me) || 0,44 || 1
|-
| Mode (Mo) || 0 || 1
|-
| Standard deviation (s) || 0,295416 || 0,125679
|-
| Variance (D) || 0,087271 || 0,015795
|-
| Excess (Ex) || -0,9799 || 19,92078
|-
| ExCrit (α=0,05) || 0,814 || 0,818
|-
| Asymmetry (As) || 0,167379 || -4,26553
|-
| AsCrit (α=0,05) || 0,130 || 0,230
|-
| Range (interval) || 1 || 0,8167
|-
| Minimum (min) || 0 || 0,1833
|-
| Maximum (max) || 1 || 1
|-
| Sample size (n) || 6821 || 277
|-
| Reliability level (95,0%) || 0,007012 || 0,014865
|}

Thus, given that for both C1 and C2 |As| > Ascrit and |Ex| > Excrit [9], the null hypothesis H(0), namely that the distribution of the PR-effect index estimates PRI(C1,2) for C1 and C2 corresponds to the Gaussian (normal) distribution, is rejected. In addition, C1 shows a distinct negative excess and right-hand asymmetry, while C2 shows a significant positive excess and left-hand asymmetry. The median value is at the level of 0,44 for C1 and 1 for C2; based on these values we will subsequently obtain the desired limit value of α for detecting the "PR-effect".

The hypothesis of the homogeneity of the independent samples C1 and C2 is then checked. The theoretical distribution functions FC1 and FC2 are unknown, and the hypothesis to be checked is that the two samples belong to the same general population.
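A homogeneity check of two independent samples with unknown distributions is commonly done with the two-sample Kolmogorov-Smirnov test. The sketch below is a pure-Python illustration, not the Statistica 6.0 procedure used in the study: it computes the largest gap between the two empirical distribution functions and the scaled statistic sqrt(n1·n2/(n1+n2))·D. The function name `ks_two_sample` is hypothetical, and the constant 1.358 is the asymptotic 5% critical value of the Kolmogorov distribution, used here as an assumption about the decision threshold.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple

LAMBDA_CRIT_5_PERCENT = 1.358  # asymptotic critical value at the 5% level


def ks_two_sample(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (D, lambda) for two samples.

    D is the maximum absolute difference between the empirical distribution
    functions; lambda is D scaled by sqrt(n1 * n2 / (n1 + n2)).
    """
    xs, ys = sorted(a), sorted(b)
    n1, n2 = len(xs), len(ys)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    # The empirical CDFs only change at observed values, so checking those suffices.
    d = max(
        abs(bisect_right(xs, v) / n1 - bisect_right(ys, v) / n2)
        for v in set(xs) | set(ys)
    )
    return d, math.sqrt(n1 * n2 / (n1 + n2)) * d


def same_population(a: Sequence[float], b: Sequence[float]) -> bool:
    """True when the homogeneity hypothesis is not rejected at the 5% level."""
    _, lam = ks_two_sample(a, b)
    return lam <= LAMBDA_CRIT_5_PERCENT
```

Because the test uses only the ordering of the observations, it does not require the samples to be normally distributed, which matches the non-parametric character of the two sequences.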
The null hypothesis H(0) to be verified is then FC1 = FC2, against the alternative hypothesis H(1): FC1 ≠ FC2, where FC1 and FC2 are considered continuous. To test H(0), the statistic of the non-parametric Kolmogorov-Smirnov criterion is applied:

<math>\lambda' = \sqrt{\frac{n_1 n_2}{n_1 + n_2}}\,\max\left|F_{C1} - F_{C2}\right|,</math>

where FC1 and FC2 are the empirical distribution functions of two samples of volumes n1 and n2.

Hypothesis H(0) is rejected if the value of the statistic λʹ is greater than the critical value λʹcr, i.e. λʹ > λʹcr; otherwise it is considered to be confirmed. The hypothesis H(0) was tested with the Kolmogorov-Smirnov criterion using the Statistica 6.0 software.

Fig. 9. Results of testing the hypothesis H(0) with the Kolmogorov-Smirnov criterion

The Kolmogorov-Smirnov criterion has shown that the distribution functions FC1 and FC2 do not belong to the same general population (that is, the hypothesis H(0) is rejected) at the 5% significance level (Figure 9). In other words, with 95% confidence there is a significant difference between the PR-effect index values PRI(C1,2)i of the experimental data C1 (messages that correspond to "unbiased" reporting) and C2 (documents with vocabulary characteristic of the PR-effect), as shown by the box-and-whisker diagram (Figure 10) [10].

Fig. 10. Box-and-whisker diagram for experimental data from C1 and C2

Additionally, we obtained the updated medians of the experimental data sequences: MeC1 = 0,433866 for C1 and MeC2 = 0,958204 for C2. Based on them, we calculate the limit value of α for "PR-effect" detection as the average of the medians:

α = (MeC1 + MeC2)/2 = (0,433866 + 0,958204)/2 = 0,696035.

The value α = 0,696035 is taken as the limit value of the decision-making criterion for the presence of the "PR-effect" (4), with an error rate of 5%.

===4 Conclusions===

To summarize, as a result of the study we gave a formal statement of the technology for detecting the "PR-effect" of electronic media (the algorithm is depicted as a flowchart in Figure 1).
An experimental substantiation and qualitative analysis of the PR-effect index were performed. They showed that, for both experimental samples (C1 for the articles of the Great Encyclopedic Dictionary, C2 for articles by copywriters), the probability distribution of its frequencies differs from the Gaussian one (it has a non-parametric form). Statistical analysis of the numerical sequences of experimental data (C1 and C2) confirmed the hypothesis that messages of "unbiased" reporting (C1) differ from those that contain vocabulary characteristic of the PR-effect (C2) in the characteristic space of linguistic features. The limit value of the decision-making criterion for the presence of the "PR-effect" (4), α = 0,696035, with an error rate of 5%, was obtained. At the same time, both qualitative and statistical analysis of the experimental material showed a certain difference in the value of the integral PR-effect index for the characteristic groups of documents. Therefore, as a task for further research, we aim to develop an algorithm for automatic grouping of an arbitrary set of messages in the characteristic space of linguistic features based on the PR-effect index.

===References===

1. Zrazhevskaya, N.I., Mogilko, S.V.: The technique and methods of manipulations in Internet publications (for example, Internet-newspapers "Press-Center", "Antenna"). Scientific Papers of the Institute of Journalism, T. 31, April - June, p. 118-122 (2008).

2. Ryaby, M., Khatyan, O., Bagatsky, S.: The method of revealing PR-impact through the Internet media. Information security, T. 21, No. 3, p. 294-300 (2015).

3. Ivanov, O.V.: Quantitative analysis of the text or the production of numeric artifacts: audit of content analytical research. Sociological sciences, Vol. 148, p. 11-15 (2013).

4. Fedorenko, R.M.: Content-monitoring of the information space as a factor in providing information to the state in the military sphere. Modern protection of information, No. 2, p. 21-25 (2015).

5. Media research and reputation analysis of NOKs FISHES company. February 9, 2016. http://www.slideshare.net/mark_kanarsky/nok-presentation-new-03

6. Khatyan, O.A.: The algorithm for building the "day" of the Internet media. Information security of man, society, state, No. 2 (18), p. 110-123 (2015).

7. Prokhorov, A.M.: Large Encyclopedic Dictionary. St. Petersburg: Norint, 1452 p. (1999).

8. Sturges, H.A.: The choice of a class interval. JASA, v. 21, p. 65-66 (1926).

9. Koichubekov, B.K.: Biostatistics: training. Almaty: Evero, 154 p. (2014).

10. Tikhomirov, A., Kinash, N., Gnatyuk, S., Trufanov, A. et al.: Network Society: Aggregate Topological Models. Communications in Computer and Information Science, Springer International Publ., vol. 487, pp. 415-421 (2014).

11. Danik, Yu., Hryschuk, R., Gnatyuk, S.: Synergistic effects of information and cybernetic interaction in civil aviation. Aviation, vol. 20, No. 3, pp. 137-144 (2016).