
Methods of data analysis to study the effectiveness of scientific journal promotion

Olha V. Korotun

Tetiana A. Vakaliuk

0 1 2 3

Tetiana M. Nikitchuk

tnikitchuk@ukr.net 3

Mariia O. Korotun

3

0 Academy of Cognitive and Natural Sciences, 54 Universytetskyi Ave., Kryvyi Rih, 50086, Ukraine
1 Institute for Digitalisation of Education of the NAES of Ukraine, 9 M. Berlynskoho Str., Kyiv, 04060, Ukraine
2 Kryvyi Rih State Pedagogical University, 54 Universytetskyi Ave., Kryvyi Rih, 50086, Ukraine
3 Zhytomyr Polytechnic State University, 103 Chudnivska Str., Zhytomyr, 10005, Ukraine


The article is devoted to studying the effectiveness of promoting the scientific journal “Journal of Edge Computing” among the scientific community. The article uses statistical methods and machine learning to analyse data collected during the distribution of invitations to review the journal. The purpose of the study was to conduct a comprehensive analysis of the data to determine the effectiveness of sending letters to foreign and domestic researchers with an invitation to view the journal's page, to interest them in the list of research areas of the journal and to invite them to publish in it in the future. The article describes in detail the stages of the analysis, from data collection and cleaning to model building and interpretation of the results. The results allowed us to identify the countries whose researchers are most interested in the journal to focus on them in the future. The study results may be useful for other scientific journals seeking to expand the geography of countries and attract new authors to publish in their scientific journals.

Keywords: edge computing, Journal of Edge Computing, data analysis, R programming language

1. Introduction

In today’s world of technology, a new paradigm of edge computing is becoming increasingly widespread, aimed at moving computing resources closer to the data source. This is due to the speed of data processing, reduced network load, continuity of data processing and efficient use of computing resources. This trend helps to open up new opportunities in various areas of human life. The purpose of this study is to conduct a comprehensive analysis of the collected data from the scientific journal “Journal of Edge Computing” [1] to identify critical trends for reaching and attracting foreign and domestic researchers for further visits and publication in the journal and to outline the prospects for research in this area.

Since this journal was established relatively recently, the authors of the article had the idea to analyse the letters of invitation sent to scholars to visit and view the website of this journal (https://acnsci.org/jec) and the feedback received from them. Such an analysis would allow for the collection and interpretation of information about the interest of scholars. It is essential to understand the breadth of the audience, namely who viewed the journal, how many viewers there were and from which countries, and the reasons for this, as well as the difficulties that could arise in this regard. This will help to make the journal’s content more relevant and exciting for many scholars and encourage them to publish in the journal.

2. Literature review

Data analysis is used in a relatively large number of studies. In [2], a study on the use of Android-based augmented reality (AR) social science textbooks in secondary school is presented. It combines qualitative data analysis, carried out in the stages of data collection, data cleaning, data presentation and determination of results, with quantitative data analysis using correlation and regression. The authors note that such technology will significantly improve the educational process and make it more interesting, practical and interactive.

Katahira [3] highlights the widespread problem of data heterogeneity in various scientific fields and addresses it through clustering, described as the process of dividing data into several groups (clusters) so that objects in one cluster are more similar to each other than to objects from different clusters. As a result, we can reveal hidden structures in the data and better understand the reasons for differences in observations.

The R/LinkedCharts tool for simplifying the analysis of large amounts of data is described in [4]. Its advantages include interactive visualisation (creation of interconnected charts), efficiency (creating visualisations with minimal code), and flexibility (overview charts, detailed graphs). Data analysis using the R programming language and many practical examples are presented in [5]. RStudio is used as an integrated development environment (IDE). Davis [6] also describes data analysis methods using the R programming language, covering exploratory data analysis, spatial data analysis, statistics and modelling, and effective presentation of results.

3. Methodology

The preparatory stage of analysing a particular data set involves finding the main metrics of descriptive statistics, namely calculating the median and arithmetic mean, finding quartiles and interquartile ranges, and calculating covariance and correlation coefficients.

At the main stage of the study, we use data mining methods and machine learning algorithms to build a regression model and cluster the data. Regression is a supervised learning method, while clustering is an unsupervised one.

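A regression model describes the relationship between the dependent variable and the independent variables and makes it possible to predict the former from the latter. This can be done by building a basic regression model, namely a linear model, according to formula (1):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon \qquad (1)$$

where $y$ is the dependent variable, $x_1, x_2, \dots, x_n$ are independent variables, $\beta_0, \beta_1, \dots, \beta_n$ are the model coefficients, and $\varepsilon$ is the random deviation.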

This model will make it possible to predict the number of visits (dependent variable – visitors) based on the emails sent out (independent variable – mailings).

To assess the quality of the model, we use the coefficient of determination ($R^2$). This statistical measure shows how well the model fits the data set under study. It is calculated using formula (2):

$$R^2 = \frac{\sum_{i=1}^{n} \left(y_i^{\mathrm{pr}} - y^{\mathrm{av}}\right)^2}{\sum_{i=1}^{n} \left(y_i^{\mathrm{fact}} - y^{\mathrm{av}}\right)^2} \qquad (2)$$

where $y_i^{\mathrm{pr}}$ is the predicted value of the dependent variable, $y^{\mathrm{av}}$ is the average value of the dependent variable, and $y_i^{\mathrm{fact}}$ is the actual value of the dependent variable. The coefficient ranges from 0 to 1; the closer it is to 1, the better the model fits the data. In addition to the coefficient of determination, there are other metrics for evaluating models, such as AIC, BIC and RMSE.
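As a minimal sketch of formulas (1) and (2), the following R fragment fits a one-variable linear model and computes the coefficient of determination by hand. The numbers are invented toy values, not the study's data, and the variable names simply mirror the column names used later in the text.

```r
# Hypothetical toy data (NOT the study's data): letters sent and views received
mailings <- c(2, 5, 8, 12, 20, 35)
visitors <- c(3, 4, 9, 10, 18, 30)

# Fit the linear model y = beta0 + beta1 * x by ordinary least squares
model <- lm(visitors ~ mailings)

# Coefficient of determination following formula (2)
y_pr <- fitted(model)      # predicted values of the dependent variable
y_av <- mean(visitors)     # average of the actual values
r2   <- sum((y_pr - y_av)^2) / sum((visitors - y_av)^2)

# Other quality metrics mentioned in the text
rmse <- sqrt(mean(residuals(model)^2))

print(r2)
print(summary(model)$r.squared)   # value reported by R for comparison
print(c(AIC = AIC(model), BIC = BIC(model), RMSE = rmse))
```

For a least-squares model with an intercept, the hand-computed value should agree with the `r.squared` field returned by `summary()`.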

Researchers use various clustering methods to divide data into several groups. Let us consider the main ones and describe their features. In the k-means method, each object in the dataset belongs to only one cluster, and the number of clusters must be determined in advance, for example, using the elbow method. Hierarchical clustering builds a hierarchical structure of clusters, and the number of clusters can be determined from the resulting dendrogram. The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method builds clusters based on data density, separating clusters by areas of low data density; after visualising the clustering performed by this method, we can determine the number of clusters formed in the existing data set.

For the present study, we chose the k-means method, which allows us to divide the available data into several clusters based on their similarity. This method is based on minimising the sum of the squared distances between the data and the found cluster centre (3):

$$\sum_{i=1}^{n} d\left(x_i, m_j(x_i)\right)^2 \qquad (3)$$

where $d$ is the metric, $x_i$ is the $i$-th data object, and $m_j(x_i)$ is the centre of the cluster to which $x_i$ is assigned at the $j$-th iteration. Let us present the algorithm of the iterative k-means clustering method (algorithm 1).

Algorithm 1. Clustering data using the k-means method.

Require: Data points $X = \{x_1, \dots, x_n\}$, number of clusters $k$
Ensure: Cluster assignments $C$ and centroids $\mu$
1: Initialize centroids $\{\mu_1, \dots, \mu_k\}$ randomly from the data points
2: while centroids have changed do
3:   // Assign points to the nearest centroid
4:   for each point $x_i \in X$ do
5:     $C[i] \leftarrow \arg\min_j \|x_i - \mu_j\|^2$ {Assign $x_i$ to the closest centroid}
6:   end for
7:   // Update centroids
8:   for $j = 1$ to $k$ do
9:     $S_j \leftarrow \{x_i \in X : C[i] = j\}$ {Points assigned to cluster $j$}
10:     $\mu_j \leftarrow \frac{1}{|S_j|} \sum_{x \in S_j} x$ {Recalculate centroid of cluster $j$}
11:   end for
12: end while
13: return Cluster assignments $C$ and centroids $\mu$
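The iterative procedure of algorithm 1 can be sketched directly in R. This is an illustrative re-implementation under simple assumptions (squared Euclidean distance, a cap on the number of iterations), not the code used in the study; the function name `simple_kmeans` and the toy points are hypothetical.

```r
# Minimal k-means following Algorithm 1; `simple_kmeans` is an illustrative name
simple_kmeans <- function(X, k, max_iter = 100) {
  X <- as.matrix(X)
  # Step 1: choose k distinct data points as the initial centroids
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  assignment <- rep(0L, nrow(X))
  for (iter in seq_len(max_iter)) {
    # Steps 4-6: squared Euclidean distance from every point to every centroid
    d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centroids[j, ])^2))
    new_assignment <- max.col(-d2, ties.method = "first")  # index of the nearest centroid
    # Steps 8-11: recompute each centroid as the mean of its assigned points
    new_centroids <- centroids
    for (j in seq_len(k)) {
      members <- X[new_assignment == j, , drop = FALSE]
      if (nrow(members) > 0) new_centroids[j, ] <- colMeans(members)
    }
    # Step 2: stop as soon as the centroids no longer change
    converged <- all(new_centroids == centroids)
    centroids <- new_centroids
    assignment <- new_assignment
    if (converged) break
  }
  list(cluster = assignment, centers = centroids)
}

# Hypothetical toy points forming two well-separated groups
set.seed(1)
X <- rbind(c(1, 2), c(2, 1), c(1, 1), c(50, 60), c(55, 58), c(52, 61))
res <- simple_kmeans(X, k = 2)
print(res$cluster)
print(res$centers)
```

In practice the built-in `kmeans()` function is preferable, since it handles multiple random starts and edge cases such as empty clusters.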

The described mathematical apparatus is available in the R language through built-in functions, which facilitates the implementation of this study and allows us to draw certain conclusions.

4. Results

To analyse the generated data set, we apply a generally accepted scheme consisting of the following steps (figure 1).

[Figure 1: Scheme of the data analysis stages, from data collection and data cleaning to interpretation and visualization of results and formulation of conclusions and recommendations.]

According to the above scheme, let us describe the work that needs to be done at each stage of data analysis.

1. Data collection includes identifying the purpose, sources, and the actual data collection process.

To obtain the data, we selected scientific articles from ScienceDirect, an open collection of published scientific research, in areas similar to those of the JEC journal and collected the email addresses of scientists from around the world, together with their names, article titles, and countries, which allowed us to form a data table (figure 2). This process was carried out daily during June 2024 and added an average of about 60-70 records.

2. Data cleaning involves detecting and eliminating errors in the data and processing missing values.

If necessary, data standardisation can be performed. For the obtained data set, the analysis was carried out in the RStudio environment using the R programming language.

3. The descriptive analysis includes an overview of the available data in the dataset, determination of its structure, data visualisation, and calculation of statistical indicators (measures of central tendency, standard deviation, quartiles, interquartile range, etc.).

4. Analytical analysis is the stage where machine learning methods, namely regression, classification, clustering, etc., are usually applied.

5. Interpretation and visualisation of the results include determining the main conclusions after analysing the data set and building the necessary graphs of the results for clarity.

6. At the stage of formulating conclusions and recommendations, the results of the data analysis are described, and recommendations for further data analysis are written.

The study analysed the effectiveness of sending letters to scientists and the feedback received in the form of journal views, using the steps presented in table 1.

Let us describe in detail the stages of data analysis of the obtained dataset for the scientific journal “Journal of Edge Computing”.

In the first stage, it was necessary to form the required dataset; for this purpose, a table with many records was built from the data collected daily in the above form. At the end of the month, we manually cleaned the data set and deleted the records with undeliverable email addresses from the table. The final dataset for further analysis is a data table consisting of four columns: country number, country name (country), number of letters sent to scientists from this country (mailings) and number of journal views received from this country (visitors). The first records of the resulting table are shown in figure 3.

Table 1

| Step | Description |
|---|---|
| Comparison of countries by the number of letters sent | We compare how many letters were sent to different countries, which is aimed at identifying those countries that should be focused on in the future |
| Comparison of countries by the number of journal views | We compare how many scientists from different countries have viewed the journal, which is aimed at identifying those countries that are most interested in the journal’s research topics |
| Calculation of the conversion rate | We will calculate the conversion rate for each country in order to assess the effectiveness of the letters sent to scientists from different countries |
| Use of statistical significance tests | We use statistical significance tests to find out whether there is a statistically significant difference between the number of views or conversion rates in different countries |
| Using machine learning models (building a linear regression model) | Machine learning models can be used to predict how many scientists from a particular country will view a journal if an email is sent to them |
| Country segmentation (data clustering) | Countries can be segmented based on their distribution into groups with similar characteristics, for example, the number of letters sent or the number of views received |

Let us load and view the obtained data set in the RStudio environment using the R programming language. For this, we install and connect the necessary packages for working with data.

Let us use the dim() and print() functions of the R language to get the following information about the data set: the table consists of 4 columns and 53 rows; let us look at the first records of the table to check if the required data set is loaded correctly into the dset variable (figure 4). From the figures, we can see that the loaded set in dset matches the built data table.


Now let us examine the data set and determine its internal structure. For this, we use the str() diagnostic function, paying special attention to the data types. In our case, the data types match the values entered into the table cells, so no further manipulations are required.
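The loading and inspection steps can be sketched as follows. The file name and contents of the study's data set are not given in the text, so this example builds a small hypothetical table, writes it to a temporary CSV file and reads it back into `dset`; the column `n` stands in for the country number.

```r
# Hypothetical stand-in for the study's table (values invented for illustration)
toy <- data.frame(
  n        = 1:4,
  country  = c("Country A", "Country B", "Country C", "Country D"),
  mailings = c(12, 3, 40, 7),
  visitors = c(9, 5, 22, 0)
)

# Round-trip through a temporary CSV file to mimic loading the data from disk
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
dset <- read.csv(path)

print(dim(dset))    # number of rows and columns
print(head(dset))   # first records of the table
str(dset)           # column names and data types
```

Checking `str()` at this point is useful because a numeric column that was read as text would silently break the later statistics.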

In the next stage, we will conduct a descriptive analysis of the available data set and display its descriptive statistics. The presentation of such statistics will contribute to a better understanding of the available data. For numerical columns of data, the minimum and maximum values will be displayed, which allows you to understand the range of data values, median, mean, and quartiles of data. For this, we use the summary() function, which provides basic statistical information for each data set column (figure 5).

If you look at the result, you can see that in the mailings and visitors columns, the values of the first and third quartiles are small and close to each other, which means that most of the values in these columns lie in this range; this is confirmed by the interquartile range for these columns (figure 6).
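A short sketch of this descriptive step, using invented toy values rather than the study's data, shows how `summary()` and `IQR()` expose a narrow middle range next to a very large maximum.

```r
# Hypothetical toy columns (NOT the study's data); one value is deliberately extreme
dset <- data.frame(
  mailings = c(1, 2, 3, 4, 5, 6, 100),
  visitors = c(0, 2, 2, 3, 5, 7, 90)
)

# Minimum, quartiles, median, mean and maximum for every column
print(summary(dset))

# Interquartile range: distance between the third and the first quartile
iqr_mailings <- IQR(dset$mailings)
iqr_visitors <- IQR(dset$visitors)
print(c(mailings = iqr_mailings, visitors = iqr_visitors))
```

Here the mean of each column is far above its median, which is the same kind of signal of skewness and outliers discussed in the text.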

Let us look at the maximum values of these columns. We see that they are significantly large, indicating that there are so-called outliers in the data (abnormally large values in the available data set). In order to visually verify the preliminary results of the study, we will build dot plots of the data distribution by country based on the number of letters sent and journal visits received (figure 7). Looking at the obtained diagrams in figures 8-9, we found that the most significant outliers in terms of the number of sent letters are observed in China, India and the United States, which may be due to a sufficiently large number of users, since these countries are the largest in terms of population in this dataset, and to the peculiarities of their active behaviour on the network and access to it. In terms of journal views, outliers were found in India, Nigeria, the United States, and Ukraine. India, Nigeria, and the United States are countries with large populations, which automatically increases the potential scientific audience of the journal, and there is also a different rate of growth in the level of education and interest in scientific research. As for Ukraine, the scientific validity of the research, the accessible form of presentation, and the credibility of the authors of publications can contribute to the journal’s popularity.

In order to understand the distribution of data in mailings and visitors, it is advisable to build boxplots (box-and-whisker plots) that reflect the distribution of data in these columns and their variability and asymmetry (figure 10).

Let us describe the boxplots, which refine the result obtained earlier using the summary() function. First, note the location of the boxes: they are not in the centre of the graph but at the bottom, which indicates the asymmetry of the distribution of emails sent and views received. The middle line in each box reflects the median of the data; for mailings it is slightly higher than the average value, and for visitors, on the contrary, slightly lower. The edges of the boxes, representing the first and third quartiles, confirm the small interquartile range calculated earlier, i.e. how scattered the data is within the set, so the values for mailings and visitors do not differ significantly. The upper whiskers of the boxes extend to the largest values that are not outliers: about 15 for mailings and about 12 for visitors. The points located well above the upper whiskers are outliers: for mailings these are the values 23, 73, 75 and 186, and for visitors the values 41, 53, 130 and 195. In summary, both columns have similar values, their distributions are similar and skewed, outliers are present, and we need to calculate the standard deviation for these columns to determine the variability of the data.
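The quantities read off a boxplot can also be obtained numerically. The sketch below uses `boxplot.stats()` on invented toy values (not the study's data) to extract the whisker ends and the points flagged as outliers under the usual 1.5 × IQR rule.

```r
# Hypothetical toy columns (NOT the study's data)
mailings <- c(1, 2, 3, 4, 5, 6, 100)
visitors <- c(0, 2, 2, 3, 5, 7, 90)

# boxplot.stats() returns the five box values ($stats) and the outliers ($out)
bs_mailings <- boxplot.stats(mailings)
bs_visitors <- boxplot.stats(visitors)

print(bs_mailings$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs_mailings$out)    # values beyond the whiskers
print(bs_visitors$out)
```

Calling `boxplot(mailings, visitors, names = c("mailings", "visitors"))` would draw the corresponding plot from the same statistics.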

Let us calculate the standard deviation for mailings and visitors. This will give you an idea of how far a typical value in a column is from the mean (figure 11) if you don’t take into account outliers.

The standard deviations obtained are small, meaning we have a low level of data variability. Let us see whether the number of emails sent and the number of visits to the journal change in tandem; for this, let us calculate the covariance (figure 12).

The value of this coefficient is relatively high, indicating a strong relationship between the values of mailings and visitors, i.e. they change in the same direction, and an increase in visitors accompanies an increase in mailings. Let us check how strongly the number of mailings and the number of visits to the journal are related by calculating the Spearman correlation coefficient (figure 13).
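The dispersion and association measures discussed here map onto three base R functions. The following sketch applies them to invented toy values (not the study's data) and also checks the known fact that the Spearman coefficient is the Pearson correlation computed on ranks.

```r
# Hypothetical toy columns (NOT the study's data)
mailings <- c(1, 2, 3, 4, 5, 6, 100)
visitors <- c(0, 2, 2, 3, 5, 7, 90)

# Standard deviation of each column
sd_mailings <- sd(mailings)
sd_visitors <- sd(visitors)

# Covariance: positive when the two columns tend to move in the same direction
cov_mv <- cov(mailings, visitors)

# Spearman rank correlation: strength of the monotonic association
rho <- cor(mailings, visitors, method = "spearman")

# Equivalent computation: Pearson correlation of the ranks
rho_ranks <- cor(rank(mailings), rank(visitors))

print(c(sd_mailings = sd_mailings, sd_visitors = sd_visitors))
print(c(covariance = cov_mv, spearman = rho))
```

Unlike the covariance, the Spearman coefficient is bounded between -1 and 1 and is not dominated by a single extreme value, which makes it a reasonable choice for skewed data.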

The obtained correlation coefficient is close to 1, which indicates a strong positive relationship, so we can build a model of the relationship between the dependent variable visitors and the independent variable mailings using linear regression, which is one of the most common machine learning methods (figure 14).

Let us write the formula of the resulting model in the form of the following equation:

$$\mathit{visitors} = 0.2 \cdot \mathit{mailings} + 10.53 \qquad (4)$$

The built model can predict future visits to the journal’s page depending on the number of letters sent to scientists in a particular country.
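A prediction of this kind can be sketched with `lm()` and `predict()`. The data frame below contains invented toy values, so its fitted coefficients will differ from those in equation (4); the helper `eq4` is a hypothetical name that simply evaluates equation (4) as printed in the text.

```r
# Hypothetical toy data (NOT the study's data)
dset <- data.frame(
  mailings = c(2, 5, 8, 12, 20, 35),
  visitors = c(3, 4, 9, 10, 18, 30)
)

# Fit visitors as a linear function of mailings
model <- lm(visitors ~ mailings, data = dset)
b <- coef(model)                      # intercept and slope of the toy model

# Predict the number of views for a country that receives 50 letters
new_country <- data.frame(mailings = 50)
pred <- predict(model, newdata = new_country)
print(pred)

# Equation (4) from the text, evaluated for the same number of letters
eq4 <- function(mailings) 0.2 * mailings + 10.53
print(eq4(50))
```

Since 50 letters lies outside the range of the toy data, the first prediction is an extrapolation and should be read with corresponding caution.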

Let us calculate the conversion rate for each country as the ratio of the number of letters sent to the number of visits to the journal’s page, expressed as a percentage. This will allow us to assess the effectiveness of the mailing to each country; for this rate, we will add another column “Conversion” to the data set (figure 15).

Let us take a look at the list of countries with a conversion rate of more than 100% (figure 16).

Thus, these are the countries to which many more letters were sent than journal views were received in return, which should be considered in the future.
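The calculation can be sketched as follows. The direction of the ratio (letters sent divided by views, times 100) is an assumption inferred from the interpretation of values above 100% given in the text; the country labels and numbers are invented.

```r
# Hypothetical toy data (NOT the study's data)
dset <- data.frame(
  country  = c("A", "B", "C", "D"),
  mailings = c(10, 4, 0, 20),
  visitors = c(5, 8, 3, 20)
)

# Assumed definition: letters sent per view received, as a percentage
dset$Conversion <- dset$mailings / dset$visitors * 100

# Countries where more letters were sent than views were received
high <- dset[dset$Conversion > 100, ]
print(dset)
print(high)
```

With this definition, a country with zero views and at least one letter would get an infinite rate, so such rows may need separate handling before filtering or plotting.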

As a follow-up to the above analysis, it is advisable to use a statistical significance test, such as a t-test, to compare the mean values of the two columns (mailings, visitors) to determine whether there is a statistically significant difference between the number of visits and the number of mailings or conversion rates in different countries. In R, there is a particular function t.test() that displays the t-statistic (t), degrees of freedom (df), p-value, and confidence interval. If we apply the t-test to mailings and visitors, we get the following result in figure 17.

The calculations allow us to draw the following conclusion: the absolute value of the t-statistic (t = -0.41427) is small, so the null hypothesis is not rejected; that is, the mean values of the two columns are close, and there is no significant difference between them, which is also confirmed by the p-value.
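A minimal sketch of such a test on invented toy values is shown below. By default `t.test()` performs Welch's two-sample test; the text does not state whether the authors used this default or a paired variant, so the choice here is an assumption.

```r
# Hypothetical toy columns (NOT the study's data)
mailings <- c(1, 2, 3, 4, 5, 6, 100)
visitors <- c(0, 2, 2, 3, 5, 7, 90)

# Welch two-sample t-test (the default of t.test)
res <- t.test(mailings, visitors)
print(res)

# The same t-statistic computed by hand from means and variances
manual_t <- (mean(mailings) - mean(visitors)) /
  sqrt(var(mailings) / length(mailings) + var(visitors) / length(visitors))

print(c(t = unname(res$statistic), df = unname(res$parameter), p = res$p.value))
```

A p-value above the chosen significance level (commonly 0.05) means the data do not provide evidence of a difference in means; it does not prove that the means are equal.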

To segment the countries by the number of emails sent and the number of visits to the journal, we use k-means clustering of the data set. This is one of the most common, simple, effective, and flexible methods of cluster analysis and allows grouping data based on their similarity. The data are divided into three clusters (figure 18).

Let us display the number of countries in each cluster (figure 19).

Let us look at the lists of countries in each cluster (figure 20).
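The three steps above (clustering, counting, listing) can be sketched with the built-in `kmeans()` function. The country labels and values are invented, and the `nstart` and seed settings are illustrative choices rather than the study's configuration.

```r
# Hypothetical toy data (NOT the study's data): three visibly different groups
dset <- data.frame(
  country  = paste("Country", LETTERS[1:9]),
  mailings = c(1, 2, 3, 2, 20, 22, 25, 150, 160),
  visitors = c(1, 3, 2, 2, 25, 18, 22, 40, 60)
)

# k-means with three clusters; several random starts make the result more stable
set.seed(42)
km <- kmeans(dset[, c("mailings", "visitors")], centers = 3, nstart = 25)
dset$cluster <- km$cluster

# Number of countries in each cluster
sizes <- table(dset$cluster)
print(sizes)

# List of countries in each cluster
by_cluster <- split(as.character(dset$country), dset$cluster)
print(by_cluster)
```

The cluster numbers returned by `kmeans()` are arbitrary labels, so cluster 1 in one run may correspond to cluster 3 in another; only the grouping itself is meaningful.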

After clustering by the k-means method, the countries were divided into three clusters based on the number of letters sent and visits to the journal. Each cluster has its own characteristics, which allow us to draw conclusions about the level of interest of scientists from different countries in this journal.

Cluster 1 (green triangles) includes countries where the number of letters sent and visits to the journal is small, which indicates a low level of interest of scientists in the journal’s topics; figure 21 shows that there were few mailings and, accordingly, few visits. In the countries of this cluster, the journal’s promotion activities are likely to be ineffective, so it may be worth revising the journal’s promotion strategy in these countries or focusing on the audience from other clusters.

The second cluster (blue square), on the contrary, includes countries with abnormally high values either in the number of mailings or in the number of visits, so you can see that in some countries (USA, Ukraine), the interest in the journal is high with a small number of mailings, and in some countries (India, China), on the contrary, it is low, so we do not see the expediency of sending mailings to these countries in the future. For the countries in this cluster, a more detailed analysis is likely to be required to understand the reasons for such deviations from the general trend, and it may be necessary to adjust the strategy for each country separately.

The third cluster (red circles) includes countries with average values that reflect both a high level of interest in the journal, for example, countries such as Austria, the Netherlands, Turkey, Malaysia, etc., and a low level of interest, for example, Australia, Canada, Pakistan, etc. Also, it includes countries such as Poland and Nigeria, to which no mailings were made. The countries in this cluster are the most heterogeneous in terms of interest, so to build an effective journal promotion strategy, it is necessary to further segment this cluster by other criteria, such as language, region, scientific interests, etc.

To sum up, based on a detailed analysis of the number of letters sent and visits to the “Journal of Edge Computing” in different countries, we could divide them into groups according to the level of interest in the topics of the scientific publication. This country segmentation will allow the editorial team to define the journal’s target audience more accurately and focus marketing efforts on countries with high interest. However, although the clustering results give us a general idea of the distribution of countries by level of interest, they do not explain why certain countries have a high, medium or low level of interest in the journal. To better determine the reasons for the difference in interest among researchers, additional research is needed to identify the factors influencing users’ decisions to visit the journal and study the scientific publications presented. As for the factors, we can assume the following: cultural peculiarities that may affect the perception of information and the choice of information sources; the level of economic development of the country and, accordingly, access to information technology; access to the Internet, especially its speed and cost, which may limit users' access to online resources; language barriers, since these are English-language publications; thematic relevance, that is, the correspondence of the journal’s topics to the interests and needs of scientists from different countries; and the availability of other similar scientific journals.

5. Conclusions

The study conducted a comprehensive analysis of the data set collected from the mailing lists by the editors of the scientific journal “Journal of Edge Computing”, which was aimed at establishing the effectiveness of electronic invitations sent to foreign and domestic researchers for review and further publication. It revealed the list of countries of residence of researchers who have expressed interest in the journal. We can state that there is a positive correlation, i.e. the more invitations were sent, the higher the number of visits to the journal. In some countries, there is a very high or low interest in the journal, so the analysis highlighted the differences in the level of involvement of scientists from different countries in reviewing the journal. As for the results obtained from the data clustering, countries were grouped into appropriate clusters depending on the level of interest of scientists. For the future, a linear model was built to predict the preliminary result of the interest of scientists from different countries, which will allow the journal’s editorial board to attract more interested countries and effectively conduct a marketing campaign to attract scientists. To carry out this study, the R programming language and various statistical methods were used to clean the data, conduct descriptive and analytical data analysis, and build visualisations, which allowed for efficient and detailed analysis with the required results. Thus, the study has shown the effectiveness of the measures taken to send out electronic invitations.

Declaration on Generative AI: During the preparation of this work, the authors used Grammarly for grammar and spelling checks. After using this service, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

References

[1] T. A. Vakaliuk, S. O. Semerikov, Introduction to doors Workshops on Edge Computing (2021-2023), Journal of Edge Computing 2 (2023) 1-22. doi:10.55056/jec.618.

[2] Ratmaningsih, Abdulkarim, D. S. Logayah, D. N. Anggraini, Sopianingsih, F. Y. Adhitama, M. A. Widiawaty, Android-Based Augmented Reality Technology in the Application of Social Studies Textbooks in Schools, Journal of Advanced Research in Applied Sciences and Engineering Technology 48 (2024) 29-50. doi:10.37934/araset.48.1.2950.

[3] Katahira, Evaluating the predictive performance of subtyping: A criterion for cluster mean-based prediction, Statistics in Medicine 42 (2023) 1045-1065. doi:10.1002/sim.9656.

[4] Ovchinnikova, Anders, Simple but powerful interactive data analysis in R with R/LinkedCharts, Genome Biology 25 (2024). doi:10.1186/s13059-024-03164-3.

[5] Imran, W. N. Arifin, T. M. H. T. Mokhtar, Data Analysis in Medicine and Health Using R, 2024. URL: https://bookdown.org/drki_musa/dataanalysis/.

[6] Davis, Introduction to Environmental Data Science, 2023. doi:10.1201/9781003317821.