Determination of Conformity of Scientific Reports to the
Conference’s Topics
Pavel A. Kozlov 1, Shahim I. Safin1, Vladimir O. Tolcheev 1
1
  National Research University "Moscow Power Engineering Institute”, Krasnokazarmennaya 17, Moscow,
11250, Russian Federation


                Abstract
                This paper examines the conformity of scientific reports to conference sections. Assumptions
                were formulated and tested about how scientific articles are allocated to conference sections.
                With the help of text mining methods, clusters were identified on which the reports are
                grouped. The analysis of the terminological composition of the obtained clusters is carried out.
                The closeness of the resulting topics (clusters) and the degree of their correspondence with the
                names of sections and specialization of departments are analyzed. The sample for research is
                formed from the proceedings of the interdisciplinary conference "Electronics, Electrical
                Engineering and Energy", which was held at the National Research University in 2020. The
                total sample size is 88 articles. Pay attention to an important feature of the data – our sample
                consists of extremely short text documents, which contain only the titles of reports. In this
                regard, the text size varies from 5 to 20 words (the average size is 9 words). Our research has
                shown a fairly significant discrepancy between expert assessments (the selection of a section
                by an expert) and clusters built using the tool of Data and Text Mining. At the same time the
                following dependence was established - most often the reports were distributed by the place of
                work (or study) of the speaker, i.e. the affiliation to the department determined the section. at
                the same time, the following dependence was established - most often the reports were
                distributed by the place of work (or study) of the speaker, i.e. the affiliation to the department
                determined the section. In addition, we have identified high interdisciplinarity of reports, many
                of which could (with good reason) be reported on several sections at once. Note also that
                analysis of the specialized literature showed a good correspondence of our results with other
                similar studies.

                Keywords 1
                Data and text mining, clustering, K-means, latent semantic analysis, hierarchical cluster
                analysis.

1. Introduction
    Participation in scientific conferences is an important condition for the successful preparation of
qualification works, the implementation of grants, and research work. When choosing conferences,
specialists are guided primarily by the information that organizers provide in information letters about
the main directions (topics, sections), and assess how the section names are related to their professional
interests.
    After the presentation of the report, the final decision on its compliance with the profile of the
conference, as well as assignment to one of the sections, is made by the experts reviewing the work.
However, speaking at a section, the speaker often discovers that his topic is quite different from other
works. This can be useful for broadening one's horizons and exchanging ideas from various subject
areas, but it does not provide an opportunity to discuss issues with leading experts professionally dealing

    Proceedings of VI International Scientific and Practical Conference Distance Learning Technologies (DLT–2021), September 20-22,
2021, Yalta, Crimea
EMAIL: Kozlov.Pavel.Andreevih@yandex.ru (A. 1); safinsi@mail.ru (A. 2); tolcheevvo@mail.ru (A. 3)
ORCID: 0000-0002-9900-9368 (A. 1); 0000-0002-5214-5119 (A. 2); 0000-0002-2320-6533 (A. 3)
             ©️ 2021 Copyright for this paper by its authors.
             Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
             CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                  231
with issues of interest to the author. Moreover, quite often works on related topics are presented in
different (simultaneously passing) sections of the conference, which makes it difficult to participate in
their discussion.
    In this article, using the data mining toolkit (Data and Text Mining), the degree of correspondence
between the topics of the reports and the titles of the sections are studied based on their terminological
proximity [1]. The study is carried out on the example of the analysis of the proceedings of the
conference "Electronics, Electrical Engineering and Energy", which was held at the National Research
University (NRU "MPEI") in 2020 [2].

2. Study description
    The conference "Electronics, Electrical Engineering, and Energy" is interdisciplinary and covers
several subject areas, our attention is focused on the direction "Information Technology". This direction
is represented by 8 sections, which are conducted by the departments included in the Institute of
Information and Computing Technologies of the NRU "MPEI" (ICTI). In our study, 5 sections are
considered, which reflect the main areas of training for students studying at the IHTI [1].
         Section 1. "Mathematical Simulation " (Department of Mathematical and Computer
    Simulation).
         Section 2. "Applied Mathematics" (Department of Applied Mathematics and Artificial
    Intelligence).
         Section 3. "Computer Science and CAD" (Department of Computational Technologies).
         Section 4. "Computing machines, networks and systems" (Department of Computing machines,
    systems and networks).
         Section 5. "Management and Informatics in Technical Systems" (Department of Management
    and Intelligent Technologies).
    The total sample size is 88 articles, the number of reports in sections varies from 7 to 23, while each
section has on average 2 "external" presentations (except for the section "Mathematical Modeling",
which had 8 "external" works). Thus, the topics of the reports for the most part reflect the directions of
the department's research.
    Let's consider the specifics of presenting materials at the conference "Electronics, Electrical
Engineering, and Energy". The authors prepare short abstracts in the following form: "title - text -
literature references". The specificity of the original text documents allows the researcher to choose two
options: use the entire text (i.e. process full-text data) or analyze only the title of the report (i.e. one
sentence). In this paper, it was decided to analyze the titles of reports, the size of which varies from 5
to 20 words (average size is 9 words). Of course, with this approach, the loss of some of the content
information located in the main text is possible. However, it is known [2] that the titles of scientific
papers well reflect the thematic focus and the main idea of the authors, and the quality of clustering-
classification by headings of reports is in good agreement with the results obtained in the analysis of
full-text documents. Moreover, recently the processing of individual sentences (and phrases) has
become an important area of work in the field of Data Mining, due to the high popularity of various
messengers and question-answer systems working with laconic messages.
    In this work, a vector model is used for the mathematical approximation of text documents [3,4]:
                                                      𝑥1𝑗
                                               𝑋𝑗 = [ 𝑥𝑖𝑗 ] (1)
                                                     𝑥𝑁𝑗
Here 𝑁 – the number of terms after removing stop words, lemmatization and cutting off single-
frequency words, 𝑥ij – the weight of term i in document j (𝑗 = 1, . . , 𝑀 −
the number of documents in the sample, 𝑖 = 1, . . , 𝑁 ).
    The matrix model was used to describe the sample X:
                                               𝑥11 . . 𝑥1𝑀
                                       𝑋 = ( . . 𝑥𝑖𝑗         . . ) (2)
                                               𝑥𝑁1 . . 𝑥𝑁𝑀
    To date, many different methods have been developed for calculating the weights of terms
(weighting by word frequency, tf-idf - weighting, tfc - weighting, etc.). In this work, tfc-weighing will

                                                     232
be used, which is a normalized version of the popular and widely used tf-idf-weighing in Data Mining
[2,5]:
                                                        𝑀
                                              𝑓𝑖𝑗 ∗ log( )
                                                        𝑀𝑖
                                  𝑥𝑖𝑗 =                              (3)
                                            𝑁              𝑀 2
                                         √∑𝑖=1[𝑓𝑖𝑗 ∗ log ( )]
                                                           𝑀𝑖
   Here 𝑓𝑖𝑗 - frequency of occurrence of the word i in document j and 𝑀𝑖 – the number of documents
that contain the word i. Summation in the denominator of the fraction is carried out in overall terms of
the document j in which the word i occurs.
   Our study tests the following assumptions:
      the topics of the reports correspond to the title of the conference section,
      the topics of the reports correspond to the specialization of the department,
      the topics of the reports correspond to the subject area to which several sections belong at once
        (there is a significant overlap between the topics of the sections and, as a result, similar reports
        appear in different sections).
   Further, in the work, the clustering and visualization of the titles of the reports are carried out, the
number of clusters is determined, the name of the cluster is given based on the most frequent terms, the
proximity of the resulting clusters and the degree of their correspondence with the names of sections
and departments are analyzed [6,7]. Vector models and tfc-weighting of terms are used to represent text
documents [8].

3. Analysis of Initial Data Using Data Mining Tools
     First of all, we will visualize the initial sample based on metadata, using an expert assessment of
the reports belonging to a section. For visualization, we will apply the method of principal components
(PCA) [3,4].
     The visualization result, shown in Figure 1, shows the absence of obvious clusters corresponding
to the sections and suggests the presence of reports that are "close" to several topics at once, as well as
works that are uncharacteristic (atypical) for the sample under consideration.


Figure 1: Projection of a sample into two-dimensional space with PCA


                                                     233
     Let us formulate the following assumption: each section has a “core” of documents that are quite
close to each other (and, in general, correspond to the section name), and a “periphery” when the texts
differ from the main topics. To test this assumption, we use hierarchical cluster analysis, which allows
us to analyze in more detail the original sample using a dendrogram constructed using the cosine
measure of proximity and combining clusters using a weighted pairwise average [9].
     Figure 2 shows a dendrogram that was obtained using the cosine measure of proximity and with
the union method - weighted pairwise average. The results presented in Figure 2 only partially confirm
the earlier assumption. It can be seen that there are fairly close articles that are combined into (very)
small clusters (from 2 to 5 articles). In general, the source data can be combined into 5 large clusters.
However, the “model” of such clusters will not correspond to our assumption of the presence of a “core”
and “periphery”. The result of the hierarchical cluster analysis and the visualization performed suggest
that the "model" is a combination of small groups of (related) documents distant from each other into a
rather heterogeneous formation (a blurred cloud stretched in the feature space).


Figure 2: Hierarchical clustering of the original sample

    The next study includes the use of data mining methods for the analysis of the initial sample without
taking into account expert assessments. An automatic division of a set of reports into terminologically
close groups (clustering problem) is considered. To carry out clustering, we use the method of K-means
and latent semantic analysis (LSA) [10,11]. As before, we are interested in the distribution of documents
into five clusters (the number of sections) using data mining.
    The clustering results presented in Figures 3 and 4 allow us to conclude the presence of 5 topics of
reports, which corresponds to the initial number of sections.
    Let's analyze the grouping of articles and give a name to the resulting clusters:
    1. Subject "Simulation". High-frequency cluster words – simulation, modeling, model, method.
    2. Subject "Computer systems". High-frequency cluster words - system, method, detection,
         research, implementation, computational.
    3. Subject "Spacecraft control systems". High-frequency words of the cluster - system, control,
         apparatus, analysis, space.
    4. Subject "Algorithms". High-frequency cluster words - algorithm, implementation, research,
         method, analysis.
    5. Subject "Information systems". High-frequency words of the cluster - system, control, decision,
         information, process.


                                                    234
     Only one cluster ("Simulation") coincided with the original section names. The rest of the clusters
turned out to be "interdepartmental" and "intersectional". On the whole, they characterize rather well
the direction of the ICTI activity, reflecting the high intersectionality of research. Along with well-
interpreted constellations, a rather unexpected cluster associated with spacecraft control has also
formed. Among the main specializations of the departments of ICTI, there are no aerospace topics. Let's
analyze the resulting cluster in more detail. It consists of five articles, which were reported in the
sections "Computing machines, networks and systems" and "Control and informatics in technical
systems." The work was carried out by different research teams on the following topics: "research of
data filtering methods", "motion control using solar sensors". When the number of clusters is reduced
to three, reports on "spacecraft" are added to the "Information systems" cluster. To check the "stability"
of the aerospace cluster, it is necessary to analyze the proceedings of the next conferences and
estimate the number of papers submitted annually.


Figure 3: K-Means Clustering


Figure 4: Clustering by LSA method

    Additionally, analyzing the initial distribution of articles, the following conclusions can be drawn:
     The topics of reports in half of the cases (53%) correspond to the title of the conference section.
       Also, here we can talk about the essential interdisciplinarity of the sections, for example, the

                                                    235
        directions "Applied Mathematics" and "Mathematical Simulation" are close. Some topics can
        fit most of the sections, for example, "Neural Networks".
       The themes of the reports correspond to the specialization of the department. For the department
        "Mathematical Simulation" there is a great correspondence between the topics of reports and
        specialization, for other departments a high level of compliance is also revealed. This suggests
        that authors are more likely to report at those sections that their department heads, and not at
        those that are optimal for them in terms of the topic.

4. Conclusions
    The studies carried out show that the initial topics of the reports do not always coincide with the
topics of the section, but most often correspond to the specialization of the department (in particular,
those carried out at the department of research and development). The results obtained are in good
agreement with similar works [12,13,14,15]. So, in [13], using LSA, the correspondence of the titles of
reports and sections of conferences "Mathematical Methods of Pattern Recognition" was checked based
on the analysis of bibliographic descriptions (titles, annotations, keywords). The automatically
constructed clusters differed significantly from the original headings of the conference. It seems that
the distribution of scientific articles by sections of the conference is most often carried out quite
subjectively and is not confirmed by the results that are obtained when using the data mining tools (in
particular, when carrying out automatic clustering) [12,13,15]. Of course, the exclusion of expert
assessments from the process of distributing articles into sections is hardly advisable. However, the use
of combined approaches, including, along with expertise, the use of data mining means, will contribute
to the formation of sections "declared" by the organizers and increase the thematic proximity of the
works reported at one session.

5. References
[1] Collection of abstracts of the XXVI international scientific and technical conference of students
    and graduate students Radio electronics, electrical engineering, and energy.
    https://reepe.mpei.ru/Pages/default.aspx.
[2] G. Salton, The SMART retrieval system: Experiments in automatic document processing.
    Englewood Cliffs, N.J.: Prentice-Hall, 1971. DOI:10.1109/TPC.1972.6591971.
[3] C.C. Aggarwal, Machine Learning for Text. Springer, 2018. 452 р. DOI:10.1007/978-3-319-
    73531-3.
[4] Flach P. Machine learning – The Art and Science of Algorithms that Make Sense of Data.
    Cambridge University Press; 1st edition (September 1, 2012) DOI:10.1017/CBO9780511973000.
[5] K. Chen, Z. Zhang, J. Long, H. Zhang, Turning from TF-IDF to TF-IGM for term weighting in
    text classification, Expert Syst. Appl. 2016, 66, 245–260.
[6] S.R. Nair, G. Gokul, A.A. Vadakkan, A.G. Pillai, M.G. Thushara, Clustering of research
    documents – a survey on semantic analysis and keyword extraction. 6th International Conference
    for        Convergence          in       Technology,       Maharashtra,        India,       2021.
    DOI:10.1109/I2CT51068.2021.9418197.
[7] R. Lakshmi, S. Baskar, Efficient text document clustering with new similarity measures.
    International Journal of Business Intelligence and Data Mining. 2021. V. 18. № 1. РР. 109-126
    DOI:10.1504/IJBIDM.2021.111741.
[8] K. Aas, L. Eikvil. Text Categorization: A Survey. Norwegian Computing Center. Oslo. 1999, p.1–
    37.
[9] P.Kozlov, A. Mokhov, V. Tolcheev, Detection of the thematic groups in scientific publications,
    Russian Advances in Fuzzy Systems and Soft Computing: Selected Contributions to the 8th
    International Conference on Fuzzy Systems, Soft Soft Computing and Intelligent Technologies,
    FSSCIT 2020; Smolensk; Russian Federation; 29 June 2020 до 1 July 2020; Код 165822. Volume
    2782, 2020, Pages 278-285


                                                    236
[10] D. Zeebaree, H. Haron, A. Abdulazeez. Combination of K-means clustering with Genetic
     Algorithm: A review. International Journal of Applied Engineering Research, Volume 12, Number
     24, 2017, pp. 14238-14245.
[11] C. Li, Y. Lu, J. Wu, Y. Zhang, Z. Xia, P. Liu, T. Wang, D. Yu, X. Chen, J. Guo. LDA meets
     Word2Vec: a novel model for academic abstract clustering. Web Conference, 2018, p. 1699-1706.
[12] A.A. Kuzmin, A.A. Aduenko, V.V. Strijov, Thematic Classification for EURO/IFORS Conference
     Using Expert Model, Conference of the International Federation of Operational Research
     Societies, 2014, р. 175.
[13] M. V. Korotchenkov, E. Kh. Khunov, Identification of topics of scientific documents based on
     latent semantic analysis, Radio electronics, electrical engineering, and energy: Abstracts of the
     twenty-seventh international scientific and technical conference of students and graduate students,
     Moscow, NRU "MPEI ", 11-12 March 2021, p. 254.
[14] S. Henry, B.T. McInnes. Literature-based discovery: models, methods, and trends. Journal of
     biomedical informatics, vol.74, 2017, pp.20-32. DOI: 10.1016/j.jbi.2017.08.011.
[15] M. Kamada, M. Isonuma, K. Asatani, I. Sakata, Discovering Interdisciplinarily Spread Knowledge
     in the Academic Literature. DOI:10.1109/ACCESS.2021.3110111.


                                                   237