=Paper=
{{Paper
|id=Vol-2543/spaper12
|storemode=property
|title=Determination of Thematic Proximity of Scientific Journals and Conferences Using Big Data Technologies
|pdfUrl=https://ceur-ws.org/Vol-2543/spaper12.pdf
|volume=Vol-2543
|authors=Alexander Kozitsyn,Sergey Afonin,Dmitry Shachnev
|dblpUrl=https://dblp.org/rec/conf/ssi/KozitsynAS19
}}
==Determination of Thematic Proximity of Scientific Journals and Conferences Using Big Data Technologies==
<pdf width="1500px">https://ceur-ws.org/Vol-2543/spaper12.pdf</pdf>
<pre>
      Determination of Thematic Proximity of Scientific
    Journals and Conferences using Big Data Technologies

 A.S. Kozitsin [0000-0002-8065-9061], S.A. Afonin [0000-0003-3058-9269] and
                     D.A. Shachnev [0000-0002-5940-9180]


            1 Research Institute of Mechanics, Lomonosov Moscow State University

               alexanderkz@mail.ru, serg@msu.ru, mitya57@gmail.com


        Abstract. The number of journals published in the world is very large. In this
        regard, a software toolkit is needed that will allow thematic links of journals to
        be analyzed. The algorithm developed by the authors and presented in this work
        uses a graph of co-authorship to analyze the thematic proximity of journals. The
        algorithm is insensitive to the language of the journal and selects similar jour-
        nals in different languages, which is difficult to implement for algorithms based
        on the analysis of full-text information. The algorithm was tested in the scien-
        tometric system IAS «ISTINA». In the interface developed for these purposes,
        the user can select one journal that is close to them by subject, and the system
        will automatically generate a selection of journals that may be of interest to the
        user both from the point of view of studying the materials available in them and
        from the point of view of publishing their own articles. In the future, the devel-
        oped algorithm can be adapted to search for related conferences, collections of
        publications and scientific projects. The presence of such a tool will increase
        the publication activity of young employees, increase the citation of articles and
        citation between journals. The results of the algorithm for determining the the-
        matic proximity between journals, collections, conferences and scientific pro-
        jects can also be used to build rules in models for differentiating access to data
        based on domain ontologies.

        Keywords: Thematic Classification, Bibliographic Data, Co-authorship Graph,
        Information Systems.


1       Introduction

The number of currently published scientific journals is very large. For example, in
the information and analytical system (IAS) «ISTINA» [1] more than 70 thousand
scientific journals and another 200 thousand different collections of scientific publica-
tions and conference materials are registered. In this regard, young scientists, graduate
students and students need services that will automatically select the journals that are
most relevant for their scientific interests. To solve this problem, the accumulated
experience of the entire scientific community can be used.

Copyright © 2020 for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


    There are several possible ways to solve the problem of determining the thematic
proximity of journals. The first method is based on the use of thematic analysis of
full-text descriptions of journals, texts of articles published in journals, their
abstracts and keywords. Based on the results of a full-text thematic analysis of such texts,
it is possible to construct an assessment of the semantic proximity of the interests of
users and publications in the journal. To be able to conduct such an analysis, the user
must describe the area of his scientific interests with the help of keywords or upload
the full texts of his articles to the system. In addition, it is necessary to have fairly
accurately described thematic profiles of all journals or the full texts of articles pub-
lished in these journals. Obtaining sufficiently complete full-text data is a difficult
task, since in many journals open publication of the full texts of articles is not permit-
ted.
    Using only keywords for thematic analysis can give results that are too general. In
many cases the keywords of an article characterize not its topic but its relationship to
one of the priority areas for the development of science, technology and engineering in
the Russian Federation. For example, the keyword “Nanotechnology” is found in articles
on completely different subjects: “Development and production of new nanostructured
diamond-like carbon coatings of tribological purpose”; “The development of new medical
nanotechnology for the defeat of cancer cells in pediatric acute lymphoblastic leukemia”;
“The use of radionuclides and sources of ionizing radiation in nanochemistry, nuclear
medicine and for the study of processes occurring in the environment”; “Development and
creation of supersensitive field and charge nanostructures for reading and sensor devices
of nanoelectronics.”
    In this regard, the use of full-text thematic analysis to solve the above problem in
scientometric systems is difficult to implement.
    An alternative method for assessing the thematic proximity of journals is to analyze
the co-authorship graph of the articles published in these journals. This method assumes
that most authors publish their articles in thematically related journals. As a result,
thematically similar journals often publish the same authors. In contrast to the methods
of full-text subject analysis, the approach based on co-authorship graphs does not require
full-text information about articles and uses only the bibliographic data of articles
published in journals. Such data can be obtained from scientometric systems (for example,
IAS «ISTINA») or citation systems (for example, WoS).


2      Algorithm for Assessing the Thematic Proximity of Scientific
       Journals

Formally, the problem of assessing the proximity of journals can be formulated as
follows. It is necessary to construct a graph whose vertices are the journals, and the
weights of the edges correspond to their thematic proximity.
   At the first step, the algorithm finds, for each pair of journals, all pairs of
articles published in these journals by the same author. If only one pair of articles
corresponds to a pair of journals, then these journals are considered unrelated. If a pair of
journals corresponds to several pairs of articles, then journals are considered to be
connected by an edge with a certain weight.
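
   The first step described above can be sketched as follows. This is only an
illustration, not the authors' PL/SQL implementation: the record layout
(article, journal, author), the function names and the threshold of two article
pairs (our reading of "several") are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def count_article_pairs(records):
    """Map each journal pair to the set of article pairs sharing an author.

    records: iterable of (article_id, journal_id, author_id) tuples
    (a hypothetical layout chosen for this sketch).
    """
    # author -> journal -> articles of that author in that journal
    by_author = defaultdict(lambda: defaultdict(set))
    for article, journal, author in records:
        by_author[author][journal].add(article)

    pairs = defaultdict(set)
    for journals in by_author.values():
        for ja, jb in combinations(sorted(journals), 2):
            for a in journals[ja]:
                for b in journals[jb]:
                    # A set keeps an article pair only once even if the
                    # two articles share several authors.
                    pairs[(ja, jb)].add((a, b))
    return pairs


def connected_journal_pairs(records, min_pairs=2):
    """Journal pairs linked by at least min_pairs article pairs (assumed threshold)."""
    found = count_article_pairs(records)
    return {jp for jp, article_pairs in found.items() if len(article_pairs) >= min_pairs}
```

With this sketch, two journals linked by a single pair of articles are left
without an edge, as in the rule stated above.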
   In the framework of this work, several methods for determining the weight of an
edge were considered. The simplest method is to determine the weight of the edge
equal to the number of unique authors among the corresponding pairs of articles. The
main disadvantage of this method is the inability to take into account the importance
of the authors for each article. In many cases, an article is written mainly by one author,
whose last name is placed first in its bibliographic description. The remaining co-authors
may be only slightly involved in the work on the article, and their main area of scientific
activity may not coincide with its subject.
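
   The simplest weighting, the number of unique authors shared by two journals, might
be computed as in the sketch below. The (article, journal, author) record layout and
the function name are assumptions made for illustration; the sketch does not apply
the "several article pairs" filter from the first step.

```python
from collections import defaultdict
from itertools import combinations


def unique_author_weights(records):
    """Edge weight = number of distinct authors who published in both journals.

    records: iterable of (article_id, journal_id, author_id) tuples
    (a hypothetical layout chosen for this sketch).
    """
    journals_of = defaultdict(set)  # author -> journals the author published in
    for _article, journal, author in records:
        journals_of[author].add(journal)

    weights = defaultdict(int)
    for journals in journals_of.values():
        # Each author contributes exactly 1 to every pair of their journals,
        # regardless of position in the author list.
        for pair in combinations(sorted(journals), 2):
            weights[pair] += 1
    return dict(weights)
```

Every author counts equally here, which is exactly the disadvantage noted above.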
   To test the hypothesis about the importance of the order of authors during a the-
matic analysis, an assessment was made of the proportion of articles in which the
order of authors is determined by the lexicographic order, and not by significance in
the work on the article. From the scientometric system of Moscow State University,
all articles in journals for 2014–2017 with the number of authors from 2 to 7 were
selected for analysis.
   For each number of authors, the percentage L of articles whose author list follows
the lexicographic order was calculated. The results for the different numbers of
authors are shown in Table 1.

               Table 1. The percentage of articles with lexicographical order.
                                Number of authors                      L
                                       2                             24%
                                       3                             16%
                                       4                              9%
                                       5                              6%
                                       6                              6%
                                       7                              3%
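
   A percentage of this kind could be computed as in the following sketch. It assumes
that L is the share of articles whose author list is in alphabetical order and that
each article is given as a list of surnames in byline order; the function name and the
case-insensitive comparison are choices made for the example, not details from the paper.

```python
from collections import defaultdict


def lexicographic_share(articles, min_authors=2, max_authors=7):
    """Percentage of articles with alphabetically ordered bylines, per author count.

    articles: iterable of lists of author surnames in byline order
    (a hypothetical input format chosen for this sketch).
    """
    total = defaultdict(int)
    ordered = defaultdict(int)
    for authors in articles:
        k = len(authors)
        if not min_authors <= k <= max_authors:
            continue  # the study considered articles with 2 to 7 authors
        total[k] += 1
        keys = [name.casefold() for name in authors]
        if keys == sorted(keys):
            ordered[k] += 1
    return {k: 100.0 * ordered[k] / total[k] for k in sorted(total)}
```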

   From the data given in the table, we can conclude that in most cases the main author
is the one listed first in the bibliographic description. To account for this, a formula
for calculating the weight of the edges was developed that takes into account the position
of the author in the bibliographic description of the article. The weight of an author in
an article is defined as 1/2 + 1/(2K) for the first author and 1/(2K) for each of the
remaining authors, where K is the number of co-authors of the article. The degree of
connection of a given author with two journals is defined as the minimum of the maxima of
the author's weights over the subsets of articles in each journal. The final weight of the
edge connecting two journals is calculated as the sum of these degrees of connection over
all authors.
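
   The position-aware weighting can be illustrated by the sketch below. It assumes
that K counts all authors of the article including the first one (so that the weights
of one article sum to 1) and that the degree of connection is the minimum of the two
per-journal maxima; the input format and function names are hypothetical, and the
"several article pairs" filter from the first step is omitted.

```python
from collections import defaultdict
from itertools import combinations


def author_weight(position, k):
    """Weight of the author at 0-based position in an article with k authors."""
    base = 1.0 / (2 * k)
    return 0.5 + base if position == 0 else base


def edge_weights(articles):
    """Sum over authors of min(max weight in journal A, max weight in journal B).

    articles: iterable of (journal_id, [author ids in byline order])
    (a hypothetical layout chosen for this sketch).
    """
    best = defaultdict(dict)  # author -> journal -> maximum weight seen
    for journal, authors in articles:
        k = len(authors)
        for position, author in enumerate(authors):
            w = author_weight(position, k)
            if w > best[author].get(journal, 0.0):
                best[author][journal] = w

    weights = defaultdict(float)
    for per_journal in best.values():
        for ja, jb in combinations(sorted(per_journal), 2):
            weights[(ja, jb)] += min(per_journal[ja], per_journal[jb])
    return dict(weights)
```

For example, an author who is first of two authors in one journal (weight 3/4) and
second of two in another (weight 1/4) contributes 1/4 to the edge between them.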


3       Software Implementation and Test Results

When choosing a language for the software implementation of the algorithm, such
features of the algorithm as the large amount of processed data, the need for quick
access to the data stored in the DBMS, the small requirements for the amount of
memory to create temporary data structures, and the absence of the need to conduct a
dialogue with the user were taken into account.
   Given these requirements, PL/SQL was chosen for implementation. The thematic
proximity between the journals is calculated at specified time intervals and stored in
the DBMS tables.


Fig. 1. Search for related journals.

   In the interface developed for these purposes [2], the user can select one journal
that is close to their subject, and the system will automatically generate a selection of
journals that may be of interest to the user both for studying the materials contained in
them and for publishing their own articles (Fig. 1).


The web interface is implemented using the open DataTables library [3]. A link has
been added to the information card of each journal to go to a table with a list of the-
matically similar journals. This table shows the names of related journals and
measures of similarity. In addition, to make it possible to quickly assess the authority
of each journal in the list, the table shows the number of publications in the journal
over 5 years (as registered in the ISTINA system), as well as data from the Web of
Science and the Russian Science Citation Index. To make it convenient to navigate the
graph of journal proximity, the interface also allows the user to follow links to a list
of similar journals directly from each element of the list. Quick filtering by part of
the journal name is implemented by means of the DataTables library.
   For the convenience of the user, it is possible to add the selected journal to notes,
which can later be viewed, edited, and also used for subsequent searches. Additional-
ly, it is possible to select conferences similar in topic.
   The software implementation of the algorithm was tested according to the following
methodology. From the obtained results, 200 links between pairs of journals were randomly
selected. Experts manually assessed how well the topics of the linked journals coincide,
assigning points (2 - accurate; 1 - not entirely accurate; 0 - error). The total score
was divided by twice the number of analyzed links. The accuracy obtained with this
technique was 78%.
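
   The accuracy measure used in this test amounts to the following calculation, shown
here as a small sketch; the function name and the input validation are illustrative
additions, and the sample scores in the usage note are made up.

```python
def expert_accuracy(scores):
    """Total expert score divided by twice the number of assessed links.

    scores: list of expert marks, each 2 (accurate), 1 (not entirely
    accurate) or 0 (error).
    """
    if not scores:
        raise ValueError("at least one assessed link is required")
    if any(s not in (0, 1, 2) for s in scores):
        raise ValueError("scores must be 0, 1 or 2")
    return sum(scores) / (2 * len(scores))
```

For instance, four links marked 2, 2, 1 and 0 give 5 / 8 = 0.625.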
   As an example of algorithm errors, one can cite the list of journals determined to be
thematically similar to «Proceedings of the Higher School of the USSR Ministry of Internal
Affairs»: «Philosophical Sciences»; «Logical research»; «Proceedings of MSTU "MAMI"»;
«Logical and philosophical research»; «Bulletin of Moscow University. Series 7:
Philosophy». Such errors may arise when the subject area of the articles accepted by a
journal is too broad.


4      Conclusion

The algorithm described in this paper allows us to automatically evaluate the degree
of thematic proximity of scientific journals based on bibliographic descriptions of
articles and without using full-text versions of articles. It should be noted that the
algorithm is insensitive to the language of the journal and selects similar journals in
different languages, which is difficult to implement for algorithms based on the analy-
sis of full-text information.
   In the future, the developed algorithm can be adapted to search for related confer-
ences, collections of publications and scientific projects. The presence of such a tool
will increase the publication activity of young employees, increase the citation of
articles and citation between journals.
   The results of the algorithm for determining the thematic proximity between jour-
nals, collections, conferences and scientific projects can also be used to build rules in
models of differentiating access to data based on domain ontologies [4].
   This work was supported by the Russian Foundation for Basic Research, project
18-07-01055.


References
1. Sadovnichiy, V., Vasenin, V.: Intelektualnaya sistema tematicheskogo issledovanija
   naukometricheskih dannyh: predposylki sozdanija i metodologija razrabotki. Part 1.
   Programnaja inzhenerija, 9(2), pp. 51–58 (2018).
2. IAS ISTINA. https://istina.msu.ru, last accessed 2019/11/10.
3. Library datatables. https://datatables.net/, last accessed 2019/11/10.
4. Afonin, S.: Ontology models for access control systems. In: 2018 3rd Russian-Pacific Con-
   ference on Computer Technology and Applications (RPC), pp. 1–6, Vladivostok (2018),
   https://doi.org/10.1109/RPC.2018.8482178.

</pre>