Towards a Recommender System for the Choice of UDC
            Code for Mathematical Articles

    Olga Nevzorova 1[0000-0001-8116-9446] and Damir Almukhametov1[0000-0002-4888-7937]
          1 Kazan Federal University, Kremlevskaja str., 18, Kazan, 420008, Russia

                          [onevzoro, dnlanik]@gmail.com


       Abstract. Authors of scientific papers in mathematics commonly use the
       Universal Decimal Classification (UDC) scheme to search for related articles.
       UDC is a hierarchical classification scheme that allows librarians and editors
       to assign one or more codes to a publication. Typically, the classification
       code identifies the subject editor responsible for reviewing an article
       submitted to a scientific journal. In this article, we explore a new approach
       to assigning a UDC code to a mathematical paper, based on the OntoMathPRO
       ontology.
           OntoMathPRO is an applied ontology for the automatic processing of
       professional mathematical articles in Russian and English. The ontology defines
       concepts commonly used in mathematics, as well as the evolving and not yet
       established vocabulary extracted from contemporary scientific articles.
       OntoMathPRO covers a wide range of areas of mathematics, such as number theory,
       set theory, algebra, analysis, geometry, computation theory, differential
       equations, numerical analysis, probability theory, and statistics. Each class
       has a textual explanation and Russian and English labels, including synonyms.
           We investigated a set of classifying features, represented as ontology
       concepts, and identified the most relevant ones for constructing code maps of
       some UDC codes in the field of mathematics. We found that the code maps of the
       considered UDC codes can be built from the selected features (method, equation,
       problem), whose values are determined using the OntoMathPRO ontology. The
       constructed code maps allow the considered UDC codes to be assigned to
       publications successfully.

       Keywords: Recommender system, classification of documents, OntoMathPro
       ontology, Universal Decimal Classification.


1      Introduction

Recommender systems are used in a variety of areas [1]. Recommender systems are
classified as content, collaborative, knowledge-based, and hybrid [2]. Collaborative
filtering approaches build a model from a user's past behavior. This model is used to
predict items (or ratings for items) that the user may have an interest in. Content-
based filtering approaches utilize pre-tagged characteristics of an item in order to


Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).


recommend additional items with similar properties. Current recommender systems
typically combine two or more approaches into a hybrid system.
   Knowledge-based recommender systems used in science and education are of
particular interest. The classic tasks of such systems are searching for related
articles and building recommendations for the study of educational topics [3].
   In this article, we consider recommender systems focused on publishing and the
preparation of scientific publications [4]. Such systems form the digital infrastructure
of electronic scientific journals, including a software platform that implements the
main workflows for managing an electronic journal, and information systems that
support basic and additional services, taking into account, in particular, the specifics
of the subject area of this journal [5]. One of the important problems is the classifica-
tion of articles submitted to the journal.
   Classification of documents by assigning classification codes is a traditional
way of systematizing and searching knowledge.
   Classifiers are a type of metadata in scientific documents. There are various
national and international universal classification systems. The classification
systems widely used in Russia include the universal Library-Bibliographic
Classification (LBC), the State Rubricator of Scientific and Technical Information
(SRSTI), and the Universal Decimal Classification (UDC).
   The Universal Decimal Classification (UDC) (www.udc.org/index.html) underlies
the systematization of knowledge held in libraries, databases and other
repositories of information. UDC has been adopted for indexing scientific and
technical documents in most countries of the world. In Russia, a UDC code is a
mandatory attribute of all book products and information materials on the natural
and technical sciences. As of the end of 2019, the classifier contained about
126,441 codes. The classification has been translated into more than 50 languages.
   Selecting a classification code requires analyzing the structure of the
classifier tree and is quite time-consuming. In this article, we consider the
problem of automating the selection of the UDC classification code for a
mathematical article based on a special resource, the OntoMathPro ontology of
professional mathematics.


2      Related Work

The interest in the topic of scholarly text classification and recommendation has
grown in recent years. Regarding the classification of scholarly texts according to the
UDC [6], texts are classified by peers based on their keywords. Similarly, biblio-
graphic metadata (title, description and subject tags) can be used to equip texts with
Dewey decimal classification (DDC) to supplement bibliographic records of publica-
tions [7]. The spread of digital resources and their integration into the traditional li-
brary environment has created the need for an automated tool that organizes publica-
tions into library classification schemes.
   A survey of methods, such as content-based, collaborative filtering, graph-based
and hybrid methods can be found in the work of Bai et al. [8]. Analysis of the use of


recommendation-as-a-service for academia is presented in the study by Beel et al. [9].
In [10], a comprehensive summary of the state of the art in deep-learning-based
recommender systems is provided. Machine learning methods are used in various
services of scientific recommender systems [11,12]. In [11], the authors
investigate the feasibility of automatically assigning a coarse-grained primary
classification using the Mathematics Subject Classification (MSC) scheme by
treating the problem as a multiclass classification machine learning task. In
[12], a machine learning model for the automatic classification of old digitized
texts from the Slovenian digital library is discussed. The UDC codes assigned by
human specialists to new scientific texts were used to build a UDC classification
model for the old digitized texts. This model uses various clustering algorithms.
The authors report that the best-performing classifier was an SVM using TF-IDF
(CA = 0.963). In contrast to these works, for the classification of mathematical
articles, we use a different approach based on the OntoMathPro ontology of
professional mathematics [13].


3      Ontology-Based Model for Recognition of UDC Codes

3.1    The OntoMathPro Ontology

The OntoMathPRO ontology is an applied ontology for the automatic processing of
professional mathematical articles in Russian and English. The ontology defines the
concepts commonly used in mathematics. It covers a wide range of fields of
mathematics, such as number theory, set theory, algebra, analysis, geometry, theory
of computation, differential equations, numerical analysis, probability theory, and
statistics. Each class has a textual explanation and Russian and English labels,
including synonyms. The terminological sources used in its development were classic
textbooks, online resources such as Wikipedia and the Cambridge Mathematical
Thesaurus, and articles from scientific journals such as “Russian Mathematics
(Iz. VUZ)”.
   In the ontology, one can distinguish two taxonomies with respect to the is-a
relationship: a hierarchy of fields of mathematics and a hierarchy of mathematical
knowledge objects. The first one is rather conventional and close to the related
part of the Universal Decimal Classification. The top level of the second taxonomy
contains concepts of three types: i) basic metamathematical concepts, e.g. Set,
Operator, Map, etc.; ii) root elements of the concepts related to particular fields
of mathematics, e.g. Element of Probability Theory or Element of Numerical
Analysis; iii) common scientific concepts: Problem, Method, Statement, Formula,
etc. OntoMathPRO defines three types of object properties.
   OntoMathPRO is developed in OWL-DL/RDFS languages. Numerically, OntoMa-
thPRO contains 3 450 classes, 5 object properties, 3 630 subclass-of property instanc-
es, and 1 140 other property instances.


3.2    Main Approach
We examined a collection of 1356 mathematical articles published in the journal
“Russian Mathematics (Iz. VUZ)” over 10 years (1999-2009). Each article has at
least one UDC code. The most frequent UDC code in the collection is 517
(“Analysis”), which is assigned to 883 articles.
   The approach proposed in this article to the automatic recognition of the UDC
code for a mathematical article is based on the use of the OntoMathPro ontology. As
noted above, the ontology contains basic concepts such as problem, system, theory,
equation, formula, etc. The key idea of our approach is that the choice of the UDC
code is determined by a certain set of classifying features that the author of the
article uses. These features are represented in the ontology by basic mathematical
concepts. The task of our research was therefore to select, among the ontological
concepts, the features most relevant to the choice of the UDC code. We asked experts
which features are decisive for them when choosing a UDC code for their own
scientific works, and concluded that the most significant features are the method,
the problem and the equation.
   Therefore, in this article, we investigate the working hypothesis that the methods,
problems and equations used will be the most relevant features to create a map of the
UDC code in the "Mathematics" domain.


3.3    The Architecture of the Prototype for Assessing the Relevance of
       Classifying Features

The overall workflow can be divided into two main subprocesses: preparing
subcollections with the selected UDC codes (Fig. 1) and assessing the relevance of
classifying features (Fig. 2).


                         Fig. 1. The architecture of the prototype.


            Fig. 2. The model of assessing the relevance of classifying features.


The collection preparation process includes five modules that can be combined into
three subsystems: Format Conversion, Preprocessing and Semantic Annotation.
   The Format Conversion subsystem converts a collection of mathematical articles
into XML format. Next, the Preprocessing subsystem sorts articles by the specified
UDC codes. At this stage, morphological analysis of the content of the XML tags is
performed using the pymorphy2 library. The Semantic Annotation subsystem annotates
articles in terms of a fixed set of subject areas of the OntoMathPro ontology. At
this stage, all named entities recognized by the ontology are extracted from the
text of the article, and a document vector is compiled based on the ontology
dictionary.
   Named entity recognition is implemented using fuzzy string comparison. The
modified Levenshtein metric implemented by the fuzzywuzzy library is used as the
comparison measure.
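
The annotation step described above can be illustrated with a minimal sketch. It is
not the authors' implementation: a plain Levenshtein similarity written with the
standard library stands in for the modified metric of fuzzywuzzy, and the ontology
terms, the sample sentence, the 0.85 threshold and the names `levenshtein`,
`similarity` and `annotate` are assumptions made for illustration only.

```python
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def annotate(text: str, ontology_terms: list[str], threshold: float = 0.85) -> Counter:
    """Count ontology terms that fuzzily match word n-grams of the text."""
    tokens = re.findall(r"\w+", text.lower())
    counts: Counter = Counter()
    for term in ontology_terms:
        n = len(term.split())
        for i in range(len(tokens) - n + 1):
            candidate = " ".join(tokens[i:i + n])
            if similarity(candidate, term.lower()) >= threshold:
                counts[term] += 1
    return counts


# Hypothetical ontology labels and input text (not taken from OntoMathPro).
terms = ["boundary value problem", "integral equation", "Galerkin method"]
text = "We solve boundary value problems for integral equations by the Galerkin method."
vector = annotate(text, terms)  # document vector: term -> number of fuzzy matches
```

The fuzzy threshold lets inflected forms such as "problems" match the ontology label
"problem"; in the described pipeline part of this variation would already be removed
by the morphological analysis step.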


3.4    Assessing the Relevance of Features

The system for assessing the relevance of classifying features is shown in Figure
2b. The Filter_MNE module receives as input two collections with different UDC
codes and a list of classifying features. The module produces UDC code maps based
on the selected classifying features. A UDC code map for a classifying feature is
the set of feature values that are recognized in the corresponding subcollection of
articles using the OntoMathPro ontology. The Map_Estimate module compares the UDC
code maps obtained for the two collections. At this stage, the terms common to both
collections and the terms specific to each collection are determined. As a result,
the module forms UDC code maps that take into account the relevance of each term.
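
As a rough illustration of what the Filter_MNE and Map_Estimate steps compute, the
sketch below builds a code map as a set of feature values and splits two code maps
into common and specific terms. The function names, the sample terms and the data
layout (one set of recognized terms per article) are hypothetical and are not taken
from the prototype.

```python
def build_code_map(annotated_articles, feature_terms):
    """Collect the feature values recognized in at least one article."""
    code_map = set()
    for article_terms in annotated_articles:
        code_map |= set(article_terms) & feature_terms
    return code_map


def compare_code_maps(map_a, map_b):
    """Split two code maps into common terms and terms specific to each."""
    return {
        "common": map_a & map_b,
        "specific_a": map_a - map_b,
        "specific_b": map_b - map_a,
    }


# Hypothetical values of the "method" feature and two tiny subcollections.
METHOD_TERMS = {"Galerkin method", "Fourier method", "variational method"}
articles_a = [{"Galerkin method", "boundary value problem"}, {"Fourier method"}]
articles_b = [{"Fourier method", "variational method"}, {"integral equation"}]

map_a = build_code_map(articles_a, METHOD_TERMS)
map_b = build_code_map(articles_b, METHOD_TERMS)
comparison = compare_code_maps(map_a, map_b)
```

Representing code maps as sets keeps the comparison to plain set operations; a
per-term relevance weight could be attached later by switching to a dictionary.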
   Experiments. To test the working hypothesis, we performed several experiments
on the most representative subcollection in our collection of journal articles, the
one with the UDC code 517 (“Analysis”).
   In each experiment, we compared pairwise the UDC code maps constructed for
different subcollections. The choice of UDC codes was based on the position of these
codes in the UDC hierarchy (different first-level subtrees in the code tree with
root vertex 517), their relationship (descendants of one ancestor), and the size of
the subcollections.
   The results of the experiments are presented in diagrams that show the numbers
of common and UDC-specific terms.
   Let us denote by SF the complete code map of a classifying feature, built using
the ontology and including all the values of this feature, and by SFcode the code
map formed from the subcollection with a given UDC code, for example, SF517.
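
One possible reading of the complete code map SF is the set of all descendants of
the feature's root concept in the subclass-of hierarchy of the ontology. The sketch
below follows that reading on a toy hierarchy; the concept names, the dictionary
representation and the function `feature_values` are assumptions for illustration
and do not reproduce the actual content of OntoMathPro.

```python
from collections import deque


def feature_values(root, subclasses):
    """Return all direct and indirect subclasses of `root` (breadth-first)."""
    seen = set()
    queue = deque([root])
    while queue:
        concept = queue.popleft()
        for child in subclasses.get(concept, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy subclass-of hierarchy: parent concept -> direct subclasses (hypothetical).
SUBCLASSES = {
    "Method": ["Numerical method", "Variational method"],
    "Numerical method": ["Finite element method", "Galerkin method"],
}

sf_method = feature_values("Method", SUBCLASSES)  # plays the role of SF
```

A code map SFcode for a particular UDC code would then be the subset of `sf_method`
that is actually recognized in the corresponding subcollection.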
   Thus, we determine the classifying feature and its code map (the characteristic
set of the feature) for a UDC code. Then we compare the code maps for two UDC codes
and compute the relevance of the classifying feature (represented as a fuzzy
linguistic variable with the values “weak”, “valid”, “strong”). The relevance of the
classifying feature for two code maps (Rel_F (code1, code2)) is calculated as:


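\[
  \mathit{Rel\_F}(\mathit{code1}, \mathit{code2}) =
  \frac{|SF_{\mathit{code1}} \cap SF_{\mathit{code2}}|}
       {|SF_{\mathit{code1}} \cup SF_{\mathit{code2}}|}
\]
where |S| denotes the number of terms in the set S.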
If the Rel_F (code1, code2) value is in the [0..0.3] range, then the UDC pair is
strongly distinguishable by this feature (the “strong” value).
   If the Rel_F (code1, code2) value is in the [0.3..0.7] range, then the UDC pair
is moderately distinguishable by this feature (the “valid” value).
   If the Rel_F (code1, code2) value is in the [0.7..1] range, then the UDC pair is
poorly distinguishable by this feature (the “weak” value).
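
The relevance computation can be sketched in a few lines, reading the ratio of the
two code maps as a ratio of set sizes (the Jaccard index). The function names and
the sample sets are hypothetical. The ranges in the text share their end points, so
the sketch assumes that 0.3 is labeled “strong” and 0.7 is labeled “valid”; the text
does not say how these boundary values are handled.

```python
def rel_f(sf_code1: set, sf_code2: set) -> float:
    """Share of common feature values among all values of two code maps."""
    union = sf_code1 | sf_code2
    if not union:
        raise ValueError("both code maps are empty")
    return len(sf_code1 & sf_code2) / len(union)


def relevance_label(value: float) -> str:
    """Map a Rel_F value to the linguistic labels used in the text."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("Rel_F must lie in [0, 1]")
    if value <= 0.3:   # assumption: the boundary 0.3 counts as "strong"
        return "strong"
    if value <= 0.7:   # assumption: the boundary 0.7 counts as "valid"
        return "valid"
    return "weak"


# Hypothetical code maps with two shared values out of five distinct ones.
sf_a = {"m1", "m2", "m3", "m4"}
sf_b = {"m3", "m4", "m5"}
value = rel_f(sf_a, sf_b)
label = relevance_label(value)
```

Note the direction of the scale: a small overlap between the code maps yields the
“strong” label, because the feature then separates the two UDC codes well.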


                            Fig. 3. The results of experiment 1.

Experiment 1. The first experiment involves UDC codes of the same level with
comparable subcollection sizes (UDC 517.51 (89 articles) and 517.54 (87 articles)),
as well as UDC 517.97 (75 articles) from another subdomain. We consider the method,
the equation, and the problem as classifying features. The results of experiment 1
are shown in Fig. 3, and their interpretation is given in Table 1.

                Table 1. Assessing the classifying features in experiment 1.

                       Method                    Problem                Equation
 517.51 & 517.54       weak                      strong                 valid
 517.51 & 517.97       strong                    strong                 valid
 517.54 & 517.97       strong                    valid                  weak

It can be seen that for UDC 517.51 and 517.54 the most relevant feature is the
method, and for 517.97 it is the equation.
   Experiment 2. In this experiment, we consider highly specialized UDC codes:
517.956 (57 papers), 517.958 (59), 517.982 (21) and 517.983 (36). These UDC codes
do not have a large number of representatives in the collection, but due to their
high specialization, we expect them to differ significantly in their features. The
results of experiment 2 are shown in Fig. 4, and their interpretation is given in
Table 2.


                           Fig. 4. The results of experiment 2.

               Table 2. Assessing the classifying features in experiment 2.

                        Method                  Problem                Equation
 517.956 & 517.958      strong                  valid                  valid
 517.956 & 517.982      strong                  strong                 valid
 517.956 & 517.983      strong                  valid                  valid
 517.958 & 517.982      strong                  strong                 strong
 517.958 & 517.983      strong                  weak                   strong
 517.982 & 517.983      valid                   strong                 valid

The analysis of the above results shows that for UDC 517.956 the most relevant fea-
ture is the problem, for UDC 517.958 - methods, and for UDC 517.982 and 517.983 -
equations.
   Experiment 3. In this experiment, we investigated UDC codes of the same level
that share one parent node and have the largest number of representatives in the
collection: 517.92 (129), 517.95 (156) and 517.98 (133). The results of experiment
3 are shown in Fig. 5, and their interpretation is given in Table 3.


                           Fig. 5. The results of experiment 3.


                Table 3. Assessing the classifying features in experiment 3.

                      Method                     Problem                Equation
 517.92 & 517.95      strong                     valid                  weak
 517.92 & 517.98      strong                     weak                   weak
 517.95 & 517.98      weak                       valid                  weak

Analysis of the above results shows that methods and problems are the most relevant
features for the explored UDC codes.
   An important preliminary result of the experiments is the construction of code
maps of the studied UDC codes based on the OntoMathPro ontology (see Table 4). For
each feature, the table gives the total number of values and the number of unique
values.

                   Table 4. Digital code maps of the studied UDC codes.

            Method                  Problem                       Equation
            All    Unique           All            Unique         All          Unique
 517.51     111    10               18             5              32           4
 517.54     112    11               43             4              73           8
 517.97     32     19               59             21             71           9
 517.956    21     10               55             19             53           16
 517.958    113    102              41             5              57           40
 517.982    4      1                6              0              21           1
 517.983    10     5                35             1              27           2
 517.92     36     15               51             4              92           6
 517.95     128    17               66             19             93           7
 517.98     125    14               46             5              76           3


4      Conclusion

The research carried out allows us to conclude that the combination of the selected
features and their values can successfully classify collections by UDC codes. Some
related groups of UDC codes can be classified according to only one feature, but as
the codes become more closely related, the number of required classifying features
increases. We also identified the most relevant features of the UDC code groups, by
which they can be classified in the general UDC code tree.
   The research carried out confirms our hypothesis that a group of mathematical
UDC codes can be classified by features such as “method”, “problem” and “equation”.

Acknowledgement. The research was funded by RSF according to the project № 21-
11-00105.


References
 1. Jie Lu, Dianshuang Wu, Mingsong Mao, Wei Wang, Guangquan Zhang: Recommender
    system application developments: A survey. In: Decision Support Systems, vol. 74, pp.12-
    32 (2015).
 2. Ricci, F.: Recommender Systems: Models and Techniques. In: Alhajj, R., Rokne, J.
    (eds.) Encyclopedia of Social Network Analysis and Mining. Springer, New York, NY
    (2014). https://doi.org/10.1007/978-1-4614-6170-8_88.
 3. Liliana Shakirova, Marina Falileeva, Alexander Kirillovich, Evgeny Lipachev, Olga Nev-
    zorova, Vladimir Nevzorov: Modeling and Evaluation of the Mathematical Educational
    Ontology. In: SSI–2019 Scientific Services & Internet Proceedings of the 21st Conference
    on Scientific Services & Internet (SSI–2019) Novorossiysk-Abrau, Russia, September 23–
    28, 2019. CEUR Workshop Proceedings, vol. 2543, CEUR-WS.org, pp. 305–319 (2020).
 4. Elizarov A.M, Lipachev E.K.: Methods of processing large collections of scientific docu-
    ments and the formation of digital mathematical library. In: CEUR Workshop Proceedings,
    vol. 2543, pp.354–360 (2020).
 5. Elizarov A, Lipachev E.: Big math methods in Lobachevskii-DML digital library. In:
    CEUR Workshop Proceedings, vol.2523, pp.59-72 (2019).
 6. Romanov, A.Y., Lomotin, K.E., Kozlova, E.S. and Kolesnichenko, A.L.: Research of neu-
    ral networks application efficiency in automatic scientific articles classification according
    to UDC. In: International Siberian Conference on Control and Communications, SIBCON
    2016 – Proceedings, pp. 7–11 (2016).
 7. Khoo, M.J., Ahn, J.W., Binding, C., Jones, H.J., Lin, X., Massam, D. and Tudhope, D.:
    Augmenting Dublin core digital library metadata with Dewey decimal classification. In:
    Journal of Documentation, vol. 71, No. 5, pp. 976–998 (2015).
 8. Bai, X., Wang, M., Lee, I., Yang, Z., Kong, X. and Xia, F.: Scientific paper Recommenda-
    tion: a survey. In: IEEE Access, IEEE, vol. 7, pp. 9324–9339, doi:
    10.1109/ACCESS.2018.2890388 (2019).
 9. Beel, J., Aizawa, A., Breitinger, C. and Gipp, B.: Mr. DLib: recommendations-as-a-service
    (RaaS) for academia. In: Proceedings of the ACM/IEEE-CS Joint Conference on Digital
    Libraries (JCDL) (2017).
10. Shuai Zhang, Lina Yao, Aixin Sun, Yi Tay: Deep Learning Based Recommender System:
    A Survey and New Perspectives. In: ACM Computing Surveys, vol. 52, no. 1, pp. 1–38
    (2019).
11. M. Schubotz et al.: AutoMSC: Automatic Assignment of Mathematics Subject Classifica-
    tion Labels. In: Proceedings of the 13th Conference on Intelligent Computer Mathematics
    (2020).
12. Matjaž Kragelj, Mirjana Kljajić Borštnar: Automatic classification of older electronic texts
    into the Universal Decimal Classification–UDC. In: Journal of Documentation, vol. 77, no.
    3 (2021).
13. Olga A. Nevzorova, Nikita Zhiltsov, Alexander Kirillovich and Evgeny Lipachev:
    OntoMathPro Ontology: a Linked data hub for mathematics. In: 5th International
    Conference, KESW 2014, Kazan, Russia, September 29 – October 1, 2014. Proceedings.
    Series: Communications in Computer and Information Science, vol. 468, pp. 105–119.
    Springer (2014).

