Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


      CLUSTERING IN ONTOLOGY-BASED ANALYSIS OF
           RESEARCH PROJECT DESCRIPTIONS
              P. Lula1,a, J. Tuchowski1, U. Cieraszewska1, M. Talaga 1
                               1
                                   Cracow University of Economics, Poland

                                    E-mail: apawel.lula@uek.krakow.pl


Ontology-based approach in exploratory analysis of textual data can significantly improve the quality
of the obtained results. On the other hand, the use of domain knowledge defined in the form of
ontologies increases the time needed to prepare a model and makes required calculations more
complex. The publication will discuss selected aspects of cluster analysis performed on documents
automatically annotated using ontologies. It seems that methodological aspects of cluster analysis
process, especially the way in which distances are determined, should depend on the structure of a
given ontology. Three cases involving the use of ontologies with linear, hierarchical and network
structures will be discussed. The methodological aspects of ontology-based cluster analysis of text
documents was used for analysis of projects’ descriptions related to the area of economics and
registered in the period 2019-2021. Only Horizon and Framework Program projects were included.


Keywords: scientific productivity, ontology-based cluster analysis, CORDIS, JEL


                              Paweł Lula, Janusz Tuchowski, Urszula Cieraszewska, Magdalena Talaga


                                                             Copyright © 2021 for this paper by its authors.
                    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                                   467
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


1. Introduction
        Scientific text annotation has become an important task for scientists. There is an increasing
need for the development of intelligent systems to support new scientific findings [1]. Currently,
ontologies are viewed as a shared and common understanding of a domain that can be communicated
between people and heterogeneous and distributed application systems [2]. Public databases available
on the Web provide useful data. Text annotation may help as it relies on the use of ontologies to
maintain annotations based on a uniform vocabulary.
         Clustering text documents into different category groups is an important step in indexing,
retrieval, management and mining of abundant text data on the Web or in corporate information
systems. Among others, the challenging problems of text clustering are big volume, high
dimensionality and complex semantics [3].
        Nowadays, there is an increasing need for decision support systems to guide the investments
on new scientific research projects. They need to extract useful information from many different
resources. One such resource that allows you to see in which directions research ideas are developing
is the Community Research and Development Information Service (CORDIS) [4] which is the
European Commission's primary source of results from the projects funded by the EU's framework
programmes for research and innovation. It has a rich and structured public repository with all project
information.


2. Methodological aspects of ontology-based cluster analysis of text
documents
        The main assumption which was made by the authors is that the cluster analysis of documents
is supported by domain knowledge represented by an ontology. Starting from this assumption the
following stages in the analysis process can be defined:
    ●   corpus preparation,
    ●   ontology building (or ontology selection),
    ●   documents’ annotation,
    ●   distance matrix calculation,
    ●   conducting a clustering process.
       In the corpus preparation phase all documents were transformed to pure text format coded in
UTF-8 format. Next all words were transformed to their base form (lemmatization process). Also,
numeric values and punctuation marks were omitted.
        Providing of a proper ontology is the main goal of the next step. It seems that the adoption of a
widely accepted ontology is better than building a new ontology designed exclusively for a given
research process.
        Next, an annotation process should be conducted. During this stage concepts from a given
ontology should be assigned to words or phrases in documents. There are many techniques which can
be used for implementing annotation task, but rule-based technique is the most popular.
        Ontology classification into three classes (with linear, hierarchical or network structure) is
very important from the perspective of cluster analysis because ontology’s type determines the way of
distance calculation.
        For ontologies with linear character (gazetteers) it is only possible to check if two concepts are
the same or not. For ontologies having hierarchical structure there are two popular approaches used to
calculate distances between concepts. First based on the length of a path connecting to concepts. And
the second which is based the information theory. It seems that for ontologies with network structure,


                                                   468
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


the most convenient way of distance calculation is based on the length of path between nodes
representing two concepts.
       Having a measure between concepts defined, the ontology-based similarity measure between
two documents should be specified. Let’s assume that the set 𝐷𝑖 contains all concepts occurring in a 𝑖-
th document. Then a similarity between two documents can be defined depending on the type of a
given ontology.
        For linear ontologies distances between documents can be calculated as:
    ●   Jaccard distance:

                                                                 |𝐷1 ∩ 𝐷2 |
                                          𝑑𝑖𝑠𝑡(𝐷1 , 𝐷2 ) = 1 −
                                                                 |𝐷1 ∪ 𝐷2 |

    ●   Hamming distance:
                                            𝑑𝑖𝑠𝑡(𝐷1 , 𝐷2 ) = |𝐷1 ⨁𝐷2 |


        While for hierarchical or network-based ontologies the following formulas for document
similarity can be used:
    ● average distance between all concepts:

                                   𝑠𝑖𝑚(𝐷1 ,D2 ) = 𝑎𝑣𝑔(𝑐𝑖 , 𝑐𝑗 ), 𝑐𝑖 ∈ 𝐷1 , 𝑐𝑗 ∈ 𝐷2


    ●   average distance between the nearest concepts:

                                        ∑𝑁                          𝑀
                                         𝑖=1 min (𝑠𝑖𝑚(𝑐𝑖 , 𝑐𝑗 )) + ∑𝑗=1 min (𝑠𝑖𝑚(𝑐𝑖 , 𝑐𝑗 ))
                                               𝑗                              𝑖
                       𝑠𝑖𝑚(𝐷1, 𝐷2 ) =
                                                               𝑁+𝑀

    ●   average distance between concepts chosen as a solution of the optimal alignment problem
        defined as:

                            𝑠𝑖𝑚(𝐷1 , 𝐷2 ) = arg min ∑ 𝑠𝑖𝑚(𝑐𝑖 , 𝑐𝑗 ) , 𝑐𝑖 ∈ 𝐷1 , 𝑐𝑗 ∈ 𝐷2
                                               𝑐𝑖 ,𝑐𝑗


        Formulas presented above allow to define similarity matrix between documents. This matrix is
a starting data for distance-based cluster analysis. The authors decided to use hierarchical,
agglomerative approach, mostly Ward’s method.


3. Analysis of Horizon and Framework Program projects related to the
area of economics registered in the CORDIS database
        The methodology presented in the previous section was used for analysis of projects’
descriptions related to the area of economics and registered in the period 2019-2021. Only Horizon
and Framework Program projects were included. The total number of projects was 292.


                                                        469
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


        All documents were annotated with the use of the JEL ontology [5]. During this process all
concepts defined in the ontology were identified. Next, the significance of main concepts was
evaluated by calculating an average number of occurrences for concepts belonging to every main
class. The results are presented in Figure 25.

                                                                                                  JEL main concepts significance


                         0.08


                         0.06


                         0.04


                         0.02


                         0.00
                                                                                                               Root/JEL_G


                                                                                                                                         Root/JEL_I


                                                                                                                                                                                                                       Root/JEL_O


                                                                                                                                                                                                                                                 Root/JEL_Q
                                                                                                                                                                                             Root/JEL_M
                                                                                                                                                      Root/JEL_J


                                                                                                                                                                                Root/JEL_L
                                                                                                  Root/JEL_F


                                                                                                                                                                                                                                                                                        Root/JEL_Z
                                 Root/JEL_A
                                              Root/JEL_B


                                                                                     Root/JEL_E


                                                                                                                                                                   Root/JEL_K


                                                                                                                                                                                                                                    Root/JEL_P


                                                                                                                                                                                                                                                                           Root/JEL_Y
                                                           Root/JEL_C
                                                                        Root/JEL_D


                                                                                                                            Root/JEL_H


                                                                                                                                                                                                          Root/JEL_N


                                                                                                                                                                                                                                                              Root/JEL_R
        Figure 25. JEL main concepts significance for the whole corpus of project descriptions
        For annotated documents cluster analysis may be performed with the use of Hamming
distance and Ward’s method. The results are presented on Figure 26.


                Figure 26. Dendrogram presenting the structure of project descriptions
         The shape of the dendrogram suggest that the division of descriptions into two groups. The
evaluation of clustering process quality based on silhouette coefficients shows that the structure of
clusters is rather weak. It means that clusters are overlapping.


                                                                                                                            470
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


                                                                                                                                                   Silhouette plot of (x = groups, dist = d)
                                                                                                                                                   n = 292                                                                                                                                                                                                                                                                                                  2 clusters Cj
                                                                                                                                                                                                                                                                                                                                                                                                                                                             j : nj | avei Cj si


                                                                                                                                                                                                                                                                                                                                                                                                                                                                  1 : 154 | 0.24


                                                                                                                                                                                                                                                                                                                                                                                                                                                            2 : 138 | 0.004


                                                                                                                                                                                                               0.0                                                               0.2            0.4                                                             0.6                                                                     0.8                                                                     1.0
                                                                                                                                                                                                                                                                                        Silhouette width si

                                                                                                                                                   Average silhouette width : 0.13

       Figure 27. Silhouette plot presenting the quality of project descriptions division into two groups
        For more than two clusters an average value of the silhouette index was smaller and therefore
further analysis was performed for two groups of projects Figure 27. For every group the significance
of JEL main concepts was estimated. The results are presented on Figure 28.
                                                         JEL main concepts significance - group: 1                                                                                                                                                                                                                                                 JEL main concepts significance - group: 2


0.10
                                                                                                                                                                                                                                                                                         0.10

0.08
                                                                                                                                                                                                                                                                                         0.08

0.06
                                                                                                                                                                                                                                                                                         0.06


0.04
                                                                                                                                                                                                                                                                                         0.04


0.02                                                                                                                                                                                                                                                                                     0.02


0.00                                                                                                                                                                                                                                                                                     0.00
                                                                                          Root/JEL_G


                                                                                                                    Root/JEL_I


                                                                                                                                                                                                  Root/JEL_O


                                                                                                                                                                                                                             Root/JEL_Q


                                                                                                                                                                                                                                                                                                                                                                                    Root/JEL_G


                                                                                                                                                                                                                                                                                                                                                                                                              Root/JEL_I


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            Root/JEL_O


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      Root/JEL_Q
                                                                                                                                                                        Root/JEL_M


                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Root/JEL_M
                                                                                                                                 Root/JEL_J


                                                                                                                                                                                                                                                                                                                                                                                                                           Root/JEL_J
                                                                                                                                                           Root/JEL_L


                                                                                                                                                                                                                                                                                                                                                                                                                                                     Root/JEL_L
                                                                             Root/JEL_F


                                                                                                                                                                                                                                                                    Root/JEL_Z


                                                                                                                                                                                                                                                                                                                                                                       Root/JEL_F


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             Root/JEL_Z
           Root/JEL_A
                        Root/JEL_B


                                                                Root/JEL_E


                                                                                                                                              Root/JEL_K


                                                                                                                                                                                                                Root/JEL_P


                                                                                                                                                                                                                                                       Root/JEL_Y


                                                                                                                                                                                                                                                                                                      Root/JEL_A
                                                                                                                                                                                                                                                                                                                   Root/JEL_B


                                                                                                                                                                                                                                                                                                                                                          Root/JEL_E


                                                                                                                                                                                                                                                                                                                                                                                                                                        Root/JEL_K


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         Root/JEL_P


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                Root/JEL_Y
                                     Root/JEL_C
                                                   Root/JEL_D


                                                                                                       Root/JEL_H


                                                                                                                                                                                     Root/JEL_N


                                                                                                                                                                                                                                          Root/JEL_R


                                                                                                                                                                                                                                                                                                                                Root/JEL_C
                                                                                                                                                                                                                                                                                                                                             Root/JEL_D


                                                                                                                                                                                                                                                                                                                                                                                                 Root/JEL_H


                                                                                                                                                                                                                                                                                                                                                                                                                                                                               Root/JEL_N


                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Root/JEL_R


                                                  Figure 28. JEL main concepts significance for two clusters of project descriptions
        The analysis of descriptions assigned to every group confirms previous observation regarding
high similarity between clusters.

4. Conclusions
                          The results obtained during the analysis show that:
       ● ontology-based approach allows to perform the analysis of project descriptions to identify
         concepts related to a given research domain,
    ● Hamming distance and Ward’s method can be used for cluster analysis of documents
         annotated with automatically identified ontology concepts,
    ● silhouette coefficients inform about the quality of document clusters identified by cluster
         analysis methods.
        The authors are going to develop the system presented here by adding modules performing
concepts’ identifications defined in other domain ontologies (MeSH or CSO). Also the analysis of
relationships between concepts derived from more than one ontology will be ensured in future
solutions.


                                                                                                                                                                                                                                                                                       471
Proceedings of the 9th International Conference "Distributed Computing and Grid Technologies in Science and
                           Education" (GRID'2021), Dubna, Russia, July 5-9, 2021


References
[1] P. C. e C. Gomes, A. M. de C. Moura, and M. C. Cavalcanti, ‘A multi-ontology approach to
    annotate scientific documents based on a modularization technique’, J. Biomed. Inform., vol. 58,
    pp. 208–219, Dec. 2015, doi: 10.1016/j.jbi.2015.09.022.
[2] R. García, ‘A Semantic Web Approach to Digital Rights Management’, 2006.
[3] L. Jing, L. Zhou, M. K. Ng, and J. Z. Huang, ‘Ontology-based Distance Measure for Text
    Clustering’, 2006.
[4] ‘European Commission : CORDIS : Search : Results page’. https://cordis.europa.eu/projects/en
    (accessed Sep. 13, 2021).
[5] ‘Journal        of        Economic            Literature’.               [Online].          Available:
    https://www.aeaweb.org/econlit/jelCodes.php?view=jel


                                                   472