<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Intelligent Information System for Generating a Scientist's Scientometrics Using Content Analysis Methods</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mykola Dyvak</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andriy Yushko</string-name>
          <email>a.yushko@wunu.edu.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andriy Melnyk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>West Ukrainian National University</institution>
          ,
          <addr-line>11 Lvivska Street, Ternopil, 46001</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper proposes methods and software tools for developing a scientometric profile of a researcher using content analysis techniques. A scientometric profile is a system of indicators that assesses a researcher's scientific productivity and influence. The growing volume of scientific information in various databases, such as Scopus and Web of Science, has made it challenging to manually track and analyze individual publishing activities. For scientific and higher education institutions, monitoring both the quantity and quality of publications is crucial. Additionally, understanding researchers' main areas of interest helps support their professional development and foster interdisciplinary collaboration. Existing tools for monitoring scientific metrics typically offer limited functionality, lack the ability to process large volumes of data efficiently, and struggle to filter irrelevant information automatically. This paper presents an approach to building a researcher's scientometric profile using content analysis, supported by large language models, specifically Ollama. A mathematical model was developed to filter out irrelevant publications based on the researcher's scientometric profile. The system for collecting and analyzing scientometric indicators was implemented, and experimental studies were conducted using profiles of researchers from West Ukrainian National University.</p>
      </abstract>
      <kwd-group>
        <kwd>intelligent information system</kwd>
        <kwd>scientometrics</kwd>
        <kwd>researcher</kwd>
        <kwd>content analysis methods</kwd>
        <kwd>large language model</kwd>
        <kwd>irrelevant publications</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>only allows you to collect information from scientific databases about publications, projects, grants
and participation in scientific events, but also forms a profile of a scientist, determining his
scientific interests. Using this profile, the system is able to filter irrelevant publications,
automatically assessing their relevance to the scientist's interests. This decision contributes to
increasing the efficiency of scientific activity, allowing to focus attention on really important and
relevant scientific achievements.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Algorithms and approaches for selecting keywords and determining the researcher's scientific interests</title>
      <p>Modern research actively uses algorithms for automatic analysis of text data to select keywords
that reflect the main scientific interests of the researcher. The development of such approaches is
aimed at simplifying the process of collecting, analyzing and systematizing scientific materials,
which allows not only to identify the main areas of work, but also to identify interdisciplinary
connections.</p>
      <p>The main methods used to analyze texts for the purpose of extracting keywords can be divided
into several categories:</p>
    </sec>
    <sec id="sec-3">
      <title>3. Statistical methods</title>
      <p>
        One of the basic approaches is to calculate the frequency with which terms are used in texts. The TF-IDF (Term Frequency-Inverse Document Frequency) metric is the most popular of the statistical methods: it takes into account both the frequency of a term in a document and its significance in the context of the entire corpus of texts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This increases the accuracy of extracting significant terms, since frequent but uninformative words receive less weight.
      </p>
      <p>Figure 1 shows an example of the implementation of the TF-IDF metric in the Python
programming language using the scikit-learn library.</p>
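      <p>A minimal sketch of such a TF-IDF computation with scikit-learn, over an illustrative three-abstract corpus rather than the exact data of Figure 1, might look like this:</p>
      <preformat>
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative corpus of three article abstracts (not the data from Figure 1)
corpus = [
    "Interval analysis methods for parameter identification of dynamic models.",
    "Deep learning models for named entity recognition in scientific texts.",
    "Web scraping techniques for collecting researcher metadata from websites.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Print the top 10 keywords of the first abstract by TF-IDF weight
terms = np.array(vectorizer.get_feature_names_out())
weights = tfidf.toarray()[0]
for idx in weights.argsort()[::-1][:10]:
    if weights[idx] > 0:
        print(f"{terms[idx]}: {weights[idx]:.3f}")
      </preformat>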
      <p>As a result of executing the code, we get a table with the top 10 keywords and their TF-IDF values.</p>
      <p>The TF-IDF value of each keyword reflects its weight in the context of the article's abstract: the higher the TF-IDF value, the more important the term is for this text.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Rule-based methods</title>
      <p>
        Rule-based approaches, such as Named Entity Recognition (NER), allow the extraction of certain
categories of words, such as names of organizations, names of people, geographic locations, and
other important entities [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In the Python programming language, you can use the transformers
library from Hugging Face, which allows you to load a pre-trained model for recognizing named
entities (Fig. 2).
      </p>
      <p>As you can see from the code above, we use the pipeline method with the pre-trained dbmdz/bert-large-cased-finetuned-conll03-english model, which is specially tuned for Named Entity Recognition (NER). The aggregation_strategy="simple" parameter aggregates the results for greater convenience.</p>
      <p>The next step is to run NER on the text. This produces a list of entities with the specified types (e.g., organizations, technology names, scientific concepts).</p>
      <p>After that, keyword filtering is performed by selecting entities that may be relevant, for example ORG (organizations) and MISC (miscellaneous terms such as technologies or scientific concepts).</p>
      <p>After passing through all the stages, we get a list of keywords selected from the text; in our case it is: Google Cloud and MapReduce.</p>
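      <p>Putting these stages together, a compact sketch of the pipeline, assuming the model named above and an illustrative input sentence, might look like this:</p>
      <preformat>
from transformers import pipeline

# Load the pre-trained NER model mentioned in the text
ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",
)

# Illustrative input text
text = "We ran MapReduce jobs on Google Cloud to index the publications."
entities = ner(text)

# Keep only entity types that may serve as keywords:
# ORG (organizations) and MISC (technologies, scientific concepts)
keywords = [e["word"] for e in entities if e["entity_group"] in ("ORG", "MISC")]
print(keywords)
      </preformat>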
    </sec>
    <sec id="sec-5">
      <title>5. Natural language processing (NLP) models</title>
      <p>Thanks to the development of natural language processing methods and the emergence of deep models such as BERT, GPT, and others, it became possible to significantly improve the accuracy of text analysis [8]. These models take into account the context of words, which makes it possible not only to highlight keywords but also to understand their relationships and semantic meaning.</p>
      <p>Figure 3 shows a code fragment for assigning categories to articles based on their abstracts using the ready-made facebook/bart-large-mnli model from the Transformers library.</p>
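      <p>A minimal sketch of such zero-shot categorization, with illustrative candidate categories rather than those of Figure 3, might be:</p>
      <preformat>
from transformers import pipeline

# Zero-shot classifier based on the facebook/bart-large-mnli model
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

abstract = (
    "The paper proposes methods and software tools for developing a "
    "scientometric profile of a researcher using content analysis."
)
# Illustrative candidate categories (assumptions, not those of Figure 3)
categories = ["scientometrics", "machine learning", "medicine", "economics"]

result = classifier(abstract, candidate_labels=categories)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.2f}")  # confidence level per category
      </preformat>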
      <p>In Figure 4, the output shows which categories most closely match each text, as well as the
confidence level of the model corresponding to each category.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Text vectorization</title>
      <p>To vectorize text and create its numerical representation, you can use the Word2Vec or Doc2Vec methods from the gensim library in Python. Word2Vec creates vectors for individual words, while Doc2Vec produces a vector representation for an entire document [9] (Fig. 5).</p>
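      <p>A small sketch of Doc2Vec vectorization with gensim, using three illustrative abstract texts and assumed training parameters, could look like this:</p>
      <preformat>
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Three illustrative abstract texts
texts = [
    "Interval methods for identification of model parameters.",
    "Methods of interval analysis for parameter identification of models.",
    "Web scraping for collecting researcher metadata.",
]
# Each document gets a tag so its vector can be looked up later
docs = [TaggedDocument(words=t.lower().split(), tags=[i]) for i, t in enumerate(texts)]

# Training parameters are illustrative assumptions
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)

vectors = [model.dv[i] for i in range(len(texts))]
print(vectors[0][:5])  # first five components of the first document vector
      </preformat>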
      <p>As a result of executing such code, we get a vector representation for the three abstract texts, as shown in Figure 6.</p>
      <p>In the future, the obtained vectors can be used to compare the similarities between documents
or to cluster documents based on topics. For example, we can calculate cosine similarity between
vectors to find out how similar documents are to each other (Fig. 7).</p>
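      <p>For illustration, cosine similarity can be computed directly from its definition; the two document vectors below are made up:</p>
      <preformat>
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cos(theta) = (a . b) / (|a| * |b|)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up document vectors for illustration
v1 = np.array([0.12, 0.80, 0.33, 0.05])
v2 = np.array([0.10, 0.78, 0.35, 0.07])
print(f"cosine similarity: {cosine_sim(v1, v2):.2f}")  # close to 1: very similar
      </preformat>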
      <p>The cosine similarity value of two documents of 0.98 indicates that these documents have a very
high level of similarity in terms of their vector representations. Cosine similarity measures the
angle between the vectors of two texts: a value close to 1 means that the vectors are nearly parallel,
indicating a high degree of similarity between the texts.</p>
      <p>These methods can be used separately or in combination to achieve more accurate results in determining a researcher's key scientific interests. Their use makes it possible to automate the analysis of scientific activity, which in turn contributes to forming a comprehensive researcher profile that reflects the dynamics of their scientific work and interdisciplinary connections.</p>
      <p>Each of the described methods has its own area of application and can complement the others in complex text analysis tasks. In the next section, we will look at how the Ollama platform and its powerful language models can be used to identify keywords in text. This approach makes it possible to apply the latest deep learning capabilities to improve the accuracy of extracting relevant terms and analyzing complex textual data.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Methodology for creating a scientometric portrait of a scientist using Ollama large language models</title>
      <p>A researcher's profile is a comprehensive description of the researcher's professional activities, scientific achievements, and interests. It includes such key elements as name, surname, position, academic title, scientific interests, number of published works, and participation in scientific grants and projects. Forming a scientist's profile is an important task, since the profile can be used to solve problems such as automated filtering of publications that match the researcher's scientific interests, or optimized selection of a scientific supervisor for young scientists and graduate students whose research topics coincide with the supervisor's scientific activity.</p>
      <p>To form a scientist's profile, it is first necessary to collect basic metadata, which becomes the foundation for further processing. Web scraping can help here, allowing basic information to be collected from the official website of the organization where the scientist works. This method provides automated extraction of such data as name, surname, position, academic title, range of scientific interests, and links to the author's scientometric profiles (Scopus, Web of Science, ORCID, Google Scholar, DSpace).</p>
      <p>The use of web scraping at the initial stage provides automatic filling of the profile with
publicly available information, which significantly reduces the time spent on manual data
collection and creates an accurate starting point for further analysis.</p>
      <p>To implement the web scraping process, you can use specialized libraries that automatically read and extract information from web pages [10,11]. For example, using the Cheerio library in JavaScript, it is possible to retrieve and process the HTML content of a page, extracting the required metadata such as name, title, academic interests, etc. The following example demonstrates the basic code for obtaining information about a scientist from the official website of the West Ukrainian National University, focusing on the necessary profile elements.</p>
      <p>Figure 8 shows a fragment of the code for parsing the metadata of scientists from the official
website of the organization.</p>
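      <p>The code in the figure uses Cheerio in JavaScript; an equivalent sketch in Python with requests and BeautifulSoup, using hypothetical CSS selectors rather than the site's actual markup, might be:</p>
      <preformat>
import requests
from bs4 import BeautifulSoup

# Hypothetical profile URL and CSS selectors, for illustration only;
# the real markup of the university site will differ.
URL = "https://www.wunu.edu.ua/staff/example-profile"

html = requests.get(URL, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

profile = {
    "name": soup.select_one(".profile-name").get_text(strip=True),
    "position": soup.select_one(".profile-position").get_text(strip=True),
    "interests": [li.get_text(strip=True) for li in soup.select(".interests li")],
    "orcid": soup.select_one("a[href*='orcid.org']")["href"],
}
print(profile)
      </preformat>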
      <p>Now that we have a basic set of metadata from web scraping, we can move on to the next step: fleshing out the scientist's profile using Ollama's large language models.</p>
      <p>Large Language Models (LLMs) are a powerful tool for analyzing and processing textual data thanks to their ability to understand context and extract meaningful units.</p>
      <p>The main advantage of Ollama is the ability to run and manage large language models (LLMs) locally on a computer, without the need for cloud services. This increases data confidentiality, reduces costs, and gives users full control over information processing [10].</p>
      <p>The models presented in the Ollama platform are specialized in the processing of scientific texts
and have a wide range of applications, such as automatic text classification, extraction of keywords
and phrases, identification of scientific interests and creation of a generalized profile.</p>
      <p>Table 1 provides a comparative analysis of the major language models supported by the Ollama
platform [12,13].</p>
      <p>The table shows the main features of the models and their advantages and disadvantages, allowing an appropriate model to be chosen for a particular task.</p>
      <p>[Table 1, recoverable fragment] Phi-3 (1.4B parameters, ~2.8 GB): specialized in research and scientific tasks, high accuracy; may be less effective in general tasks and needs tuning.</p>
      <p>Figure 9 demonstrates the process of forming a profile of a scientist, which includes the main
stages: data collection, analysis of text documents using Ollama models, parsing of scientific
interests, classification of information and final creation of a profile.</p>
    </sec>
    <sec id="sec-8">
      <title>8. A mathematical model of filtering irrelevant publications based on the profile of a scientist</title>
      <p>In scientometric databases, a problem often arises when, because the author's last name, first name, and patronymic coincide with someone else's, publications that do not belong to the scientist are added to the scientist's profile. This distorts the indicators of scientific activity and complicates an objective assessment of the researcher's contribution. Developing a mathematical model for filtering irrelevant publications based on the scientist's profile makes it possible to solve this problem effectively. To build mathematical models under conditions of limited data samples, it is advisable to use methods based on interval analysis [14-18]. Using detailed scholarly profile data, such as the author's research interests, affiliations, and other unique characteristics, it is possible to accurately identify the publications that actually belong to a particular scholar. This increases the accuracy of scientometric indicators and contributes to a more objective analysis of scientific activity.</p>
      <p>The process of building a mathematical model for filtering irrelevant publications based on the
profile of a scientist can be divided into several steps:</p>
      <p>Step 1. Formulation of the author's scientific interests.</p>
      <p>The author's scientific interests can be represented as a vector of keywords that describes the main areas of research. Let $I = (k_1, k_2, \ldots, k_n)$, where $k_i$ is a keyword or phrase describing the author's interests.</p>
      <p>Step 2. Vector representation of the publication.</p>
      <p>Each publication can also be represented as a vector of keywords. Let $P_j = (p_1, p_2, \ldots, p_m)$, where $p_i$ is a keyword or phrase associated with publication $P_j$.</p>
      <p>Step 3. Calculating relevance using cosine similarity.</p>
      <p>To measure the similarity between the author's scientific interests $I$ and the publication vector $P_j$, one can use the cosine similarity:</p>
      <p>$$\mathrm{relevance}(I, P_j) = \frac{\sum_{i=1}^{n} k_i \cdot p_i}{\sqrt{\sum_{i=1}^{n} k_i^2} \cdot \sqrt{\sum_{i=1}^{m} p_i^2}}, \qquad (1)$$</p>
      <p>where $\cdot$ is the scalar product operation. The value of $\mathrm{relevance}(I, P_j)$ ranges from 0 to 1, where a value close to 1 means high relevance.</p>
      <p>Step 4. Filtering of irrelevant publications.</p>
      <p>A publication $P_j$ is relevant if $\mathrm{relevance}(I, P_j) \geq T$. If the value of $\mathrm{relevance}(I, P_j)$ is less than the threshold $T$, the publication is considered irrelevant and is filtered out. So, as we can see, the resulting model allows irrelevant publications to be filtered out automatically based on the scientific profile of the author.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Software implementation of the system for collecting and analyzing scientific and scientific-pedagogical activities of the academic team</title>
      <p>In the modern information society, it is important to have effective tools for collecting and analyzing scientific activity [19-21]. The developed system is aimed at automating the collection of data on the scientific and scientific-pedagogical achievements of the academic staff, filtering these data based on relevance to researchers' interests, and generating reports by university, faculty, and department, which improves the quality of management and planning of scientific work.</p>
      <p>Conventionally, our system can be divided into several interacting modules, namely:
1. The authorization and authentication module, which ensures secure user access to the
system, access management and protection of personal data;
2. Data collection module: responsible for obtaining information from scientometric databases
(for example, Scopus, Crossref, NRAT) and the profile of a scientist;
3. Data processing and analysis module: cleans, normalizes and pre-processes collected data
for preparation for further analysis;
4. Filtering module: implements a mathematical filtering model using machine learning
algorithms and criteria defined on the basis of the scientist's profile;
5. Reporting module: provides an opportunity to generate a general report on the scientific
activity of the university, faculty or department;
6. User interface: provides user interaction with the system, providing the ability to perform
CRUD operations with the main entities (for example, publications, dissertations, grants,
projects, scientific activities).</p>
      <p>The system architecture was implemented using modern technologies that ensure reliability, performance, and flexibility. The core technology stack is based on JavaScript as both the client-side and server-side programming language, which helps ensure codebase consistency and eases application development. The server part was developed using Node.js, which allows high-performance, scalable server applications to be created that can process requests in real time.</p>
      <p>To optimize the interaction between the client and the server, GraphQL is used: it gives the client the ability to request only the data it needs, which reduces the load on the network and server resources and, in turn, improves system performance when building complex queries.</p>
      <p>The MongoDB database acts as the data store, providing speed and flexibility when working with large volumes of unstructured data. It also supports efficient work with the various data types used to describe the structure of data from different scientometric information systems, and it scales easily in accordance with load and needs.</p>
      <p>In addition, the Ollama platform is integrated into the system, providing a mechanism for working with various machine learning and artificial intelligence models. Thanks to these capabilities, the system can determine the relevance of publications to a scientist's profile more accurately, capturing complex relationships in the data.</p>
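      <p>For illustration, the relevance-extraction step might query a locally running Ollama instance through its REST endpoint roughly as follows; the model name and prompt are assumptions:</p>
      <preformat>
import json
import urllib.request

# Standard endpoint of a locally running Ollama instance
OLLAMA_URL = "http://localhost:11434/api/generate"

def extract_interests(abstract: str, model: str = "llama3") -> str:
    # The model name and prompt are illustrative assumptions
    prompt = (
        "List the key scientific interests in this abstract "
        "as comma-separated keywords:\n" + abstract
    )
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(extract_interests("Methods for building a scientometric profile of a researcher."))
      </preformat>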
      <p>The system interface is built on the principles of intuitiveness and ease of use, giving users convenient access to the main functionality without the need for additional staff training.</p>
      <p>Figure 10 shows the initial screen of the page with the authorization and authentication forms.</p>
      <p>As the screenshot shows, the authorization form is quite simple: the user only needs to enter the e-mail address and the password created during registration in the system (Fig. 11).</p>
      <p>The registration form requires the user to fill in basic information about themselves, such as surname, first name, patronymic, position, faculty, department, etc. When registering, the employee must also specify their identifiers in other scientometric databases to enable automated information collection. Another key field of this form is the scientist's last name and first name in Latin script, as these data are needed to search for publications in the Crossref database.</p>
      <p>After successful authorization in the system, the user lands on the "Overview" page (Fig. 12), where they can see quantitative indicators of publication activity.</p>
      <p>As can be seen from the screenshot above, the user has the opportunity to filter all indicators by
faculty, department and publication period.</p>
      <p>It is also possible to generate a report for a specific division by clicking the "Download report" button. This option is available only to employees with appropriate access rights (for example, the head of a department, the dean of a faculty, or the vice-rector for scientific work).</p>
      <p>If the user has entered the system and no data has been added yet, they will see a welcome window and a button that launches synchronization of all publication activity from other scientometric databases (Fig. 13).</p>
      <p>After pressing the "Synchronization" button, a window will open with a description of the
databases in which information will be searched (Fig. 14).</p>
      <p>By going to the "Publications" section, the user will be able to view all the publications that the
system managed to find (Fig. 15).</p>
      <p>If the system could not find one of the author's publications, it can be added manually by pressing the corresponding button. A form will then open in which the user needs to fill in all the necessary fields (Fig. 16).</p>
      <p>The page for viewing the dissertations defended by the user, which also contains a form for adding entries, has a similar appearance (Fig. 17).</p>
      <p>The R&amp;D funding page displays a list of research and development (R&amp;D) funded projects,
including the name, manager, terms, amount of funding, type of funding, and category of each
project (Fig. 22).</p>
      <p>It is also possible to quickly search by faculty, department, and time period. As mentioned earlier, the system supports automatic creation of a scientist's profile, which can later be used for publication filtering tasks.</p>
      <p>As can be seen from the figure above, the user has the opportunity not only to view their profile, but also to edit the necessary information.</p>
    </sec>
    <sec id="sec-10">
      <title>10. Conclusion</title>
      <p>This work emphasizes the need for automation of the collection, processing, and analysis of publication activity in the modern scientific environment. The increase in the volume of scientific information complicates the manual control and analysis of data, especially in large academic groups. The developed system described in this paper not only provides automated collection of information from scientometric databases such as Scopus and Web of Science, but also forms a profile of a scientist, which includes information about their scientific interests, publications, grants, and participation in scientific events. This makes it possible to optimize the processes of managing scientific activities, making them more efficient and objective.</p>
      <p>An important part of the work is the use of modern algorithms for automatic text analysis, such
as TF-IDF, Named Entity Recognition (NER) and text vectorization, which contribute to the
selection of keywords and the identification of scientific interests of researchers. The application of
deep language models, such as BERT, GPT, as well as the capabilities of the Ollama platform for
localized processing of big data, allows you to achieve high accuracy in text analysis, taking into
account the semantic context and the relationship between terms.</p>
      <p>In addition, a mathematical model for filtering irrelevant publications based on the scientist's profile is built in this work; it solves the problem of separating the author's own works from misattributed ones, thereby significantly increasing the accuracy of scientometric indicators.</p>
      <p>Also, the use of a vector representation of scientific interests and publications with the calculation of cosine similarity is proposed for the first time. This approach contributes to an objective assessment of scientific contributions, reducing the risk of inaccuracies caused by coincidental matches of surnames or errors in databases.</p>
      <p>Another important component of this work is the integration of the Ollama platform into the developed system, which allows language models to be used for accurate identification of scientific interests, as well as for automatic categorization and clustering of scientific materials. This greatly simplifies the preparation of reports for scientific institutions, making it possible to quickly obtain aggregated data on the activities of the university, its faculties, and departments.</p>
    </sec>
    <sec id="sec-11">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly to check grammar and spelling, paraphrase, and reword the text. These tools helped identify and correct grammatical errors, typos, and other writing mistakes, improving the clarity and professionalism of the text. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication's content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.-M.</given-names>
            <surname>Petroșanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pîrjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tăbușcă</surname>
          </string-name>
          ,
          <article-title>Tracing the influence of large language models across the most impactful scientific works</article-title>
          ,
          <source>Electronics</source>
          <volume>12</volume>
          .24 (
          <year>2023</year>
          )
          <fpage>4957</fpage>
          . https://doi.org/10.3390/electronics12244957.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Lutsiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Maksymyuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Beshley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Lavriv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Andrushchak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vokorokos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gazda</surname>
          </string-name>
          ,
          <article-title>Deep semisupervised learning-based network anomaly detection in heterogeneous information systems</article-title>
          , Comput.,
          <source>Mater. &amp; Contin. 70.1</source>
          (
          <year>2022</year>
          )
          <fpage>413</fpage>
          -
          <lpage>431</lpage>
          . https://doi.org/10.32604/cmc.2022.018773.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Sachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kochan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Turchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Tymchyshyn</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Vasylkiv</surname>
          </string-name>
          ,
          <article-title>"Intelligent nodes for distributed sensor network,"</article-title>
          <source>IMTC/99. Proceedings of the 16th IEEE Instrumentation and Measurement Technology Conference (Cat. No.99CH36309)</source>
          , Venice, Italy,
          <year>1999</year>
          , pp.
          <fpage>1479</fpage>
          -
          <lpage>1484</lpage>
          vol.
          <volume>3</volume>
          . https://doi.org/10.1109/IMTC.1999.776072
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Lytvyn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vysotska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pukach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Nytrebych</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Demkiv</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Malanchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sachenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kovalchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Huzyk</surname>
          </string-name>
          ,
          <article-title>Analysis of the developed quantitative method for automatic attribution of scientific and technical text content written in Ukrainian</article-title>
          ,
          <source>EasternEuropean J. Enterp. Technol. 6</source>
          .
          <issue>2</issue>
          (
          <issue>96</issue>
          ) (
          <year>2018</year>
          )
          <fpage>19</fpage>
          -
          <lpage>31</lpage>
          . https://doi.org/10.15587/1729-4061.2018.149596.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaki Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Rodríguez</given-names>
            <surname>Díaz</surname>
          </string-name>
          ,
          <article-title>A methodology for machine-learning content analysis to define the key labels in the titles of online customer reviews with the rating evaluation</article-title>
          ,
          <source>Sustainability</source>
          <volume>14</volume>
          .15 (
          <year>2022</year>
          )
          <fpage>9183</fpage>
          . https://doi.org/10.3390/su14159183.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Research on the TF-IDF algorithm combined with semantics for automatic extraction of keywords from network news texts</article-title>
          ,
          <source>J. Intell. Syst. 33.1</source>
          (
          <year>2024</year>
          ). https://doi.org/10.1515/jisys-2023-0300.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garg</surname>
          </string-name>
          ,
          <article-title>Named entity recognition (NER) and relation extraction in scientific publications</article-title>
          ,
          <source>Int. J. Recent Technol. Eng. (IJRTE) 12.2</source>
          (
          <year>2023</year>
          )
          <fpage>110</fpage>
          -
          <lpage>113</lpage>
          . https://doi.org/10.35940/ijrte.b7846.0712223.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] L. M. Pham, H. C. The, LNLF-BERT: transformer for long document classification with multiple attention levels, IEEE Access (2024) 1. https://doi.org/10.1109/access.2024.3492102.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] H. D. Abubakar, M. Umar, Sentiment classification: review of text vectorization methods: bag of words, tf-idf, word2vec and doc2vec, SLU J. Sci. Technol. 4.1&amp;2 (2022) 27–33. https://doi.org/10.56471/slujst.v4i.266.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H.-S. Lee, H.-S. Shim, Implementation of generative AI using metaverse-based LLM, Korea Ind. Technol. Converg. Soc. 29.2 (2024) 123–132. https://doi.org/10.29279/jitr.2024.29.2.123.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] M. Brown, A. Gruen, G. Maldoff, S. Messing, Z. Sanderson, M. Zimmer, Web scraping for research: legal, ethical, institutional, and scientific considerations, 2024. https://doi.org/10.48550/arXiv.2410.23432.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] D. P. Pau, F. M. Aymone, Forward learning of large language models by consumer devices, Electronics 13.2 (2024) 402. https://doi.org/10.3390/electronics13020402.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] C.-N. Hang, P.-D. Yu, R. Morabito, C.-W. Tan, Large language models meet next-generation networking technologies: A review, Future Internet 16.10 (2024) 365. https://doi.org/10.3390/fi16100365.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] M. Dyvak, P. Stakhiv, A. Pukas, Algorithms of parallel calculations in task of tolerance ellipsoidal estimation of interval model parameters, Bull. Pol. Acad. Sci. 60.1 (2012). https://doi.org/10.2478/v10175-012-0022-9.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] M. Dyvak, I. Voytyuk, N. Porplytsya, A. Pukas, Modeling the process of air pollution by harmful emissions from vehicles, in: 2018 14th international conference on advanced trends in radioelectronics, telecommunications and computer engineering (TCSET), 2018, pp. 1272–1276. https://doi.org/10.1109/TCSET.2018.8336426.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] N. Ocheretnyuk, I. Voytyuk, M. Dyvak, Y. Martsenyuk, Features of structure identification the macromodels for nonstationary fields of air pollutions from vehicles, in: Proceedings of international conference on modern problem of radio engineering, telecommunications and computer science, 2012, pp. 444–444.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Dyvak, Parameters identification method of interval discrete dynamic models of air pollution based on artificial bee colony algorithm, in: 2020 10th international conference on advanced computer information technologies (ACIT), 2020, pp. 130–135. https://doi.org/10.1109/ACIT49673.2020.9208972.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] M. Dyvak, A. Pukas, I. Oliynyk, A. Melnyk, Selection the “saturated” block from interval system of linear algebraic equations for recurrent laryngeal nerve identification, in: 2018 IEEE second international conference on data stream mining &amp; processing (DSMP), 2018, pp. 444–448. https://doi.org/10.1109/DSMP.2018.8478528.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] M. Pirnau, M. A. Botezatu, I. Priescu, A. Hosszu, A. Tabusca, C. Coculescu, I. Oncioiu, Content analysis using specific natural language processing methods for big data, Electronics 13.3 (2024) 584. https://doi.org/10.3390/electronics13030584.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] M. Gkevrou, D. Stamovlasis, Illustration of a software-aided content analysis methodology applied to educational research, Educ. Sci. 12.5 (2022) 328. https://doi.org/10.3390/educsci12050328.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] N. Le, D. Tran, R. Sturgill, Content analysis of three-dimensional model technologies and applications for construction: current trends and future directions, Sensors 24.12 (2024) 3838. https://doi.org/10.3390/s24123838.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>