Self-Reinforcing Effects on Trustworthiness of Intelligent Systems in Small Data Problem-Solving

Vladyslav Kuznetsov, Alexander Samoylov, Tetiana Karnaukh, Rostyslav Trokhymchuk and Anatoliy Kulias

1 Glushkov Institute of Cybernetics, 40, Glushkov ave., Kyiv, 03187, Ukraine
2 Taras Shevchenko National University of Kyiv, 63/13, Volodymyrska str., Kyiv, 01601, Ukraine



Abstract
In this work, we investigate trustworthiness in large language models, in particular in knowledge mining tasks. As part of the experimental research, we conducted an anonymized study of language models and analyzed how the size of the context, the model memory and the number of interactions affect the estimated value of trust expressed through a quality assessment indicator. The trust estimate was formed from the assessment of the quality of the answers given by the models to the posed questions, on a scale from 1 to 5, reflecting the completeness and conciseness of the model answers. As part of the experiments, 11 large language models, with the number of parameters ranging from 1.5 to 13 billion, were studied. For the quality assessment on the knowledge mining task, the questions covered the area of trustworthiness, using standardized definitions of trust categories from the ISO/IEC TR 24028 standard. During the experiments, we noticed that each interaction is crucial for trust assessment, because the estimated value of trust either remained the same, increased or decreased. This showed the complex nature of the interactions and a wider range of values of trust in artificial intelligence (AI). We inferred from the experiments that the value of trust is very likely dependent on the previous context and the memory of interactions. The effects of trust reinforcement showed themselves in nearly all large language models tested in the experiments, with the best results obtained by language models with 4 and 8 billion parameters. The study also covered aspects of model performance and efficiency based on language encoding. As a result of the experiments, we suggest a number of requirements for personal, as well as public, AI systems given the case study example.

Keywords
Trustworthiness, Intelligent systems, Large language models, Self-reinforcing effects, Small data



                                1. Introduction
Recently, with the rapid development of machine intelligence algorithms such as Large Language Models (LLMs) [1], as well as greater market availability, in particular in the area of parallel computing tools, it has become possible to deploy large machine intelligence models on local machines as personal artificial intelligence elements. As opposed to commonly known commercial platforms such as Google Cloud, Amazon AWS, Microsoft Azure, or conditionally free ones like Hugging Face, one can run a personal AI. This has raised developers' interest in AI models that can be run locally on personal computers as personal artificial intelligence for natural language processing, recognition and synthesis [2,3].


Information Technology and Implementation (IT&I-2024), November 20-21, 2024, Kyiv, Ukraine
Corresponding author.
These authors contributed equally.
kuznetsow.wlad@gmail.com (V. Kuznetsov); SamoylovSasha@gmail.com (A. Samoylov); tkarnaukh@unicyb.kiev.ua (T. Karnaukh); trost@knu.ua (R. Trokhymchuk); anatoly016@gmail.com (A. Kulias)
0000-0002-1068-769X (V. Kuznetsov); 0000-0002-7423-5596 (A. Samoylov); 0000-0001-6556-1288 (T. Karnaukh); 0000-0003-3516-9474 (R. Trokhymchuk); 0000-0002-1322-4551 (H. Kudin); 0000-0003-3715-1454 (A. Kulias)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




Full control over the language model process allows, firstly, increasing the transparency of the process, both from the point of view of (algorithmic) reliability and from the point of view of data security, without the risks of transferring sensitive data to third parties or companies.
    Currently, there are numerous tools for language model training, post-training and deployment, ranging from libraries and command line utilities up to more complex applications such as GPT4All or Nvidia ChatRTX [4]. However, further deployment of the models was limited by consumer-grade hardware [5] and the inability to use high machine precision (such as FP16 or FP32), so developers experimented with memory optimizations: lower precision, trimming weights, and creating reduced versions, for instance fp8 GGUF, Q_4, bf16, nf4 (mixed precision) [6,7]. Such optimizations of the models, in fact, led to the appearance of a separate class of models, small language models, which are of the greatest interest for this study [8,9]. It is worth noting that reducing the size of a language model has its drawbacks, since the number of parameters of the language model, as well as its numerical precision, directly affects the accuracy of the model and its ability to predict the flow of the dialog, the amount of short-term memory, attention, and the ability to retain context.
    These limitations point to one critical aspect in the testing and evaluation of artificial intelligence models [10]. While, with an increase in the number of parameters in classical methods, it is possible to assess how this change will affect the accuracy of the answer, in the evaluation of the performance of language models a large role is played by the subjective factor. Hence, when evaluating a model in terms of its answers, the main role is played by the competence of the user and his or her ability to ask clearly formulated questions in order to receive equally clear and specific answers [11,12]. The accuracy value itself is a very subjective assessment which is difficult to formalize. In order to evaluate the performance of large language models, it is necessary to involve experts who evaluate model responses based on pairwise comparisons of one language model with others, or to utilize user surveys that evaluate model performance based on satisfaction with the answers [13].
    One should note that such approaches have their advantages; however, the presence of a person in the evaluation process introduces its own subjective factor (human in the loop), which raises questions about the feasibility of such evaluations, as well as about their trustworthiness based on the selection criterion [14]. In our opinion, a proven way is to assess the trustworthiness of human-machine interaction (human-AI in this case) using a widely accepted industry standard, ISO/IEC TR 24028 [15].
    This standard aims to formalize the concept of trust in AI systems as well as the ability of AI to provide clear and understandable answers that match the user's expectations, understanding and priorities. Within the ISO/IEC TR 24028 standard, the concept of trustworthiness in AI systems is formalized in the form of a mathematical apparatus based on set theory [16], an algebra of concepts that uses ontologies to clearly define trustworthiness categories. This made it possible to obtain a clear and understandable set of tools and methods for the assessment of trustworthiness in artificial intelligence.
    For a clear understanding of the requirements for trust in AI systems [17], the standard describes a number of requirements that can also be applied to large language models (LLMs). However, as a drawback, the standard does not pay enough attention to the ability of LLMs to behave independently, with certain effects of reinforcement or amplification of the interaction over time [18], in contrast to systems whose trustworthiness is fixed in time. It is of great importance to take this indicator into account, along with other indicators such as accuracy, stability, and (algorithmic) reliability [19]. This creates a number of questions that arise when evaluating LLMs, which we decided to ask prior to formulating the tasks of the study:

   •   Can AI be trusted if it gives correct but not complete answers?
   •   Does training sample size, number of network parameters, or other constraints (e.g., speed,
       memory, short-term memory size, context size) affect confidence in the model?
   •    Is trust in AI, in the case of LLMs, a static value or is it variable over time?

    It is clear that the difference between systems that do not have a short-term memory and attention mechanism and those that do is quite noticeable in a practical assessment of their performance [20]. However, questions may arise when comparing models with different degrees of attention. For example, in AI systems that have no attention mechanism and no short-term memory, unlike LLMs, errors and their effect on accuracy and trustworthiness can be expressed numerically within a given dataset and remain valid for this set on small and large data after numerous experiments [21]. In the case of LLMs, there is an effect of uncertainty and ambiguity. This happens because in LLMs each subsequent interaction is attended to a number of previous interactions, which is limited by the ability to retain context and the amount of connectivity of the weights (for example, via the self-attention mechanism) [22]. The presence of spatio-temporal dynamics of trustworthiness, realized as a dependency on both the number of interactions and their type (expressed by a certain textual or numerical vector or matrix), creates potential trust-reinforcing effects: with an increase in the number of interactions, the amount of trust increases or decreases over time, impacting the overall confidence and trustworthiness of the studied model. Because of that, one may require a set of mechanisms for evaluating the self-reinforcing effects of trust over time [23].
    In this work, we propose to study the aspects of trustworthiness of AI in human-machine interaction using the example of a sequential dialogue between an LLM and a user. We propose to assess the problem of dynamic trustworthiness of LLMs by fulfilling the following tasks:

   •   to propose an estimate of the self-reinforcing effects of trust in AI systems,
   •   to conduct a practical study of the qualitative and quantitative interactions which may have influenced the amount of trust over time,
   •   to define the influence of model limitations on its theoretical accuracy and trust, as well as the scope of interaction and its impact on trust as a result.

   This, in turn, will allow one to determine trust in AI and allow it to be analyzed and researched. A case study on small data sets, as well as personal interaction of users with AI, will allow one to determine how the volume of training data and the number of parameters affect the behavior of small and medium-sized LLMs. For this purpose, in the following sections of this work, we solve the proposed problems by conducting experimental research of AI systems on the example of the knowledge mining problem and by assessing estimated values of confidence in AI during these experiments. Practical recommendations for trust evaluation based on user satisfaction are developed, and the proposals of the ISO/IEC TR 24028 standard are utilized.

2. Experimental Study of Large Language Models

In the experimental studies, we proposed to investigate the accuracy and quality of the responses of various small LLMs by asking clearly formulated questions related to the topic of trustworthiness. This makes it possible, firstly, to fundamentally assess the ability of AI to understand trustworthiness, and secondly, to use the ISO/IEC TR 24028 standard not only for the development of the assessment methodology, but also for determining the completeness of the answers with respect to the standard and its definitions. Also, as part of the experiments, we proposed to investigate how trust in an LLM changes with the number of interactions, and to build graphs of the dependence of trust on the number of interactions, with special attention to dependencies and trends, disturbances and fluctuations of trust, and the general resistance of the LLM to fluctuations in the input data (for example, the length of the text, its content, the expected user response, etc.). By analyzing these factors, we will be able to better understand how trust is reinforced during long-term interaction.
   As part of the experiments, each of the tested models was asked a number of clear and understandable questions on the subject of trustworthiness in AI. Answers were rated on a scale from 1 to 5, where 1 corresponded to incomplete and unclear (irrelevant) answers, and 5 to well-structured answers that contained a combination of ideas, original definitions with interpretations of generally accepted definitions, and retained the memory (context) of the conversation to predict the questions and the type of answers the user expects to receive. To reinforce this, feedback was introduced into each question, containing the user's assessment of and satisfaction with the previous answer, in order to reinforce useful actions of the LLM.
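   To make the protocol concrete, the sketch below shows one way such a feedback-augmented dialogue loop could be organized in Python. It is only an illustration of the procedure described above: the functions ask_model and rate_answer are hypothetical placeholders (the former standing for a call to the deployed LLM, the latter for the user's 1-5 rating), and the feedback wording is not the exact text used in the experiments.

def run_dialogue(questions, ask_model, rate_answer):
    """Ask each question, prefixing it with feedback on the previous answer."""
    history = []       # (question, answer, rating) triples
    feedback = ""      # no feedback before the first question
    for question in questions:
        prompt = feedback + question
        answer = ask_model(prompt)      # call to the deployed LLM (placeholder)
        rating = rate_answer(answer)    # user's satisfaction score, 1..5
        history.append((question, answer, rating))
        # The next question carries the user's assessment of the previous answer,
        # reinforcing useful behaviour of the LLM.
        feedback = ("My satisfaction with your previous answer was %d/5. "
                    "Please take this into account. Next question: " % rating)
    return history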
   The Nvidia-based LLM deployment environment Chat RTX was used, with LLM checkpoints downloaded from the official repositories of the LLM developers. The models were deployed on an Nvidia RTX 4060 Ti video accelerator in a Windows 11 computer with a total memory of 32 GB (both system memory and video accelerator memory were used for deployment).
   In order to reduce the influence of bias towards particular LLMs, an anonymized study of the models was conducted: the model files (checkpoints) uploaded to the Chat RTX system contained in their names only the number of parameters and the serial number of the model, without specifying the architecture or the developer of the model.
   The experiment was conducted in two main stages, with one additional stage:

   •   At the 1st stage, the effects of trust reinforcement were investigated in models with 1.5-4 billion parameters (small LLMs).
   •   At the 2nd stage, these effects were studied in more detail in models with 7-13 billion parameters (medium-sized LLMs).
   •   As an additional stage, an experimental study was conducted to test language encoding density in monolingual models with 7-9 billion parameters.

   This approach made it possible, at the 1st stage, to focus on the effects that occurred at the micro level and to determine their types and features, while at the 2nd stage effects at the macro level were studied, when not only quantitative changes occurred (improvement of trust with the number of interactions), but also qualitative ones (the system could anticipate the user's questions and adapt to their requirements and the format of the answers). This made it possible to determine the effect of model size on response quality, context retention, and confidence build-up over time.

2.1. Stage 1 of the study: testing small LLMs
To conduct the research, an assessment was made of the increase or decline of trust over time. Each answer, regardless of its quality, changed the trust value by 5 percentage points (absolute) relative to the previous answer: 5 points were added when trust increased and subtracted when it decreased. All models were given a trust value of 70 percentage points before the first question, as a conservative estimate, regardless of their performance, in order to reduce the influence of model bias or the human factor. Thus, if a model gave an answer better than expected (relative to the conservative score of 70 points) at the 1st step or question, it was immediately given a score of 75 points, and similarly at subsequent steps (questions) whenever confidence changed in one or the other direction (positive or negative).
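   The update rule above can be summarized in a few lines of code. The sketch below is only an illustration under our own assumptions: each answer's judgment is encoded as +1 (trust increased), 0 (unchanged) or -1 (decreased), and the clipping of the value to the 0-100 range is an added safeguard; the paper itself only fixes the 70-point starting value and the 5-point step.

INITIAL_TRUST = 70   # conservative starting estimate, in percentage points
STEP = 5             # fixed absolute change per interaction

def trust_trajectory(judgments, initial=INITIAL_TRUST, step=STEP):
    """Return the trust value (in percentage points) after each interaction."""
    trust = initial
    trajectory = [trust]
    for j in judgments:                   # j is +1, 0 or -1 (assumed encoding)
        trust = max(0, min(100, trust + step * j))
        trajectory.append(trust)
    return trajectory

# Example: better than expected at step 1, unchanged at step 2, worse at step 3.
print(trust_trajectory([+1, 0, -1]))      # [70, 75, 75, 70]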
   In the first stage, the smallest model, with 1.5 billion parameters, was tested first. Overall, this model showed the ability to increase trust over time in the short term, with a sharp decline in trust after the 2nd question. In general, the model showed answer quality at the level of 1-2 points, with a peak of 2 points after the first two questions. However, after the 2nd question, the quality began to deteriorate, with a steady decline from the initial 70 points to 60.


    It was followed by a model with 3 billion parameters, which showed much better response quality and relatively high stability (with minor fluctuations in response quality and confidence). Overall, the model received approximately 4 points for its best answer, with an average rating of 2.5 points. The quality of the answers began to decline after question 11, reaching a minimum after question 15, whose answer received 2 points. In general, the trust value reached 80% for a short period, later decreasing to 70% (the initial trust value).
    In turn, the 4 billion parameter model consistently scored 4 on average in response quality (lowest 3, highest 5), with confidence increasing from the initial 70% to 80%, followed by swings around the 80% average with confidence scores that did not exceed 85%. In general, this model showed a much better understanding of the concept of trustworthiness in AI and was able to generate relatively original answers related to the topic. Compared to the previous two models, there was a clear trend of increasing confidence (trust) over time; however, the presence of trust fluctuations, as well as the inherent uncertainty of the maximum value, allowed us to give only conservative estimates based on a short set of questions.

2.2. Stage 2 of the study: testing of medium-sized LLMs
Since the general trends and effects of increasing trust were clear, in this experiment we focused more on the changes in the number of percentage points of trust with the number of interactions (using the example of 22 questions on the topic of trust in AI). If a model reached a trust plateau at either the upper or lower bound, no further questions were asked. Experiments for the entire question set were conducted separately for models that provided meaningful responses to all questions throughout the study. Eight models with 7 to 13 billion parameters were analyzed (including 6 models with 7 billion parameters, one with 8 billion and one with 13 billion, respectively). In order to compare the models, graphs were constructed with the question number on the horizontal axis and the trust value (in percentage points) on the vertical axis. The graph (fig. 1) uses the MxPy notation, where x denotes the serial number of the model and y denotes the number of parameters in billions. For example, M1P7 is model number 1 from the group of models with 7 billion parameters.
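   A graph of this kind can be reproduced from the recorded trajectories with a short plotting sketch such as the one below. The trajectory values used here are placeholders chosen only to illustrate the MxPy labelling and the axes; they are not the measured data shown in fig. 1.

import matplotlib.pyplot as plt

# Placeholder trust trajectories, one value per question, in percentage points.
trajectories = {
    "M1P7":  [70, 72.5, 75, 77.5, 77.5, 80, 77.5, 78],
    "M1P8":  [70, 77.5, 80, 82.5, 85, 87.5, 90, 92.5],
    "M1P13": [70, 75, 77.5, 80, 80, 82.5, 82.5, 84],
}

for label, trust in trajectories.items():
    plt.plot(range(1, len(trust) + 1), trust, marker="o", label=label)

plt.xlabel("Question number")
plt.ylabel("Trust, percentage points")
plt.legend()
plt.show()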
   Let us dwell in more detail on the graph (fig. 1) for the models with 7-13 billion parameters, namely the models that showed an increase in trust indicators over time:

   •   The M1P7 model (1st model with 7 billion parameters) showed reliable answer quality at the level of 4.5 points on average. The quality of the answers was kept at this level for at least 15 questions. Trust fluctuations relative to the trend were about 2.5% (absolute). The trend was horizontal, with an average trust level of 78%.
   •   Model M1P8 (1st model with 8 billion parameters) showed answer quality of 5 points on average. At the same time, the dependence of trust on the number of answers had a clear upward trend, which began at the level of 77.5% and ended at the level of 92.5%. The magnitude of the trust swings was minimal, at the level of 0.75% (absolute), with a potential asymptote at the level of 95% (see the estimation sketch after this list).
   •   Model M1P13 (1st model with 13 billion parameters) showed results approximately in the middle between models M1P7 and M1P8. The quality of the answers ranged between 4.5 and 4.75 points. The general trend was upward (not flat as in M1P7, but not as steep in slope as in the M1P8 model), with an asymptote at the level of 82-84% and trust swings at the level of 1.2% (absolute), which showed an ability to improve results over time.
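   The asymptote and trend figures quoted above can be estimated from a recorded trajectory by fitting a saturating curve. The paper does not prescribe a fitting procedure, so the exponential form trust(n) = a - (a - 70) * exp(-k * n) and the synthetic trajectory below are our own assumptions, shown only to make the notion of a "potential asymptote" concrete.

import numpy as np
from scipy.optimize import curve_fit

def saturating(n, a, k, t0=70.0):
    """Trust curve rising from the initial value t0 towards the asymptote a."""
    return a - (a - t0) * np.exp(-k * n)

questions = np.arange(1, 23)                      # 22 questions, as in Stage 2
trust = saturating(questions, 95.0, 0.15)         # synthetic, M1P8-like trajectory
trust = trust + np.random.normal(0, 0.75, trust.shape)  # small swings around the trend

(a_hat, k_hat), _ = curve_fit(saturating, questions, trust, p0=(90.0, 0.1))
print("Estimated asymptote: %.1f%%, rate: %.3f" % (a_hat, k_hat))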




Figure 1: Graph of the dependence of trust in large language models on the number of interactions.

   The most interesting result was that in all models the effect of interaction improvement decreased with each step, and the magnitude of this improvement became non-significant after 21 questions in almost all models, indicating that the effect of interaction improvement has certain limitations and cannot continue indefinitely. This indicates that, while model size is important, it is more critical to obtain trustworthy results in the first steps, which anchor the user's trust early on; trust then grows steadily in subsequent steps, allowing for predictable results over time.

2.3. Stage 3 of the study: impact of language encoding effectiveness on model performance in a monolingual context
The third part of the experiment was devoted to the influence of the effectiveness of text encoding in large language models on the completeness of their answers. This experiment was based on the hypothesis of a potentially more efficient encoding of text [24] in languages with hieroglyphic writing, such as Chinese and Japanese, and in languages whose writing does not contain vowels (for example, Arabic). The essence of the experiment consisted in a detailed analysis of the model's answers to questions posed with the help of a machine translator, after which the answers were compared with those of a model that had similar properties but was English-speaking. These models had the following parameters: the Japanese model - 8 billion parameters, the Chinese model - 9 billion parameters, the Arabic model - 7 billion parameters, the English model - 8 billion parameters (model M1P8 from the previous experiment). The image below (fig. 2) demonstrates part of a dialog conducted in English, consisting of a question and an answer dedicated specifically to the ISO/IEC TR 24028 standard.
    In general, the efficiency of encoding showed itself differently both in terms of text density and in the quality of answers compared to English. To determine the density, the entire dialogue with the language model was translated into English, and the encoding density was determined as the ratio of the number of characters in the text written in the given non-English language (without space characters) to the number of characters of the same fragment translated into English. The Japanese model had an encoding density of 54% (about 2 English characters for each Japanese character or symbol), the Chinese model had an encoding density of about 32% (about 3 English characters for each Chinese character or symbol), and the Arabic model had an encoding density of about 75% (about 4 English characters per each 3 Arabic characters or symbols).
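   This density measure amounts to a character-count ratio and can be computed as in the sketch below. The example strings are arbitrary stand-ins, not fragments of the actual experimental dialogues, and the whitespace handling is our own simplification.

def encoding_density(non_english_text, english_translation):
    """Ratio of non-space character counts: non-English text / English translation."""
    count = lambda s: sum(1 for ch in s if not ch.isspace())
    return count(non_english_text) / count(english_translation)

# Hypothetical example: a short Japanese fragment and its English rendering.
ja = "信頼性は人工知能の重要な特性です"
en = "Trustworthiness is an important property of artificial intelligence"
print("Encoding density: %.0f%%" % (100 * encoding_density(ja, en)))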




Figure 2: Example of the conversation in English with an 8 billion parameter LLM.

   Let us dwell in more detail on the answers of the different models, based on the assessment of the quality and completeness of the answers translated into English; the assessment was made on the basis that the quality of the answer should be at least 4 points on a 5-point scale, in order to exclude the influence of the translator on the quality of the text.
   Compared to the English model, the Japanese model demonstrated greater quality and completeness of responses, which manifested itself in more expressive, structured responses and was correspondingly reflected in the initial values of both confidence in the model and its ability to preserve the context of the conversation. The quality of the answers reached 5 points on a 5-point scale, which is the highest indicator among the models tested. The model showed an unambiguously good understanding of the concepts of trust in artificial intelligence, for example, a more detailed description and breakdown of definitions into separate categories and lists, highlighting the main structural categories and concepts of trust, their components and the connections between these concepts. The answers of this model took into account the wishes of the user (feedback) and contained advice on potential questions on the subject of trust in artificial intelligence. A significant feature of the model was also the encoding of complex concepts with a smaller number of characters. Overall, confidence in the model was on average higher than in the English-language M1P8 model, but it reached a plateau after which the differences between the English- and Japanese-language responses became smaller. Figure 3 demonstrates a fragment of a conversation with the Japanese-based LLM.
   The result of the Chinese-language model was interesting. Although it had three times higher encoding efficiency and a larger model size (9 billion parameters compared to 8 billion for English), the quality of its responses was lower than in English. This manifested itself, firstly, in less engagement in the conversation and less initiative in terms of providing leading questions, including prompts for further discussion. Also, the completeness of the answers was clearly lower, at the level of 3.5 points on average (compared to at least 4 points for the English-language model). However, the lower score did not result from a poor understanding of the concepts of trustworthiness - the model understood what trust is and could quote the ISO/IEC TR 24028 standard - but from the formulation of the answers, which was more of a reference than an original one. This can be caused both by restrictions on the length of the answer and by the style of the model itself, which aims to give shorter answers, as opposed to long ones, as, for example, in the Japanese model (fig. 4). Another reason for the low response quality score was that the model repeated old definitions and used less text from previous responses to improve its results.




Figure 3: Example of the conversation in Japanese with an 8 billion parameter LLM.




Figure 4: Example of the conversation in Chinese with a 9 billion parameter LLM.

    In the Arabic-language model, these differences were less significant (potentially because of the high density of encoded characters in written text). A feature of its answers was their shorter length (shorter than in English), but the quality was not as high as, for example, in Japanese, due to the inability to structure concepts into deeper categories (fig. 5). On the one hand, the model showed a good understanding of the topic of trust and showed the ability to predict the course of the conversation (albeit at a level somewhat lower than the Japanese one). Thus, the effectiveness of this model, as well as the conciseness of its answers, can be estimated at a level slightly higher than English, but lower than Japanese. In general, some answers were at the level of 4.5 points, which is in itself a good result for a monolingual model.

Figure 5: Example of the conversation in Arabic with a 7 billion parameter LLM.

3. Discussion
The results obtained as part of the experiments made it possible to formulate several ideas or
hypotheses and further directions of research into large language models.
    Controllability, stability, and reliability are important properties of language models, which are essentially closed-loop systems with a human in the decision-making loop [25]. Human participation in decision-making, such as feedback or response evaluation, is crucial. This can be observed in the language models at the level of 7 billion parameters: the M1P7 model showed reliable performance, with average but predictable responses, and maintained a horizontal trend with minor fluctuations, indicating its controllability and stability. However, as a counter-example, the other models with 7 billion parameters (M3P7, M4P7, M5P7, M6P7) showed unpredictable effects, which were expressed in decreasing trust, with a downward trend of the kind observed in much smaller models such as M1P1.5 or M1P3. This matches some observations regarding efficiency degradation in LLMs [26].
    In large language models acting as closed-loop automatic control systems with a human in the loop, any actions that do not match the predictions of one or the other participant (AI or human) cause perturbations or disturbances [27]. These perturbations are essentially latent, but they are expressed in misunderstanding of (or dissatisfaction with) the response by the user, or in misunderstanding by the model of the essence of the user's question or feedback, respectively. This causes non-linear dependencies of trust levels on the number of questions. In turn, it shows that the clarity of the answer and of the question are decisive in assessing the level of trust. This could be seen in the example of the model with 8 billion parameters, where slight deviations of the response from the expected one caused slight fluctuations in user confidence, and the disturbances transmitted through feedback produced a steadily increasing trend of confidence.
    No less important are the observation horizon and memory [28]. These two properties are as important as the connectivity (attention) and short-term memory of the model. This can be illustrated by an example: in the M1P1.5 model (1.5 billion parameters), efficiency increased over a short interval, but after the 5th question, due to a small amount of short-term memory, trust and efficiency decreased. In contrast to this model, in the M1P4 model (4 billion parameters) the presence of memory and the ability to retain the context of the conversation allowed the quality of answers to be preserved for at least 21 questions. As a counter-example, we can mention the model with 13 billion parameters, which, even with a larger short-term memory and observation horizon, received only a small gain from these indicators.
    It is also worth noting that the dependence of the trust value on the number of parameters and interactions does not always meet expectations; for example, the effect of an individual interaction is unpredictable (trust can either increase, decrease or remain the same). Therefore, in such models with memory, changes accumulate gradually and are realized in leaps. These effects were most clearly seen in the large language models with 4 and 8 billion parameters.
    As in big data models that have many interactions and users, where the actions of individual users can reinforce interactions beneficial for the whole system, self-reinforcing effects can also be observed at the micro level, for example in the interaction of one user with one AI system. Despite the small volume of interaction data (small data), these effects are sufficient to form a data value chain [29] and to evaluate data based on the context and the decision-making chain "action-response-utility". In essence, the utility (value) of data for the user is formed as a result of the generation of answers that satisfy the user's expectations based on their evaluation, which, in turn, drives the generation of data and the reinforcement of trust.
    Another result worth mentioning was the impact of the effectiveness of model training and the quality of its language encoding. In monolingual models, in this case English, Chinese, Japanese and Arabic, some effects were discovered whereby the efficiency of language encoding can affect model performance, which shows potential for the improvement of such models.
    Such results suggest that the user's satisfaction with the answers, and not the number of parameters, is the determining factor when training a language model. In our opinion, more effective application of architecture optimizations, tokenization of source texts, and a higher-quality training sample are the main criteria for achieving high quality of answers. However, it is worth noting that the advantages obtained through more efficient text encoding may themselves be diminished by the need for a translator and, therefore, by an invisible increase in the number of parameters, so such monolingual models are more likely to be useful for users who are native speakers of the given language in the first place, and only secondarily for others [30].
    So, one can assume that human-machine interaction when communicating with AI (using the
example of large language models) has a complex nature. Small bits of data generated at each step
affect the model's behavior and increase or decrease its performance over time.

4. Conclusions
This study analyzed how trust manifests itself during long-term interaction. As a result of experiments conducted on 14 large language models with 1.5-13 billion parameters, we have shown that trust is not a constant value over time and indeed has complex dynamics. This dynamic can be traced over time using such criteria as explainability, predictability and confidence, which form a general assessment of the value of trust.
   Each interaction potentially leads to growth of trust, to its decline, or to certain oscillatory processes. This means that each subsequent response can have both a positive effect (user satisfaction and positive feedback) and a negative one, which can lead to unpredictable outcomes. Context retention in the conversation is critical because the model can learn patterns of user behavior and reinforce responses that are more useful for the user. This is due to the presence of self-reinforcing effects, which, in the ideal case, increase confidence in the model compared to the initial level; as a consequence, confidence grows with time and with each interaction, and so does trust. As in big data models, the presence of feedback forms a data value chain that is linked to trust: the higher the value or utility of the data, the higher the trust.
   Thus, summarizing the obtained results, we may assume that the studied features of large language models reveal a number of potential problems in the field of trustworthiness in AI which should be solved when creating new models. Firstly, these models should be stable, understandable and predictable, especially in cases when the models are used as personal intelligence. The relatively small number of interactions and the small amount of data (compared to big data) indicate the need to develop large language models that adapt to the user and generate trustworthy responses using the context and memory of previous interactions.

   We also want to mention that the effects of self-reinforcement of trust can be drastically improved by good-quality training and more efficient encoding of the data by the large language models. For instance, by studying models in the 7-9 billion parameter range, we discovered that the use of a language with higher informational density can indeed (though not always) affect the quality of the answers, their conciseness, structure and context retention. For instance, in the Japanese large language model, compared to the English one with the same number of parameters, the trust curve increased at a higher rate, but, from our observations, it started to taper off at the same level as the English one after a long conversation, which shows the effect of diminishing returns on the improvement of model performance. As a counter-example, the Chinese large language model performed less efficiently than the English one, despite more efficient encoding of language concepts, making the effects of such encoding less predictable; this suggests that more effective application of architecture optimizations, as well as better training data, is key to success here.
   In further research we propose to extend our experimental study to other large language models, likely with a larger number of parameters (more than 13 billion), including the online models available at Hugging Face Spaces and other cloud resources.

Declaration on Generative AI
During the preparation stage of the study, the authors studied the behaviour of AI models, in particular large language models (LLMs), which were analyzed as the main subject of the study; a set of AI models was benchmarked against one another in order to establish the formal feasibility of using AI models for knowledge mining.
   In the paper we conducted research on a large family of language models, including:
   •   Meta Llama Instruct 3.0 8b,
   •   Mistral 2 7b,
   •   Microsoft Phi 3 mini instruct 4b,
and other large language models with numbers of parameters varying from 1.5 to 13 billion. In order to establish their utility for our tasks, we tested different LLM deployment software, such as Chat RTX and Ollama (Docker image), and different Python libraries for LLM deployment, including llama-cpp, exllama2 and ollama.
   Since AI models such as LLMs were used during our experiments as the main subject of the study, we acknowledge the usage of such AI models during our experiments and take responsibility for the usage of AI tools according to the rules of fair use of Generative AI, in order to preserve the academic integrity of our paper and its authorship. The authors reviewed and edited the paper in order to comply with the requirements for publishing a paper with this type of content.
   The content generated by LLMs, such as the responses of the models shown in figures 2, 3, 4 and 5, serves only illustrative purposes, namely to prove the feasibility of using LLMs for knowledge mining. Figures 2-5 were created by transformative work: copying the text into an editor, formatting the text, making a screenshot, and formatting the image using image editing software (MS Paint). Each of the figures above includes a question posed by the author and the answer given by an LLM in one of the 4 languages: English, Japanese, Chinese and Arabic, respectively.

References
[1] X. Han, Z. Zhang, N. Ding, Y. Gu, X. Liu, Y. Huo, J. Qiu, L. Zhang, W. Han, M. Huang, et al.,
    Pre-Trained models: past, present and future, AI Open (2021). doi:10.1016/j.aiopen.2021.08.002.
[2]  V. Kuznetsov, I. Krak, O. Barmak, A. Kulias, Facial expressions analysis for applications in
     the study of sign language, CEUR Workshop Proc. 2353 (2019) 159-172. doi:10.32782/cmis/2353-13.



[3]
      convolutions for sign language alphabet recognition, CEUR Workshop Proc. 3137 (2022) 1 10.
      doi:10.32782/cmis/3137-7.
[4]   NVIDIA ChatRTX. URL: https://www.nvidia.com/en-us/ai-on-rtx/chatrtx/.
[5]   J. Peddie, The GPU environment - software extensions and custom features, in: The History of
      the GPU - Eras and Environment, Springer International Publishing, Cham, 2022, pp. 251-281.
      doi:10.1007/978-3-031-13581-1_7.
[6]   X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong, Y. Hu, Y. Shi, Scaling for edge inference of deep
      neural networks, Nat. Electron. 1.4 (2018) 216-222. doi:10.1038/s41928-018-0059-3.
[7]   X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J.
      Du, et al., Few-shot learning with multilingual generative language models, in: Proceedings of
      the 2022 conference on empirical methods in natural language processing, Association for
      Computational Linguistics, Stroudsburg, PA, USA, 2022. doi:10.18653/v1/2022.emnlp-main.616.
[8]
     dynamic experimental databases, Processes 6.7 (2018) 79. doi:10.3390/pr6070079.
[9] X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, J. Tang, GPT understands, too, AI Open
     (2023). doi:10.1016/j.aiopen.2023.08.012.
[10] Z. Jiang, A. Anastasopoulos, J. Araki, H. Ding, G. Neubig, X-FACTR: multilingual factual
     knowledge retrieval from pretrained language models, in: Proceedings of the 2020 conference
     on empirical methods in natural language processing (EMNLP), Association for Computational
     Linguistics, Stroudsburg, PA, USA, 2020. doi:10.18653/v1/2020.emnlp-main.479.
[11] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K.
     Slama, A. Ray, et al., Training language models to follow instructions with human feedback,
     in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural
     Information Processing Systems, Curran Associates, Inc., 2022, p. 27730--27744. URL:
     https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731
     -Paper-Conference.pdf.
[12] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K.
     Slama, A. Ray, et al., Training language models to follow instructions with human feedback,
     in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural
     Information Processing Systems, Curran Associates, Inc., 2022, p. 27730--27744. URL:
     https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731
     -Paper-Conference.pdf.
[13] M. Hardalov, P. Atanasova, T. Mihaylov, G. Angelova, K. Simov, P. Osenova, V. Stoyanov, I.
     Koychev, P. Nakov, D. Radev, BgGLUE: A bulgarian general language understanding
     evaluation benchmark, in: Proceedings of the 61st annual meeting of the association for
     computational linguistics (volume 1: long papers), Association for Computational Linguistics,
     Stroudsburg, PA, USA, 2023. doi:10.18653/v1/2023.acl-long.487.
[14] P. C. Bauer, Clearing the jungle: Conceptualizing and measuring trust and trustworthiness,
     SSRN Electron. J. (2013). doi:10.2139/ssrn.2325989.
[15] ISO/IEC TR 24028:2020, Information technology - Artificial intelligence (AI) - Overview of
     trustworthiness in AI, 2020. URL: https://www.iso.org/standard/77608.html.
[16] B. Brodaric, F. Neuhaus (Eds.), Formal Ontology in Information Systems, IOS Press, 2020.
     doi:10.3233/faia330.
[17] A. Ferrario, Justifying our Credences in the Trustworthiness of AI Systems: A Reliabilistic
     Approach, SSRN Electron. J. (2023). doi:10.2139/ssrn.4524678.
[18] E. Katsamakas, O. V. Pavlov, Artificial intelligence feedback loops in mobile platform business
     models, Int. J. Wirel. Inf. Netw. (2022). doi:10.1007/s10776-022-00556-9.
[19] V. Kuznetsov, S. Kondratiuk, H. Kudin, A. Kulyas, I. Krak, O. Barmak, Development of models
     of self-reinforcing effects for big data evaluation, in: IEEE 18th International Conference on
     Computer Science and Information Technologies (CSIT), IEEE, 2023. doi:10.1109/csit61576.2023.10324045.
[20] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer,
     A. Courville, Y. Bengio, S. Lacoste-Julien, A closer look at memorization in deep networks, in:
     34th International Conference on Machine Learning, vol. 70, 2017, pp. 2337-2346. URL:
     https://arxiv.org/abs/1706.05394.
[21] I. Krak, V. Kuznetsov, S. Kondratiuk, L. Azarova, O. Barmak, P. Padiuk, Analysis of deep
     learning methods in adaptation to the small data problem solving, in: Lecture Notes in Data
     Engineering, Computational Intelligence, and Decision Making, Springer International
     Publishing, Cham, 2022, pp. 333-352.
[22]
                                                  50. doi:10.1007/978-3-319-23742-8_2.
[23] Assessing and improving AI trustworthiness: Current contexts and concerns, National
     Academies Press, Washington, D.C., 2021. doi:10.17226/26208.
[24] Z. Li, Z. Zhang, H. Zhao, R. Wang, K. Chen, M. Utiyama, E. Sumita, Text compression-aided
     transformer encoding, IEEE Trans. Pattern Anal. Mach. Intell. (2021) 1.
     doi:10.1109/tpami.2021.3058341.
[25]
                                                              81. doi:10.1049/sbra517e_ch3.
[26] X. Feng, X. Han, S. Chen, W. Yang, LLMEffiChecker: Understanding and testing efficiency
     degradation of large language models, ACM Trans. Softw. Eng. Methodol. (2024).
     doi:10.1145/3664812.
[27] K. A. Grasse, Nonlinear perturbations of control-semilinear control systems, SIAM J. Control
     Optim. 20.3 (1982) 311-327. doi:10.1137/0320024.
[28] L. Hewing, K. P. Wabersich, M. Menner, M. N. Zeilinger, Learning-based model predictive
     control: toward safe learning in control, Annu. Rev. Control, Robot., Auton. Syst. 3.1 (2020)
     269-296. doi:10.1146/annurev-control-090419-075625.
[29] H. Kasim, T. Hung, X. Li, Data value chain as a service framework: For enabling data handling,
     data security and data analysis in the cloud, in: 2012 IEEE 18th International Conference on
     Parallel and Distributed Systems (ICPADS), IEEE, 2012. doi:10.1109/icpads.2012.131.
[30] D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and
     Translate. URL: https://arxiv.org/abs/1409.0473.



