<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.48550/arXiv.2303.08774</article-id>
      <title-group>
        <article-title>A Hybrid Framework for COSMIC Measurement: Combining Large Language Models with a Rule-Based System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Safae Laqrichi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Arts et Métiers Campus of Rabat</institution>
          ,
          <addr-line>Technopolis Rabat-Shore, Rabat 11103, Maroc</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>2207</volume>
      <issue>2207</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Accurate Functional Size Measurement (FSM) is crucial for effective project management and resource allocation in software development. The COSMIC FSM provides a consistent and standardized method, allowing for objective comparisons and accurate estimation of effort. However, manual FSM is often time-consuming, error-prone, and subject to assumptions and human bias. Automating this process can greatly benefit organizations but introduces challenges associated with handling requirements expressed in Natural Language (NL). In this paper, we explore the integration of Large Language Models (LLMs) with rule-based systems to enhance the automation of COSMIC Function Point measurement. Our hybrid framework combines the contextual understanding and adaptability of LLMs, such as GPT-4, with the precision and consistency of rule-based engines, crucial for tasks requiring strict adherence to rules and regulations. This approach enables a more robust analysis, processing of NL requirements, and precise application of the COSMIC measurement method. The effectiveness and feasibility of this approach are demonstrated through a series of experiments conducted on COSMIC public case studies. Our results show a promising F1-score of 0.95 in identifying COSMIC key components using GPT-4, an F1-score of 0.97 in deducing data movements using the rule-based system, and a global error rate of only 6.6% in measuring COSMIC function points using the integrated framework. These results underscore the potential for more reliable and efficient system analysis through the synergistic capabilities of LLMs and rule-based systems.</p>
      </abstract>
      <kwd-group>
        <kwd>COSMIC FSM</kwd>
        <kwd>Automation</kwd>
        <kwd>LLM</kwd>
        <kwd>GPT-4</kwd>
        <kwd>ChatGPT</kwd>
        <kwd>Rule-based-system</kwd>
        <kwd>Prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Software measurement plays a crucial role in software development by providing quantitative
insights into various aspects of software products, such as size, complexity, and effort estimation.
Accurate measurement of software requirements is particularly important as it forms the
foundation for successful project planning and management. The COSMIC function point (CFP)
method is a widely accepted approach for measuring software size and estimating effort based
on functional requirements analysis [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, the process of manually measuring software
requirements can be time-consuming, error-prone, and subject to interpretation biases [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        To address these challenges, there is growing interest in leveraging Large Language Models
(LLMs) like GPT-3 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and GPT-4 [4], which use powerful NLP techniques and deep learning
algorithms. They have demonstrated remarkable capabilities in natural language processing
(NLP) and generation, raising the possibility of automating the measurement of natural language
(NL) software requirements using the COSMIC function point method.
      </p>
      <p>In this paper, we propose an approach that utilizes LLMs for automating the measurement of
NL software requirements in the COSMIC function point method. Our objective is to leverage
the language understanding and generation capabilities of these models to analyze and measure
the functional aspects of software requirements, thereby streamlining the measurement process
and reducing the manual effort involved.</p>
      <p>Prior research has explored various techniques for automating software requirement
measurement, including rule-based approaches, machine learning-based models, and ontology-based
methods [5]. However, the recent advancements in LLMs offer an exciting opportunity to overcome
the limitations of previous approaches and improve the accuracy and efficiency of software
requirement measurement.</p>
      <p>The remainder of this paper is organized as follows. Section 2 provides background
information on the COSMIC function point method and reviews the related work on automating
COSMIC FSM for NL requirements. Section 3 describes our proposed framework using LLMs
and outlines the methodology involved. Section 4 presents the results of our experiments and
evaluates the performance of the applied LLM in measuring NL software requirements. Finally,
section 5 concludes the paper and outlines potential avenues for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <sec id="sec-2-1">
        <title>2.1. COSMIC Function point method</title>
        <p>Numerous studies have demonstrated the effectiveness of the COSMIC method as a functional
size measurement (FSM) approach. The COSMIC method has gained global adoption due
to its flexibility, which allows it to measure any type of software, including web applications, agile
development contexts, and large-scale software projects.</p>
        <p>The fundamental concept underlying the COSMIC method is that, for many types of software,
a significant portion of development efforts is dedicated to handling data movements between
persistent storage and users. Therefore, the total number of these data movements can provide
valuable insight into the size of the system [6, 7].</p>
        <p>In COSMIC FSM, the data movement is the Base Functional Component; each data movement
moves a single data group. To measure these data movements, three elements need to be identified. A Data
Group is a distinct, non-empty, non-ordered and non-redundant set of data attributes where
each included data attribute describes a complementary aspect of the same one object of interest.
An object of interest is any ‘thing’ that is identified from the point of view of the FUR, about
which the software is required to process and/or store data. A Data Attribute is the smallest parcel
of information, within an identified data group, carrying a meaning from the perspective of the
software’s FUR.</p>
        <p>Data movements are categorized based on their direction and purpose within a system: An
Entry (E) transfers data from a user to the necessary functional process. An Exit (X) sends data
from a process back to the user. A Read (R) retrieves data from persistent storage for use within
a process, and a Write (W) stores data from a process into persistent storage.</p>
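        <p>In code, the four movement types and the COSMIC counting convention (one CFP per identified data movement) can be sketched as follows; the names and the example process are illustrative, not taken from any implementation in this paper:</p>

```python
from enum import Enum

class DataMovement(Enum):
    """The four COSMIC data movement types."""
    ENTRY = "E"   # from a user into the functional process
    EXIT = "X"    # from the functional process back to a user
    READ = "R"    # from persistent storage into the process
    WRITE = "W"   # from the process into persistent storage

def cosmic_size(movements):
    """COSMIC functional size in CFP: each data movement counts as 1 CFP."""
    return len(movements)

# A hypothetical "place order" process: one Entry, one Write, one confirmation Exit
place_order = [DataMovement.ENTRY, DataMovement.WRITE, DataMovement.EXIT]
```
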
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Software requirements in Natural Language</title>
        <p>The majority of software requirement specifications are written in NL, including initial
conceptions and requests for proposals (RFPs). However, NL is inherently ambiguous, leading to
messy software requirements specifications (SRS). There is a trade-off between using NL or a
mathematics-based formal language (MBFL). NL allows for easy understanding by stakeholders,
although interpretations may vary. MBFL is unambiguous, but few practitioners can write it, and
stakeholders may struggle to understand it. Research in requirements engineering (RE) aims to address the
ambiguity of SRSs by promoting MBFLs and improving their accessibility. NL SRSs are inevitable as
they bridge the informal ideas of software development with formal coding. The transition from
informal to formal occurs during the requirements engineering process. NL remains essential,
even if only during the initial conception [8]. However, requirements that are expressed in
NL and in a free-form manner are susceptible to being imprecise, long and incomplete [9].
These characteristics pose significant challenges for efficient and standard measurement,
particularly in contexts like COSMIC function point analysis, which requires a high degree of clarity,
completeness, unambiguity and a high level of granularity.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Literature review of COSMIC FSM automation studies</title>
        <p>The automation of the COSMIC method has sparked a lot of interest over the past decades.
Several studies have investigated both structured FURs, such as UML diagrams, and unstructured
FURs, like free-form text or NL.</p>
        <p>To the best of our knowledge, limited academic research has been conducted on
automating functional measurement using NL specifications; only a few studies
[10, 11, 12, 13, 14, 9] have addressed this topic. In [10], Hussain and Ormandjieva
proposed an innovative concept based on a supervised approach to word processing. Their
method involved identifying parts of speech and associating them with specific features to
automate functional measurement from informal textual specifications. However, their method
had limitations, such as not considering context and assuming consistent language styles in
specifications.</p>
        <p>In his thesis [11], Hussain proposed a method combining machine learning and NLP. The
method involved preprocessing the data by tokenizing, separating sentences, tagging part of
speech, and extracting names. The machine learning algorithm was used for classifying input
data into types of data movement. While the results were satisfactory, the use of tokenization
without considering context was acknowledged as a limitation.</p>
        <p>In a different study [12], the authors presented an NLP method for functional measurement based
on the extraction of part of speech. They focused on specifications written in SBVR (Semantic
Business Vocabulary and Rules) syntax and used a parsing method to extract object classes.
However, limitations were observed due to the restrictions of the parsing method and the fact
that SBVR specifications were not common.</p>
        <p>Another study [13] proposed a contextual NLP model (CNLP) and a flexible NL representation
called "two level grammar (TLG)." However, the method has not been validated yet.</p>
        <p>Tripathy et al. [14] presented a technique for transforming NL specifications into
object-oriented structures using the part-of-speech (POS) NLP methodology. They aimed to generate a
tree structure and a conceptual model for function point counting. The authors acknowledged
the need to optimize the parsing step and expressed the intention to test the model on multiple
use cases.</p>
        <p>Bagrayanik et al. [5] designed a new requirements engineering ontology and developed a
method to automatically measure software size in COSMIC FP using the ontology. The
method has been validated using real projects conducted within the ICT department of a leading
telecommunications provider in Turkey. The results demonstrated a consistent agreement
between manual and automated measurement outcomes. One limitation of this study is
the need to specify requirements in terms of the proposed requirements ontology. Another
limitation arises from the fact that the proposed method exclusively supports the COSMIC
Function Point method as the functional size measurement methodology. This limitation is
particularly unfavorable for companies that adopt other FSM methods, such as IFPUG.</p>
        <p>A more recent research study [9] introduced
ScopeMaster®, a commercial tool automating COSMIC FSM on requirements in user story format. The
tool’s methodology leverages NLP and pattern matching techniques. It focuses on identifying
the "Subject verb object" structure within requirements to extract the necessary measurement
elements. For instance, in a user story like "As a user, I want to display orders," the subject
"user" is considered a candidate for the Functional User, the object "orders" is a candidate for an
Object of Interest, and the verb "display" corresponds to one of the functions from the CRUDL
(Create, Read, Update, Delete, List) set. The specific details of how ScopeMaster® performs these
techniques are proprietary and subject to a pending patent application. Currently, the examples
handled by the tool are only in user story format; there is no experimentation conducted on free-form
text FUR. Another mentioned limitation pertains to the non-customizability of the used ontology.
For instance, the list of recognized verbs is fixed and identical across all projects. However, there
may arise a need for nuances or combinations of verbs to accommodate diverse development
contexts.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Capabilities of LLMs in applying COSMIC Function Point principles</title>
      <p>
        LLMs [
        <xref ref-type="bibr" rid="ref3">3, 15</xref>
        ] such as GPT-4 [16], PaLM [17] and LLaMA [18], have shown impressive performance
in various NLP tasks and gained significant attention from academia and industry. These models
leverage extensive pre-training on massive datasets and reinforcement learning from human
feedback (RLHF) [19] to excel in language understanding, generation, interaction, and reasoning.
      </p>
      <p>LLMs typically refer to Transformer-based models with hundreds of billions of parameters.
They use multiple layers of attention mechanisms, known as multi-head attention, in a deep
neural network architecture. LLMs implement Transformer architectures and
pre-training objectives similar to smaller models like BERT [20]. Scaling involves increasing the number of
parameters, training data, and computational resources to enhance performance [21].</p>
      <p>
        LLMs exhibit emergent abilities, which are not present in smaller models but arise in larger
ones [21]. Typical emergent abilities include in-context learning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], instruction following [22],
and chain-of-thought prompting [23]. These abilities open up possibilities for building advanced
AI systems [24].
      </p>
      <p>For our research, we explored instruction-following LLMs. These models understand a wide
range of user instructions, also called prompts, and produce contextually and grammatically
correct responses. Due to their generalization ability, they perform strongly on unseen tasks
presented as instructions without needing explicit examples [25, 26].</p>
      <p>Applying the COSMIC method requires high-level reasoning and strict adherence to specific
rules. Precision in applying these rules is imperative; failure can result in approximate rather
than accurate measurements.</p>
      <p>Using rule-based prompts, chain-of-thought, or least-to-most prompting [27] can guide the
LLM to generate responses within a specific framework of rules. This is effective for systems
with easily stated rules, but less so for complex tasks with hierarchical rule structures, like
COSMIC measurement.</p>
      <p>An alternative approach integrates a rule-based system with LLM prompting. The rule-based
component refines the LLM’s output to ensure adherence to predefined rules. This hybrid model
leverages the LLM’s advanced NL understanding and generation capabilities while ensuring
compliance with intricate rules, advantageous in scenarios demanding high precision and strict
rule compliance.</p>
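      <p>A minimal sketch of such a hybrid pipeline, with the LLM step injected as a function and the rules as predicates (all names hypothetical, purely for illustration):</p>

```python
def hybrid_measure(requirement_text, llm_extract, rules):
    """Illustrative hybrid pipeline: an injected LLM step proposes candidate
    components from NL text; a rule-based pass keeps only candidates that
    satisfy every predefined rule."""
    candidates = llm_extract(requirement_text)  # flexible NL understanding
    return [c for c in candidates if all(rule(c) for rule in rules)]

# Stub LLM and a single rule, standing in for the real components
fake_llm = lambda text: [{"verb": "display", "object": "orders"},
                         {"verb": "foo", "object": ""}]
nonempty_object = lambda c: bool(c["object"])  # rule: candidate must name an object
```
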
    </sec>
    <sec id="sec-4">
      <title>4. Integrating LLMs and a Rule-Based System for Automated COSMIC Measurement</title>
      <p>Our framework represents an initial attempt to harness the capabilities of LLMs combined with
rule-based systems for COSMIC measurement. This preliminary research is based on several
assumptions: (i) the software is categorized as a new development project, (ii) the software
architecture follows a single-layer structure, eliminating the need to identify multiple layers, and
(iii) the scope of measurement is clearly defined, with the requirements document exclusively
describing the functionalities within this scope.</p>
      <p>Consequently, our framework for automating COSMIC measurement will bypass the steps of
determining the purpose and scope of measurement, as well as the identification of architectural
layers.</p>
      <p>Our framework is structured into two major stages as illustrated in Figure 1: a) identification
of COSMIC key components using LLM, and b) deduction of data movements using rule-based
system.</p>
      <p>LLMs are utilized for the NL understanding, processing, and information extraction tasks
required for stage a) of the framework. The highly rule-based task of deducing data movements
in stage b) is managed using a rule-based system. This approach ensures that the analytical and
interpretative tasks are effectively addressed by leveraging both advanced machine learning
techniques and structured logical processes.</p>
      <sec id="sec-5-1">
        <title>4.1. COSMIC key components identification using LLM</title>
        <p>To identify these COSMIC key components, our methodology, depicted in Figure 1, requires
executing various steps of the COSMIC process. This methodology comprises five distinct
steps: identifying Functional User Requirements (FUR), Functional Users, Functional Processes,
Data Groups, and finally, the identification of Sub-process Components. The latter step, not
explicitly outlined in the official manuals, aims to properly partition the functional process into
sub-processes while adhering to the granularity required for COSMIC measurement.</p>
        <p>Furthermore, we have identified additional critical components essential for the final phase
of identification and classification of data movements, such as action verbs, destinations, and
sources of data. Each component plays a vital role in both understanding and quantifying data
movement. Definitions of these additional key components are:
• Action verb: This specifies the operations performed on the data, such as entering,
exiting, reading, writing, sending, or receiving, detailing the manner in which the data is
transferred or altered.
• Destination: This indicates where the data is directed or the endpoint of the action. The
destination may be internal (within the system) or external (to another system or a user).
• Source: This identifies the origin of the data, which may also be internal (originating
within the system) or external (coming from another system or a user).</p>
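        <p>A quadruplet of these components, together with the data group it concerns, can be modeled as a small immutable record; the field names are our own illustration, not taken from the paper's implementation:</p>

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubProcessComponents:
    """One candidate data movement extracted per sub-process.
    frozen=True makes instances hashable, so duplicates can later be
    removed (as rule RU2 requires)."""
    action_verb: str   # e.g. "enter", "read", "write", "send"
    data_group: str    # the data group being moved
    source: str        # origin: internal (within the system) or external
    destination: str   # endpoint: internal or external
```
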
        <p>The integration of both the source and the destination is critical as they delineate the trajectory
of the data movement, essential for gauging the type of each interaction within the system. The
source is particularly significant in establishing the starting point of the data flow, which is
vital for analyzing data transfer across system boundaries, especially in complex systems with
numerous interacting external applications or subsystems.</p>
        <sec id="sec-5-1-1">
          <title>4.1.1. Definitions and rules adaptations for improved linearity</title>
          <p>The process described in the official manual appears linear at first glance. However, some
rules establish a component’s definition based on subsequent components in the procedure,
introducing an inherent non-linearity. For instance, Rule 7 of the official COSMIC manual for
functional users identification states that "all functional users that trigger, provide information
to, or receive information from functional processes in the FUR of the software within the scope
of the FSM shall be identified". Yet, these functional processes themselves are only determined
at a later stage in the procedure. Another example is Rule 10 and its Guidance, which specify
that to identify a functional process, one must identify the "Triggering Event" that causes a
functional user to generate a data group, thereby initiating a sequence of data movements. At
this stage, the definitions of ’data group’ and ’data movement’ remain unspecified, with their
detailed identification addressed later in the COSMIC process. This observation highlights the
need to adapt definitions to ensure better linearity of the process and facilitate the automation
of the COSMIC methodology using LLM prompting.</p>
          <p>These insights have guided our approach to ensure that each step logically follows the
previous one, enhancing the clarity and effectiveness of the COSMIC methodology application.
Below, we provide the definitions and rules used at each step.</p>
          <p>• Step 1: Identification of Functional User Requirements (FUR)</p>
          <p>The LLM is used to examine user requirements documents to identify FUR, suggest
improvements, and ensure they are measurable in FSM methods. If the requirements are
in free-form text, the LLM breaks them down into FUR. In agile developments, these FUR
can be transformed by the LLM into User Stories (US) efficiently.
• Step 2: Identification of Functional Users</p>
          <p>Functional users, as defined in the COSMIC methodology, are identified based on their
role as senders or recipients of data. In this definition, we do not reference functional
processes at this step, as they have not yet been defined.
• Step 3: Identification of Functional Processes</p>
          <p>Functional processes are identified by recognizing triggering events, the functional user
responding to these events, and the data attributes sets initiated representing the
Triggering Entry (TEn). In this step, data groups are not referenced, as they have not yet been
defined; the term "data attribute sets" is used instead.
• Step 4: Identification of Data Groups</p>
          <p>The Data Group is a critical concept in many functional sizing methods, including COSMIC.
In COSMIC method, data groups are identified by pinpointing objects of interest and their
data attributes sets, which include frequency occurrence and identification keys. Data
representing control commands are not data groups. Data attribute sets that describe
different objects of interest represent different data groups. For data attribute sets that
describe the same object of interest, a straightforward rule determines whether they
represent distinct data groups: if two data attribute sets differ in either their key attribute
or their frequency of occurrence, they are considered separate data groups.
• Step 5: Identification of Sub-process Components</p>
          <p>This step involves partitioning functional processes into steps, then identifying
sub-processes to meet the granularity required for COSMIC measurement and extracting
complementary key components for identifying and classifying data movements.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>4.1.2. Prompt Design and Optimization</title>
          <p>The prompts used at each stage of the methodology have been carefully designed, tested, and
optimized to fully leverage the capabilities of the LLM through prompt engineering techniques.
Table 1 synthesizes the components of these prompts. They may range from simply defining
the key component to be identified and specifying the required task, to outlining detailed steps
necessary for identifying the target component. The prompts were crafted based on COSMIC
method rules from the official manual, and the ’Rule Reference’ column in the table indicates
the specific rules (R) and guidelines (GR) from the manual applied at each step.</p>
          <p>Here is the structured approach we used for crafting effective prompts:
• Introduction and Context: Introduce the task’s background and contextual details,
including specific roles or expertise required (e.g., a software measurement expert for COSMIC
tasks).
• Task Description: Clearly describe the objective or task to be achieved.
• Step-by-Step Instructions: Break down the task into sequential stages, providing clear
definitions and instructions to guide the language model or tool.
• Expected Output Format: Define the desired structure or format of the output, specifying
essential attributes or information.
• Example or Scenario: Utilize few-shot learning techniques by presenting an example or
scenario that illustrates the task or problem, thereby enhancing the quality of responses.</p>
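          <p>A minimal builder assembling these five components into a single prompt string; the exact layout is a sketch of the structure described above, not the authors' verbatim template:</p>

```python
def build_prompt(role, task, steps, output_format, example, fur):
    """Assemble a prompt from the five components: role/context, task
    description, step-by-step instructions, expected output format, and
    a few-shot example, followed by the FUR to analyze."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (f"Role: {role}\n\n"
            f"Task Description: {task}\n\n"
            f"Step-by-Step Instructions:\n{numbered}\n\n"
            f"Expected Output Format: {output_format}\n\n"
            f"Example: {example}\n\n"
            f"Here is the FUR:\n{fur}")
```
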
          <p>Example of a prompt for identifying functional processes:
Role: You are a software measurement expert specializing in the COSMIC method for functional
size measurement.</p>
          <p>Task Description: Your task involves identifying functional processes from the provided
Functional User Requirements (FUR).</p>
          <p>Step-by-Step Instructions: The process involves the following steps:
1. Identify the Triggering Events: These are distinct events in the world of the functional
users that the software being measured must respond to.
2. Identify the implied Functional User types: these are functional users of the software that
are likely to respond to each triggering event.
3. Identify the Triggering Entry: these are the data that each functional user may generate
in response to the event.
4. Identify the Functional Process: the process initiated by each Triggering Entry. This process
comprises all operations necessary to fulfill its FUR, addressing all potential responses to
the triggering event.</p>
          <p>Expected Output Format: Present the result in this Python dictionary format:
functional-process-format = { "Add-Professor":
    { "Triggering Events": "New employment",
      "Functional User": "Administrator",
      "Triggering Entry": "Professor details" } }
Example: "As a student, I want to be able to check my course schedule."
Identified functional process:
{ "Check-schedule":
    { "Triggering Events": "Student request",
      "Functional User": "Student",
      "Triggering Entry": "Student ID" } }
Here is the FUR:</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>4.1.3. Adjusting GPT-4 sampling temperature</title>
          <p>Several inference hyperparameters can be adjusted to modify GPT-4’s outputs at runtime.
These include the sampling temperature, Top-p (nucleus sampling), and the frequency penalty [28].</p>
          <p>The most critical parameter in controlling the behavior of LLMs during our process is the
sampling temperature, which affects the randomness and the variability of the model’s output.
Lower temperatures exploit more probable solutions, while higher temperatures explore the
solution space more broadly [29]. For our measurement task, preliminary tests indicated that
lower temperatures (e.g., 0.2 to 0.3) make the output more deterministic and focused, leading
to consistent and accurate identification of COSMIC key components. Conversely, higher
temperatures (e.g., 0.7 to 1.0) introduce more variability and creativity, which can sometimes
reduce precision. Therefore, we fixed the sampling temperature at 0.2 to ensure that GPT-4’s
responses adhered closely to the COSMIC methodology’s rules and guidelines. Regarding the
Top-p value and frequency penalty hyperparameters, they are maintained at default settings,
typically at 1 for Top-p to allow for a full spectrum of probable outcomes, and 0 for the frequency
penalty to avoid artificially inhibiting the model’s NL generation capabilities. These settings
help balance the model’s creativity with the need for accuracy and relevance in its outputs.</p>
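          <p>Assuming the OpenAI Python SDK, the settings above correspond to the following request parameters; the helper name is ours, and the dict can be passed as <italic>client.chat.completions.create(**gpt4_request(prompt))</italic>:</p>

```python
def gpt4_request(prompt: str) -> dict:
    """Build the chat-completion parameters with the settings chosen above;
    keys follow the OpenAI chat completions API keyword arguments."""
    return {
        "model": "gpt-4",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,      # low: deterministic, rule-adherent extraction
        "top_p": 1,              # default: full spectrum of probable outcomes
        "frequency_penalty": 0,  # default: no artificial repetition penalty
    }
```
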
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Rule-based system for data movements deduction</title>
        <p>Rule-based systems are an important class of expert systems. They are fundamentally composed
of a series of IF-THEN rules, which can serve various domains, including decision support and
predictive decision-making across real-world applications. These rules are generally constructed
through the use of expert knowledge or through learning from real data. These rules can
represent the specialised knowledge of a task area.</p>
        <p>The table below provides a detailed overview of the COSMIC measurement rules implemented
within our rule-based engine. These rules are derived and synthesized from the official COSMIC
measurement manual, further enriched by the extensive practical experience I have gained
through applying COSMIC Function Points across a variety of projects during my tenure at
Estimancy. Each rule is precisely defined to ensure the accurate enumeration and classification
of data movements, tailored to specific actions and data interactions. Organized for clarity, the
table delineates the conditions under which each rule applies, the actions involved, the sources
or destinations impacted, the resulting data movement type, and the counting methodology
employed. This integration of established guidelines with empirical insights not only solidifies
the theoretical basis of our rules but also confirms their practical efficacy, providing a robust
framework for accurately quantifying software size and complexity in diverse environments.</p>
        <p>RU1: A triggering entry (TEn) must trigger only one functional process. For the set of identified
functional processes, inspect the triplets (FP, TEn, FU); IF there are many triplets with the same
TEn, THEN merge them into one FP.</p>
        <p>RU2: For each functional process, inspect all quadruplets (action verb, data groups, destination,
source). IF there are repetitions, THEN remove them from the dataset to ensure uniqueness of
data movements.</p>
        <p>RU3: For each quadruplet (action verb, data groups, destination, source), deduce the data
movements and their types using Table 2.</p>
        <p>Table 2. Deduction of data movement types from the action type, source, and destination:
RU3-1: Input, from FU to Software: Entry.
RU3-2: Retrieve, from Internal storage to Software: Read.
RU3-3: Store, from Software to Internal storage: Write.
RU3-4: Output, from Software to FU: Exit.
RU3-5: Retrieve, from External Application to Software: Exit + Entry.</p>
        <p>RU4: For each functional process, verify the presence of an Entry data movement. IF no Entry
data movement is identified, THEN include an additional Entry to the total count of identified
data movements.</p>
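        <p>Rules RU2 through RU4, with the RU3 lookup mirroring Table 2, can be sketched as IF-THEN logic over the extracted quadruplets; all names are illustrative, not the engine's actual code:</p>

```python
def deduce_movements(quadruplets):
    """Deduce typed data movements for one functional process.
    quadruplets: (action_type, data_group, source, destination) tuples."""
    ru3 = {  # RU3: (action type, source) -> deduced movement type(s)
        ("Input", "FU"): ("Entry",),
        ("Retrieve", "Internal storage"): ("Read",),
        ("Store", "Software"): ("Write",),
        ("Output", "Software"): ("Exit",),
        ("Retrieve", "External Application"): ("Exit", "Entry"),
    }
    movements = []
    # RU2: drop exact repetitions while preserving order
    for action, _group, source, _dest in dict.fromkeys(quadruplets):
        movements.extend(ru3.get((action, source), ()))
    # RU4: every functional process must contain at least one Entry
    if "Entry" not in movements:
        movements.append("Entry")
    return movements
```
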
      </sec>
      <sec id="sec-5-3">
        <title>4.3. Implementation details</title>
        <p>The architecture of our hybrid framework, depicted in Figure 2, is implemented in Python. The
"Prompt Dispatch System" handles the transformation of user requirements into structured
prompts that are processed by GPT-4. The outputs from GPT-4, which include identified
COSMIC key components, are then relayed to the Rule-Based System. This architecture offers
flexibility by allowing easy replacement of the LLM, should the need arise, ensuring adaptability
to new developments and improvements in language model technology.</p>
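        <p>The replaceable-LLM design can be sketched as below (a minimal sketch with hypothetical names, not the paper's actual code): the LLM sits behind a plain callable so another model can be swapped in without touching the rest of the pipeline.

```python
# Minimal sketch of the Figure 2 pipeline; all names here are hypothetical.
from typing import Callable

def build_prompt(requirements: str) -> str:
    # Prompt Dispatch System: turn raw requirements into a structured prompt.
    return "Identify the COSMIC key components in:\n" + requirements

def measure(requirements: str,
            llm: Callable[[str], dict],
            rules: Callable[[dict], list]) -> int:
    components = llm(build_prompt(requirements))  # e.g. GPT-4 behind an API
    movements = rules(components)                 # Rule-Based System (RU1-RU4)
    return len(movements)                         # COSMIC: 1 CFP per data movement
```
</p>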
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Evaluating the approach through public case studies</title>
      <sec id="sec-6-1">
        <title>5.1. Case studies description</title>
        <p>These case studies can be accessed in the Case Study section of the COSMIC website [30]. The case studies selected for this experiment are examples of Real-Time and Business Applications. The functional requirements of the software presented in the case studies are documented in various formats: free-form text, user stories, and UML use cases. Descriptions of the case studies are presented in Table 3.</p>
        <p>As our approach focuses on NL processing leveraging LLMs, our experiments are held on the free-form text (FFT) and user stories (US) formats. In order to compare the performance of the proposed approach across the two formats, a manual rewriting task was undertaken to convert functional requirements originally documented in FFT to the US format. For the purpose of comparison and testing, the datasets used in this study can be accessed in my Git repository [31].</p>
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Evaluation Method</title>
        <p>The outputs of our automated system are of two distinct types: textual components such as FUR,
FP, FU, DG and TE and categorical components such as action verb type, source, destination, data
movement type and data movement number. Each type requires a different evaluation approach
to accurately measure the performance of our COSMIC Function Point method automation.</p>
        <sec id="sec-6-2-1">
          <title>5.2.1. Evaluation of categorical components</title>
          <p>For components classified into predefined categories by our automated system, we employed
standard classification metrics such as precision, recall, and F1 score to evaluate performance.</p>
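          <p>These metrics follow directly from the TP/FP/FN counts; a minimal sketch:

```python
# Standard classification metrics computed from TP/FP/FN counts
# (textbook definitions, shown here only for concreteness).
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)
```
</p>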
        </sec>
        <sec id="sec-6-2-2">
          <title>5.2.2. Evaluation of textual components</title>
          <p>For the textual components, we used a tailored evaluation strategy that integrates precision, recall,
and semantic similarity metrics. This section details the methodology used to calculate True
Positives (TP), False Positives (FP), and False Negatives (FN) by leveraging the semantic similarity
provided by GPT-4 embeddings. The following steps outline the evaluation process in detail:
• Data Preparation: We defined two sets of components: the reference set (ground truth)
with manually identified COSMIC key components and the output set generated by our
approach.
• Embedding Generation: Using the OpenAI GPT-4 API, we generated high-dimensional
vector embeddings for each component in both sets, enabling cosine similarity
calculations.
• Similarity Calculation: Cosine similarity measured the semantic equivalence between
reference and output components. We created a similarity matrix where each element
represents the cosine similarity score between a reference component and an LLM output
component.
• Threshold Determination: A similarity threshold of 0.8 was set based on preliminary
experiments. Components with a cosine similarity score equal to or higher than this
threshold were considered equivalent.
• Matching and Evaluation: For each reference component, we identified the output
component with the highest similarity score above the threshold. Components were classified
as follows:
– True Positives (TP): Reference components with a corresponding output component
above the threshold.
– False Positives (FP): Output components not matching any reference component.
– False Negatives (FN): Reference components without a corresponding output
component above the threshold.</p>
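          <p>The matching step above can be sketched as follows (an illustrative sketch, not the paper's code; it assumes each component has already been embedded into a vector, e.g. via the OpenAI embeddings API, and greedily matches each reference to its best-scoring unused output):

```python
# Sketch of threshold-based semantic matching over precomputed embeddings.
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def match_components(reference, output, threshold=0.8):
    """Count TP/FP/FN by matching reference embeddings to output embeddings."""
    tp, matched = 0, set()
    for ref_vec in reference:
        scores = [(cosine(ref_vec, out_vec), j)
                  for j, out_vec in enumerate(output) if j not in matched]
        best_score, best_j = max(scores, default=(0.0, None))
        if best_score >= threshold:
            tp += 1
            matched.add(best_j)
    fn = len(reference) - tp          # references with no match
    fp = len(output) - len(matched)   # outputs matching no reference
    return tp, fp, fn
```
</p>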
          <p>By using this dual evaluation methodology, we comprehensively assess our approach's performance, capturing both exact matches and semantically equivalent variations in the identified COSMIC key components, for textual and categorical components alike.</p>
        </sec>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Results of framework experiments on case studies</title>
        <p>To evaluate the effectiveness of our hybrid framework for COSMIC measurement, we conducted
two separate experiments. The first experiment assessed the performance of the LLM in
identifying COSMIC key components which are textual. The second experiment evaluated the
rule-based system’s accuracy in deducing data movements.</p>
        <sec id="sec-6-3-1">
          <title>5.3.1. COSMIC key component identification Using LLM</title>
          <p>When applying the designed prompt-based strategy to the case studies to identify the COSMIC key components, the results presented in Table 4 indicate favorable F1-scores for the identification of key components using the LLM in each case study.</p>
          <p>The ’C-REG’ System and RestoSys demonstrate high F1-scores (close to or at 1.00) across almost all categories, indicating that the LLM-based approach is highly effective in accurately identifying key components for the business application type.</p>
          <p>The Rice Cooker case study shows slightly lower F1 scores, particularly for Functional
Processes (0.86), Triggering Events (0.86), Data Groups (0.88), and sub-process components
(0.92). This suggests some challenges in identifying these components in more complex or
nuanced scenarios related to real-time software.</p>
          <p>The overall average F1-scores for all components remain high, with values ranging from 0.90
to 1.00 and a global F1-score of 0.95, highlighting the effectiveness of using an LLM for COSMIC
key component identification across different case studies.</p>
        </sec>
        <sec id="sec-6-3-2">
          <title>5.3.2. COSMIC Data Movements Deduction Using Rule-Based System</title>
          <p>Table 5 presents the F1-scores for the classification of sub-processes using the rule-based system in each case study. This table includes the scores for the four types of data movements: Entry, Read, Write, and Exit, along with an average score for each software system and overall metric averages across all systems.</p>
          <p>The results demonstrate high accuracy for the C-REG system and RestoSys, with slightly
lower performance for the Rice Cooker case study. This discrepancy may be attributed to
the vocabulary and context specific to the real-time domain, which differ from the business
application vocabulary on the basis of which the majority of the rules were built.</p>
          <p>The overall metric average across all systems and data movement types is 0.97, indicating that the rule-based system is highly effective and reliable in classifying data movements across different case studies. However, further improvements may be required for real-time software.</p>
          <p>The resulting total COSMIC Function Point (CFP) sizes calculated using our global framework
for the case studies are presented in Table 6. They exhibit an average error of 6.6%, indicating
that our automated approach yields reasonably accurate estimations of the functional size of
software.</p>
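          <p>The reported average error can be computed from the measured and reference CFP sizes as below (the size pairs here are made-up values for illustration, not the figures in Table 6):

```python
# Percentage error of an automated CFP size against the manual reference.
def error_rate(measured_cfp, reference_cfp):
    return abs(measured_cfp - reference_cfp) / reference_cfp * 100

# Hypothetical (automated, manual) CFP pairs, not the paper's data.
sizes = [(15, 16), (28, 26)]
average_error = sum(error_rate(m, r) for m, r in sizes) / len(sizes)
```
</p>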
        </sec>
      </sec>
      <sec id="sec-6-4">
        <title>5.4. Conclusion</title>
        <p>Validated across three case studies, our framework demonstrated significant accuracy and
efficiency, substantially reducing the time, effort, human error, and subjective bias associated
with manual measurements. This leads to more reliable and objective estimates of software size.</p>
        <p>Our framework is language-agnostic, leveraging LLMs like GPT-4, which are trained on
extensive multilingual datasets from the internet. This capability allows it to process texts in
multiple languages, such as English, Spanish, French, and more, enhancing its adaptability
for various software development projects regardless of the language used in requirements
specifications.</p>
        <p>Finally, our LLM- and rule-based framework offers high customizability. In fact, in addition to the COSMIC Function Point method, it could be adapted to support various FSM methods, including IFPUG and SiFP (Simple Function Point), as well as approximation methods like E&amp;QFP (Early and Quick Function Point). Indeed, initial experiments have been carried out on some examples for SiFP measurement and have shown a promising level of accuracy.</p>
      </sec>
      <sec id="sec-6-5">
        <title>5.5. Threats to validity</title>
        <p>In our study, the initial validation of the framework involved experiments across three distinct
case studies. While these provided valuable initial insights, the limited number of case studies
is generally insufficient for robust scientific validation. This limitation is particularly significant
given the broad spectrum of software types and complexities, such as Artificial Intelligence (AI)
software, which exhibit unique behaviors and requirements not covered by a small,
homogeneous set of case studies. Such limitations in scale and complexity raise concerns about the
external validity of our findings, questioning the generalizability of the results across varied
software types and more complex system architectures.</p>
        <p>Our research has primarily focused on measuring the functional size of new software development projects, not tackling the distinct challenges posed by software evolution or maintenance stages. This limitation represents a significant gap, as these project types involve unique complexities and lifecycle considerations that could crucially influence both the applicability and the effectiveness of our proposed framework. Addressing this oversight will be crucial for enhancing our framework's relevance and utility across different software project types.</p>
        <p>Furthermore, the internal validity of our study faces challenges due to potential
misconfigurations or biases in setting hyperparameters like sampling temperature, which could significantly
distort outcomes. Construct validity is also critical, as relying solely on F1-scores and error
rates may not fully capture the effectiveness of our framework. Consistency and clarity in the
operational definitions of measured constructs are essential to ensure an accurate reflection of the
framework's capabilities.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Conclusion and future work</title>
      <p>In summary, our exploration into the integration of Large Language Models (LLMs) like GPT-4
with rule-based systems for automating COSMIC Function Point Measurement has shown
promising results. Our hybrid framework effectively leverages the contextual understanding of
LLMs alongside the precision of rule-based engines, enhancing the robustness and accuracy of
software measurement in diverse settings. Experiments conducted across three case studies
yielded an F1-score of 0.95 in identifying COSMIC key components and 0.97 in deducing data
movements, with a global error rate of 6.6% in COSMIC function point measurements.</p>
      <p>However, the limited number of case studies presents a significant threat to the validity
of our findings, highlighting the need for broader testing across various software types and
complexities to ensure the generalizability and applicability of our framework.</p>
      <p>Future development should focus on enhancing the framework’s performance in real-time
software contexts. This can be achieved by creating more specialized prompts and rules tailored
to the specific vocabulary and operational characteristics of real-time systems, enabling the
framework to deliver more precise and contextually relevant measurements. A further
improvement involves enriching prompts with examples from real-time applications, which would
help LLMs better understand and process the unique requirements and scenarios of such
environments. Insights for these enhancements can be drawn from studies like those by [32, 33],
which offer valuable perspectives on automating COSMIC measurement for real-time embedded
software.</p>
      <p>Experiments with our framework reveal a slightly lower performance in identifying data
groups, suggesting that GPT-4 lacks the necessary knowledge to efficiently extract data
models from requirements. To enhance the LLM's specialization in this task, fine-tuning may
be necessary. This process aims to improve the model's performance and accuracy for specific
tasks by training it on a domain-specific dataset [22]. Future work will involve developing a
dataset specifically for data model or UML extraction from NL requirements. This dataset will
be used to fine-tune the LLM, enhancing its ability to accurately analyze and generate data
models from requirements.</p>
      <p>For this study, we utilized a non-open-source GPT-4 model through the OpenAI API, which is a cloud-based service. Hence, it is crucial to consider the privacy implications associated with using this service. To address these concerns, our ongoing research includes exploring and implementing alternative open-source LLMs that can be deployed on private servers, such as Llama 2 [34] and Llama 3 [35].</p>
      <p>[16] OpenAI, GPT-4 Technical Report, 2023. URL: http://arxiv.org/abs/2303.08774. doi:10.48550/arXiv.2303.08774, arXiv:2303.08774 [cs].
[17] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, N. Fiedel, PaLM: Scaling Language Modeling with Pathways, 2022. URL: http://arxiv.org/abs/2204.02311. doi:10.48550/arXiv.2204.02311, arXiv:2204.02311 [cs].
[18] L. Zhao, W. Alhoshan, A. Ferrari, K. Letsholo, M. Ajagbe, R. Batista-Navarro, E.-V. Chioasca, Natural Language Processing (NLP) for Requirements Engineering: A Systematic Mapping Study, 2020. URL: https://arxiv.org/abs/2004.01099. doi:10.1145/3444689.
[19] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, 2022. URL: http://arxiv.org/abs/2203.02155. doi:10.48550/arXiv.2203.02155, arXiv:2203.02155 [cs].
[20] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019. URL: http://arxiv.org/abs/1810.04805. doi:10.48550/arXiv.1810.04805, arXiv:1810.04805 [cs].
[21] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, W. Fedus, Emergent Abilities of Large Language Models, 2022. URL: http://arxiv.org/abs/2206.07682. doi:10.48550/arXiv.2206.07682, arXiv:2206.07682 [cs].
[22] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned Language Models Are Zero-Shot Learners, 2022. URL: http://arxiv.org/abs/2109.01652. doi:10.48550/arXiv.2109.01652, arXiv:2109.01652 [cs].
[23] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2023. URL: http://arxiv.org/abs/2201.11903. doi:10.48550/arXiv.2201.11903, arXiv:2201.11903 [cs].
[24] Y. Shen, K. Song, X. Tan, D. Li, W. Lu, Y. Zhuang, HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace, 2023. URL: http://arxiv.org/abs/2303.17580. doi:10.48550/arXiv.2303.17580, arXiv:2303.17580 [cs].
[25] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chhablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. Bers, S. Biderman, L. Gao, T. Wolf, A. M. Rush, Multitask Prompted Training Enables Zero-Shot Task Generalization, 2022. URL: http://arxiv.org/abs/2110.08207. doi:10.48550/arXiv.2110.08207, arXiv:2110.08207 [cs].
[26] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J.-Y. Nie, J.-R. Wen, A Survey of Large Language Models, 2023. URL: http://arxiv.org/abs/2303.18223. doi:10.48550/arXiv.2303.18223, arXiv:2303.18223 [cs].
[27] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. Le, E. Chi, Least-to-Most Prompting Enables Complex Reasoning in Large Language Models, 2023. URL: http://arxiv.org/abs/2205.10625. doi:10.48550/arXiv.2205.10625, arXiv:2205.10625 [cs].
[28] C. Wang, S. X. Liu, A. H. Awadallah, Cost-Effective Hyperparameter Optimization for Large Language Model Generation Inference, 2023. URL: http://arxiv.org/abs/2303.04673. doi:10.48550/arXiv.2303.04673, arXiv:2303.04673 [cs].
[29] M. Renze, E. Guven, The Effect of Sampling Temperature on Problem Solving in Large Language Models, 2024. URL: http://arxiv.org/abs/2402.05201. doi:10.48550/arXiv.2402.05201, arXiv:2402.05201 [cs].
[30] COSMIC community, COSMIC website, https://cosmic-sizing.org/cosmic-publications/overview/, 2024. Accessed: 2024-07-11.
[31] S. Laqrichi, Cosmic_Case_studies_measured_requirements, 2024. URL: https://github.com/slaqrichi/COSMIC-case-studies-dataset.
[32] S. Salem, H. Soubra, Functional Size Measurement Automation for IoT Edge Devices, Rome, Italy, 2023.
[33] H. Soubra, A. Abran, Functional Size Measurement for the Internet of Things (IoT): An example using COSMIC and the Arduino open-source platform, 2017. doi:10.1145/3143434.3143452.
[34] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. URL: http://arxiv.org/abs/2307.09288. doi:10.48550/arXiv.2307.09288, arXiv:2307.09288 [cs].
[35] W. Huang, X. Ma, H. Qin, X. Zheng, C. Lv, H. Chen, J. Luo, X. Qi, X. Liu, M. Magno, How Good Are Low-bit Quantized LLaMA3 Models? An Empirical Study, 2024. URL: http://arxiv.org/abs/2404.14047. doi:10.48550/arXiv.2404.14047, arXiv:2404.14047 [cs].</p>
    </sec>
  </body>
</article>