<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elita Lobo</string-name>
          <email>loboelita@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oktie Hassanzadeh</string-name>
          <email>hassanzadeh@us.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhan Pham</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nandana Mihindukulasooriya</string-name>
          <email>nandana@ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dharmashankar Subramanian</string-name>
          <email>dharmash@us.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Horst Samulowitz</string-name>
          <email>samulowitz@us.ibm.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IBM Research</institution>
          ,
          <addr-line>Yorktown Heights, NY</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Massachusetts Amherst</institution>
          ,
          <addr-line>MA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1990</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Enterprises often own large collections of structured data in the form of large databases or an enterprise data lake. Such data collections come with limited metadata and strict access policies that could limit access to the data contents and, therefore, limit the application of classic retrieval and analysis solutions. As a result, there is a need for solutions that can effectively utilize the available metadata. In this paper, we study the problem of matching table metadata to a business glossary containing data labels and descriptions. The resulting matching enables the use of an available or curated business glossary for retrieval and analysis without or before requesting access to the data contents. One solution to this problem is to use manually defined rules or similarity measures on column names and glossary descriptions (or their vector embeddings) to find the closest match. However, such approaches need to be tuned through manual labeling and cannot handle many business glossaries that contain a combination of simple as well as complex and long descriptions. In this work, we leverage the power of large language models (LLMs) to design generic matching methods that do not require manual tuning and can identify complex relations between column names and glossaries. We propose methods that utilize LLMs in two ways: a) by generating additional context for column names that can aid with matching and b) by using LLMs to directly infer if there is a relation between column names and glossary descriptions. Our preliminary experimental results show the effectiveness of our proposed methods.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Large collections of structured tabular data that businesses possess can be invaluable resources
for various analytic tasks. Traditionally, such data collections are gathered in large databases
or data warehouses, along with mechanisms of collecting and maintaining metadata with
well-curated schemas, data catalogs, or master data as a part of a master data management
solution. In practice, the overhead of maintaining accurate metadata may be prohibitively
difficult and expensive. More recently, enterprises are moving toward collecting all their data
in data lakes without any requirements or strict enforcement of metadata availability or quality.
As a result, there is a need for solutions that can effectively use limited metadata, such as
column headers, and automatically generate useful metadata. Most organizations maintain
some business glossary [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with a set of concepts that are relevant to the business processes. If
table columns can be annotated with business glossary terms, it helps downstream tasks such
as data discovery, data integration, or performing advanced analytics.
      </p>
      <p>
        The task of mapping table columns to a business glossary is similar to the task of annotating
a table column with an ontology concept, which is referred to as the Column Type Annotation
(CTA) task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, to our knowledge, prior work has not considered further restricting
the task to using only the table metadata (table name and column headers) and business glossary
containing labels and descriptions only. The problem we study in this paper is inspired by our
ongoing work on implementing an automated semantic layer for enterprise data lakes [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ],
and has the following characteristics: 1) we do not have access to data contents due to access
restrictions common in enterprise data lakes; 2) we have tabular data with no metadata other
than column headers, which is a result of large data imports from highly heterogeneous sources
or automated table extraction pipelines; and 3) there is no or very little training data, as the
process of manually labeling table columns with business glossary terms is laborious and
requires domain expertise. Figure 1 shows a few example column headers along with their
context (other column headers in the same table) and their associated business glossary terms.
      </p>
      <p>
        In the absence of rich metadata and an ontology, the matching process can only rely on the
header labels and glossary labels and descriptions. Essentially, the problem becomes a string
or text similarity matching problem. Prior work has studied various flavors of such matching
methods for record matching in databases [
        <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
        ] as well as ontology alignment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Such methods
rely on either syntactic matching methods, which rely on common tokens and substrings between
the terms that should be matched, or semantic matching methods, which rely on the availability
of a dictionary of terms along with lists of related terms such as synonyms, hypernyms, and
hyponyms. In our setup, we often need to match terms with very little syntactic similarity, and
we do not have access to a dictionary that could enable semantic matching. Column headers in
tabular data are often cryptic terms, and business glossaries use terminology very specific to
a particular enterprise. More recent work has proposed the use of vector representations of
terms in the form of embeddings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]; however, such methods require domain-specific training
data and tuning.
      </p>
      <p>In this paper, we propose a novel matching solution that relies on the power of Large Language
Models (LLMs) to match table columns with glossary items when column headers
are not very descriptive, glossary terms have little syntactic similarity to the
column headers, and little or no training data is available. In what follows, we first discuss
related work. We then present the problem description, followed by the details of our
solution. In Section 5, we present the results of our experiments using real-world enterprise
data and business glossaries. We end the paper by outlining several lessons learned and avenues
for future work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Related Work</title>
      <p>
        A core problem in semantic table understanding [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is column type annotation, i.e., annotating
table columns with a type from an ontology, which enables many business intelligence tasks
such as semantic retrieval, data exploration, and knowledge discovery. The SemTab challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
which aims at benchmarking systems dealing with the tabular data to KG matching problem,
provides several datasets in which the column type annotation task can be evaluated. In the
SemTab challenge, this task is formulated as an unsupervised task where participating systems
are not given training data.
      </p>
      <p>
        Column type annotations using table data. MTab [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], JenTab [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], and DAGOBAH [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
are examples of systems that participated in the SemTab challenge, which comprises three KG
matching tasks, namely, cell to KG entity (CEA task), column to KG class (CTA task), and
column pair to KG property (CPA task). As these systems typically solve the three tasks in a
joint manner, they follow a pipeline architecture. The first step links cell mentions to entities
within the target ontology. The second step predicts the most likely type for the query column
based on the linking results. MTab and DAGOBAH also use additional information from the
graph, such as entity relations, to improve cell linking accuracy. It is a requirement for these
systems to have cell values that can be mapped to KG entities, which might not be the case in
most industry tables.
      </p>
      <p>
        Column type annotations using only metadata. The problem setup that is studied
in this paper differs from the traditional column type annotation task. In our setup, the system
that is performing the matching between the table column and glossary concepts will only
have access to the table metadata (i.e., table name and column headers) but not the actual
table data (i.e., cell values). This problem setup has similarities with the ontology matching
methods that rely purely on string similarity measures. String similarity measures have been
studied extensively for various matching tasks, including in ontology alignment [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Syntactic
similarity measures quantify how close two strings are, based either on the overlap
between their tokens or substrings or on the number of character edit
operations needed to transform one string into the other. Examples of such methods are edit similarity
and Jaro-Winkler [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. While such approaches have shown very promising performance in
various matching tasks, they are inherently not capable of differentiating between strings that
are syntactically very similar but semantically dissimilar. Classic semantic measures rely on
resources such as WordNet [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] containing related terms. The application of those methods
is limited to when such resources are available. More recently, methods that rely on vector
representations for semantic similarity have shown superior performance in various tasks.
Initial approaches relied on word2vec [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], which can handle semantic matching between words
and short phrases. More recently, word embeddings [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ] and sentence embeddings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]
have shown promising performance in semantic textual similarity tasks. As we will show in
our experiments, business glossaries often have very similar labels and descriptions that these
sentence transformer-based approaches alone cannot effectively differentiate between.
      </p>
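      <p>To illustrate the limitation of syntactic measures noted above, the following minimal Python sketch computes a normalized edit similarity from the Levenshtein distance. The function names, the normalization by the longer string's length, and the example pair are illustrative assumptions and are not taken from the cited work.</p>
      <preformat><![CDATA[
```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical strings (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical pair: nearly identical spelling, different meaning.
distance = edit_distance("revenue range", "revenue change")
similarity = edit_similarity("revenue range", "revenue change")
```
]]></preformat>
      <p>In this toy pair, two character edits separate the strings, so the similarity is high (about 0.86) even though the two terms describe different concepts.</p>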
    </sec>
    <sec id="sec-4">
      <title>3. Preliminaries</title>
      <sec id="sec-4-1">
        <title>3.1. Problem Setup</title>
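        <p>We assume a setting in which we have only superficial tabular metadata <inline-formula><tex-math>m</tex-math></inline-formula> corresponding to a
table column, together with access to a business glossary that consists of <inline-formula><tex-math>n</tex-math></inline-formula> glossary items <inline-formula><tex-math>G = \{(l_i, d_i)\}_{i=1}^{n}</tex-math></inline-formula>. Here, <inline-formula><tex-math>l_i</tex-math></inline-formula> and
<inline-formula><tex-math>d_i</tex-math></inline-formula>, for all <inline-formula><tex-math>i \in [1, n]</tex-math></inline-formula>, represent respectively the label and description of the <inline-formula><tex-math>i</tex-math></inline-formula>-th glossary item. For example,
the glossary could be a list of tuples containing labels and descriptions of various business
concepts. Given such superficial metadata <inline-formula><tex-math>m</tex-math></inline-formula>, the task is to find its closest glossary item match.</p>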
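        <p>In this paper, we consider a relaxed version of the glossary matching problem, where the task
is to select <inline-formula><tex-math>k</tex-math></inline-formula> glossary items for any given metadata <inline-formula><tex-math>m</tex-math></inline-formula> so as to maximize the probability of
a hit. The hit indicator, Hit@k, takes the value one if the selected <inline-formula><tex-math>k</tex-math></inline-formula> glossary items contain the closest match of the metadata
and zero otherwise. Finally, we also assume that we have a human feedback bank available
in the form of <inline-formula><tex-math>r</tex-math></inline-formula> tuples <inline-formula><tex-math>\mathcal{H} = \{(m_j, g_j)\}_{j=1}^{r}</tex-math></inline-formula>, where <inline-formula><tex-math>m_j</tex-math></inline-formula> represents some metadata and <inline-formula><tex-math>g_j \in G</tex-math></inline-formula>
represents the correct glossary match. We will use the human feedback bank ℋ to construct
task demonstrations for the In-Context Learning approach described in the next section.</p>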
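        <p>The Hit@k indicator described above can be sketched in Python as follows. This is a minimal illustration, assuming each method returns a ranked list of glossary labels; the function names and the toy predictions are hypothetical.</p>
        <preformat><![CDATA[
```python
from typing import Sequence


def hit_at_k(ranked_labels: Sequence[str], correct_label: str, k: int) -> int:
    """Return 1 if the correct glossary label is among the first k ranked labels, else 0."""
    return int(correct_label in ranked_labels[:k])


def mean_hit_at_k(predictions, gold, k):
    """Empirical mean of Hit@k over paired (ranked list, correct label) examples."""
    hits = [hit_at_k(ranked, correct, k) for ranked, correct in zip(predictions, gold)]
    return sum(hits) / len(hits)


# Toy ranked outputs for two columns (hypothetical labels).
predictions = [
    ["Customer Status", "Status Review", "Customer Id"],
    ["Priority Rating", "Education Level Criterion"],
]
gold = ["Status Review", "Customer Importance Rating"]

score_at_1 = mean_hit_at_k(predictions, gold, k=1)
score_at_3 = mean_hit_at_k(predictions, gold, k=3)
```
]]></preformat>
        <p>A method that returns fewer than k items is scored only on what it returned, so any missing positions count as misses.</p>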
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Large Language Models</title>
        <p>
          Recent work [
          <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21 ref22">18, 19, 20, 21, 22</xref>
          ] has demonstrated that Large Language Models (LLMs) perform
extraordinarily well on instruction-based tasks as long as these tasks can be represented in
natural language. LLMs are transformer models with billions of parameters, trained on large data
corpora and fine-tuned on instruction-based tasks, including classification tasks, generation
tasks, and question-answering tasks. An LLM takes as input a prompt containing the description
of the task along with additional context represented using natural language and outputs the
results of the task in natural language. In this work, we leverage LLMs, specifically
Flan-T5 models [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], to obtain more accurate metadata for business glossary matching. Since LLMs have
been trained on large data corpora, they can identify complex relations and patterns between
different objects in natural language and, thereby, can be used to obtain more accurate matching.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. In-Context Learning</title>
        <p>
          Fine-tuning LLMs for new tasks or datasets is often computationally expensive and requires large
amounts of data, which is often not feasible. A common way to avoid this is to append
one (one-shot) or multiple (multi-shot) demonstrations of the task in natural language to the
prompt. This is commonly known as the One-shot or Multi-shot In-Context Learning (ICL) or
In-Context Prompting method [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. Figure 2 shows an example of multi-shot in-context learning
on a classification task. In-context learning is known to have worked well for several new
problems in the prior literature. This work uses the human feedback bank to generate relevant
demonstrations for in-context learning. We conjecture that ICL, with good demonstrations, can
improve the performance of the glossary matching task without additional fine-tuning.
        </p>
        <fig>
          <label>Figure 2</label>
          <preformat>question: column name: customer numeric. , other column names: customer id , full name , customer status , customer status product , customer status reason , description , basic data incomplete , residence area id. , column description:
answer: specifies the unique identification of the customer. [SEP]
question: column name: communication status. , other column names: cost amount , employee id , customer id , communication id , planned end date , planned start date , actual start date , actual end date , communications version anchor. , column description:
answer: a term that distinguishes between communications according to the status of the lifecycle; for example, a communication may be awaiting further information, closed or open. [SEP]
question: column name: customer status. , other column names: industry code , customer tenure range , customer active tenure range , staff turnover range , legal form , staffing structure , purpose , franchised , customer status tenure range , organization customer profile id , customer market segment , location country , introduction. , column description:
answer:</preformat>
        </fig>
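        <p>As a rough illustration of how demonstrations are appended to a prompt, the sketch below assembles a multi-shot prompt in a layout loosely modeled on the Figure 2 example. The helper names and the exact separator handling are assumptions made for illustration, not the authors' implementation.</p>
        <preformat><![CDATA[
```python
def format_question(column, other_columns):
    """Render one question block (layout approximates the Figure 2 prompt)."""
    others = " , ".join(other_columns)
    return f"question: column name: {column}. , other column names: {others}. , column description:"


def build_icl_prompt(demonstrations, column, other_columns):
    """Prepend solved demonstrations to the unanswered query, each ending with [SEP]."""
    parts = []
    for demo_column, demo_others, demo_description in demonstrations:
        parts.append(format_question(demo_column, demo_others))
        parts.append(f"answer: {demo_description} [SEP]")
    # The final question is left unanswered so that the LLM completes it.
    parts.append(format_question(column, other_columns))
    parts.append("answer:")
    return "\n".join(parts)


demos = [
    ("customer numeric", ["customer id", "full name"],
     "specifies the unique identification of the customer."),
]
prompt = build_icl_prompt(demos, "customer status", ["industry code", "legal form"])
zero_shot_prompt = build_icl_prompt([], "customer status", ["industry code", "legal form"])
```
]]></preformat>
        <p>Passing an empty demonstration list yields the zero-shot prompt, so the same helper covers the 0-shot, 1-shot, and 2-shot settings.</p>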
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Methodology</title>
      <p>We propose two different classes of methods, a) Metadata Description to Glossary Matching
(MDGM) and b) Direct Metadata to Glossary Matching (DMGM), that retrieve a set of <inline-formula><tex-math>k</tex-math></inline-formula> glossary
items for any given metadata such that the set contains the glossary match with high probability. In
MDGM methods, we use LLMs to obtain a metadata description and use
the description to retrieve the <inline-formula><tex-math>k</tex-math></inline-formula> glossary items that are most similar to the given description in
some latent space. On the other hand, in the DMGM methods, we use LLMs to directly match
metadata to business glossaries. More specifically, we treat the metadata to glossary matching
problem as either a Boolean or a Multi-class classification problem. We design prompts to LLMs
to directly infer which glossary items are potential descriptions of the metadata and choose
the top-<inline-formula><tex-math>k</tex-math></inline-formula> glossary items most likely to be the description of the given metadata. Although
MDGM methods seem like an indirect approach to glossary matching, they can be useful when
the glossary constantly changes during test time, and direct inference over large glossaries is
expensive. We show in Section 5.2 that MDGM methods tend to outperform DMGM methods.</p>
      <sec id="sec-5-1">
        <title>4.1. Column Description to Glossary Matching</title>
        <p>We now describe various techniques we propose for generating descriptions of metadata for
the metadata to business glossary matching problem.</p>
        <sec id="sec-5-1-1">
          <title>4.1.1. Metadata Description Generation via Multi-Shot In-Context Learning (MDG-MICL)</title>
          <p>Since LLMs are trained on large data corpora, they can generate good descriptions of any
concept they may have seen during training. In this method, we leverage this
knowledge of the LLM to generate descriptions of the given metadata. We construct a special
prompt instructing the LLM to generate a metadata description. Further, we use ICL to improve
the quality and control the format of the description generated by the LLM. Figure 2 shows an
example of the ICL prompt used for this task with tuned Flan-T5-XL and Flan-T5-XXL models.
To construct demonstrations for ICL, we proceed as follows. Using Sentence-BERT (SBERT)
sentence embeddings [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], we first generate sentence embeddings of the metadata and all the
descriptions corresponding to the glossary items in the human feedback bank ℋ. We use the
cosine similarity metric to find the <inline-formula><tex-math>s</tex-math></inline-formula> glossary item descriptions from ℋ closest to the metadata in
the SBERT sentence embedding space. We construct demonstrations from these <inline-formula><tex-math>s</tex-math></inline-formula> glossary items
in ℋ and append them to the prompt, in response to which the LLM generates a description.
We obtain the final description by appending the table name and metadata to the LLM-generated
description. We embed this description using SBERT and obtain the top-<inline-formula><tex-math>k</tex-math></inline-formula> glossary items by
computing the cosine similarity metric between this embedding and the sentence embeddings
of all the glossary item descriptions. The procedure of computing the top-<inline-formula><tex-math>k</tex-math></inline-formula> glossary items from
the glossary set <inline-formula><tex-math>G</tex-math></inline-formula> for a given metadata description using the SBERT sentence embeddings and
cosine similarity metric is used in several subsequent methods; for the sake of brevity, we will
from here on refer to this procedure as the SBERT <inline-formula><tex-math>k</tex-math></inline-formula>-nearest neighbors method.
          </p>
          <fig>
            <label>Figure 3</label>
            <preformat>We have a database table with columns: industry code, customer tenure range, customer active tenure range, staff turnover range, legal form, staffing structure, purpose, franchised, customer status tenure range, customer status, organization customer profile id, customer market segment, customer profitability segment, revenue range, profit after tax range, amount owed range, age range, preferred payment method, debit credit status, spoken language, location country, introduction.

Answer this question, making sure that the answer is supposed by the text: Is "A term that distinguishes between Involved Party / Location Relationships according to the specific importance of the address in relation to conducting business with Involved Parties." a description of "importance level"?
yes,no</preformat>
          </fig>
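          <p>The nearest-neighbor retrieval step described above can be sketched as follows. This is a minimal sketch, assuming the description and glossary embeddings have already been computed by an SBERT model; the toy three-dimensional vectors and the function names are hypothetical stand-ins, and all vectors are assumed to be non-zero.</p>
          <preformat><![CDATA[
```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_glossary_items(query_vec, glossary_vecs, k):
    """Return the k glossary labels whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in glossary_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy vectors standing in for SBERT embeddings of glossary descriptions.
glossary_vecs = {
    "Customer Relationship Life Cycle Status": [0.9, 0.1, 0.0],
    "Status Review": [0.6, 0.6, 0.1],
    "Education Level Criterion": [0.0, 0.2, 0.9],
}
# Toy vector standing in for the embedded metadata description.
query_vec = [1.0, 0.2, 0.0]
top_two = top_k_glossary_items(query_vec, glossary_vecs, k=2)
```
]]></preformat>
          <p>With real embeddings, the same ranking would typically be computed with vectorized operations over the whole glossary rather than a Python loop.</p>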
        </sec>
        <sec id="sec-5-1-2">
          <title>4.1.2. Metadata Description Generation via Classification (MDG-Cl)</title>
          <p>As discussed in Section 3, LLMs can also identify complex relations between various concepts
in natural language. Therefore, in this method, we leverage LLMs to directly select the best
description from the set of glossary descriptions in <inline-formula><tex-math>G</tex-math></inline-formula> using a classification-based technique.</p>
          <p>Specifically, for a given metadata and glossary set <inline-formula><tex-math>G</tex-math></inline-formula>, we construct a binary classification
prompt against each glossary item in the set <inline-formula><tex-math>G</tex-math></inline-formula> that queries the LLM on whether the given
glossary item is a potential description of the metadata. Figure 3 shows an example of the
classification prompt used for this task. Note that when the glossary set <inline-formula><tex-math>G</tex-math></inline-formula> is too large, it may
result in high inference costs. We can mitigate these high costs by first shortlisting the top
<inline-formula><tex-math>k_1</tex-math></inline-formula> (<inline-formula><tex-math>k_1 &gt; k</tex-math></inline-formula>) glossary items from the glossary set <inline-formula><tex-math>G</tex-math></inline-formula> using the SBERT <inline-formula><tex-math>k</tex-math></inline-formula>-nearest neighbors method
before proceeding with the classification prompts.</p>
          <p>The final description of the metadata corresponds to the glossary description for which the
classification response was positive with the highest log probability score of the classification
task, appended to the table name and the metadata itself. If no such glossary item exists, we
use the metadata itself as the description. Once the metadata description is generated, we select
the top-<inline-formula><tex-math>k</tex-math></inline-formula> glossary items from the glossary set <inline-formula><tex-math>G</tex-math></inline-formula> using the SBERT <inline-formula><tex-math>k</tex-math></inline-formula>-nearest neighbors method described in
the MDG-MICL method.</p>
          <fig>
            <label>Figure 4</label>
            <preformat>The main column name in the table is "customer status" and the other column names are industry code, customer tenure range, customer active tenure range, residence area id.
Based on the above text, what is the best description of "customer status" from the following choices?
1. Customer Performance Status : A term that distinguishes between Customers according to the degree to which the Customer is judged to be performing or non-performing. The formula for deriving this judgment will be client-specific, but factors which could be taken into account include: the Customer Credit Risk Rating, the Customer Non-Performing Loan Status, the balance of the Customer's non-performing loans compared to the Customer Funds Under Management, the various risk-related classifications of the Customer's Finance Service Arrangements, etc.
2. Customer Relationship Life Cycle Status : A term that distinguishes between Customers according to the specific life cycle state in which the Customer relationship exists.
3. Customer Relationship Review : Identifies a Status Review for the purposes of meeting with the Customer to enhance the customer relationship.
4. Status Review : Identifies a Business Activity in which the status of an item is reviewed to determine if it is still valid; for example; confirmation of bankruptcy status of an Individual.
5. None of the above</preformat>
          </fig>
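          <p>The selection rule for the final description in MDG-Cl can be sketched as follows. This is a minimal sketch under stated assumptions: each binary classification prompt is assumed to yield a yes/no answer with a log probability, the order in which the description, table name, and metadata are concatenated is a guess, and the function name and toy inputs are hypothetical.</p>
          <preformat><![CDATA[
```python
def choose_description(metadata, table_name, responses):
    """Pick the metadata description from per-item classification responses.

    responses: list of (glossary_description, answer, log_prob) triples,
    one per binary classification prompt.
    """
    positives = [(log_prob, desc) for desc, answer, log_prob in responses if answer == "yes"]
    if not positives:
        # No glossary item was judged a match: fall back to the metadata itself.
        return metadata
    _, best_description = max(positives, key=lambda pair: pair[0])
    # Assumed concatenation order: description, then table name, then metadata.
    return f"{best_description} {table_name} {metadata}"


responses = [
    ("Identifies a Status Review.", "yes", -0.9),
    ("Life cycle state of the customer relationship.", "yes", -0.2),
    ("Level of formal training.", "no", -0.1),
]
description = choose_description("customer status", "customer", responses)
fallback = choose_description("customer status", "customer", [("Level of formal training.", "no", -0.1)])
```
]]></preformat>
          <p>The resulting string would then be embedded and passed to the nearest-neighbor retrieval step to obtain the final ranked glossary items.</p>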
        </sec>
        <sec id="sec-5-1-3">
          <title>4.1.3. Metadata Description Generation via Multiple Choice Question Answering (MDG-MCQA)</title>
          <p>An alternative way of generating descriptions is to use a Multiple Choice Question Answer
(MCQA) prompt (see Figure 4) that instructs the LLM to choose the best description of the
metadata amongst the descriptions of selected glossary items. Although this may seem
counterintuitive, we observe in our experiments that using the description of the selected glossary item
to find the top-<inline-formula><tex-math>k</tex-math></inline-formula> glossary items, instead of simply returning the glossary item corresponding to
the selected description, results in a higher Hit@5. Similar to previous methods, we shortlist the
top <inline-formula><tex-math>k_1</tex-math></inline-formula> (<inline-formula><tex-math>k_1 &gt; k</tex-math></inline-formula>) glossary items that are closest to the metadata in sentence-embedding space and
use them as choices in our Multiple Choice Question Answer prompt. We also add “None of
the above” to the list of choices in the MCQA prompt. Finally, we append the metadata and
the table name to the description of the LLM-selected glossary item and use it to find the top-<inline-formula><tex-math>k</tex-math></inline-formula>
glossary items using the SBERT <inline-formula><tex-math>k</tex-math></inline-formula>-nearest neighbors method. Note that when the LLM selects
the “None of the above” option, we use the metadata itself as the description.</p>
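          <p>A multiple-choice prompt of this kind, with the extra “None of the above” option and the fallback to the metadata, might be assembled as sketched below. The wording follows the Figure 4 example only approximately, and the helper names, the way the LLM's choice is represented as an option number, and the shortened candidate descriptions are assumptions made for illustration.</p>
          <preformat><![CDATA[
```python
def build_mcqa_prompt(column, other_columns, candidates):
    """Build a multiple-choice prompt; candidates is a list of (label, description) pairs."""
    others = ", ".join(other_columns)
    lines = [
        f'The main column name in the table is "{column}" and the other column names are {others}.',
        f'Based on the above text, what is the best description of "{column}" from the following choices?',
    ]
    for number, (label, description) in enumerate(candidates, start=1):
        lines.append(f"{number}. {label} : {description}")
    # Extra option that lets the LLM reject every shortlisted item.
    lines.append(f"{len(candidates) + 1}. None of the above")
    return "\n".join(lines)


def resolve_choice(choice_number, candidates, metadata):
    """Map the chosen option number to a description; 'None of the above' falls back to the metadata."""
    if choice_number in range(1, len(candidates) + 1):
        return candidates[choice_number - 1][1]
    return metadata


candidates = [
    ("Customer Relationship Life Cycle Status", "Life cycle state of the customer relationship."),
    ("Status Review", "A Business Activity in which the status of an item is reviewed."),
]
prompt = build_mcqa_prompt("customer status", ["industry code", "legal form"], candidates)
picked = resolve_choice(2, candidates, "customer status")
rejected = resolve_choice(3, candidates, "customer status")
```
]]></preformat>
          <p>In the MDG-MCQA variant, the returned description would still be combined with the table name and metadata and sent through the nearest-neighbor retrieval step rather than being returned directly.</p>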
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Direct Metadata to Glossary Matching</title>
        <sec id="sec-5-2-1">
          <title>4.2.1. Direct Inference via Classification (DI-Cl)</title>
          <p>This method is a variant of the MDG-Cl method that uses the LLM to directly select the top-<inline-formula><tex-math>k</tex-math></inline-formula>
glossary items for any given metadata without generating a description of the metadata. Similar
to previous methods, we shortlist the top <inline-formula><tex-math>k_1</tex-math></inline-formula> (<inline-formula><tex-math>k_1 &gt; k</tex-math></inline-formula>) glossary items closest to the metadata in the
sentence-embedding space. For each of these glossary items, we construct binary classification
prompts that query the LLM on whether the description of the glossary item matches the
metadata. An example prompt is shown in Figure 5. Among all glossary items with positive
responses, the top-<inline-formula><tex-math>k</tex-math></inline-formula> glossary items with the highest log probability scores are selected. It is
important to note that this method may return fewer than <inline-formula><tex-math>k</tex-math></inline-formula> glossary items. Such missing items
are assumed to be incorrect matches while computing the Hit@k metric.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>4.2.2. Direct Inference via Multiple Choice Question Answering (DI-MCQA)</title>
          <p>We consider another variation of the MDG-MCQA method that computes a single best match
of the given metadata without generating a description of the metadata. In this method, we
follow the same procedure as MDG-MCQA to construct Multi-Choice Classification prompts,
as shown in Figure 6, and return the glossary item corresponding to the description selected by
the LLM. Since this method always returns a single glossary item, we will assume that the <inline-formula><tex-math>k - 1</tex-math></inline-formula>
missing glossary items are incorrect matches while computing the Hit@k metric.</p>
          <fig>
            <label>Figure 5</label>
            <preformat>Is "The text value of the Alternative Identifier of the Communication." a description of the column name: "importance level"?
yes,no
Answer: no
Question: Is "The text value of the Alternative Identifier of the Communication." a description of the column name: "communication reference numeric"?
yes,no
Answer: yes
Question: Is "Identifies a Hierarchy Level that is on the bottom of the hierarchy." a description of the column name: "importance level"?
yes,no
Answer:</preformat>
          </fig>
          <fig>
            <label>Figure 6</label>
            <preformat>We have a database table with columns: industry code, customer tenure range, customer active tenure range, staff turnover range, legal form, staffing structure, purpose, franchised, customer status tenure range.
What is the correct description of "importance level"?
Choose the best answer to the above question from the following choices:
1. Customer Importance Rating : Identifies a Rating Scale that represents the degree to which a customer should be given special treatment.
2. Priority Rating : Identifies a Rating Scale that is used to represent the priority or importance of one object over another one; for example, the bank's vaults are more important that company chairs (RI priority), serving a VIP customer is more important than a daily status meeting, etc.
3. Involved Party Hierarchy Level : Indicates the level of the Involved Party in the Hierarchy.
4. Education Level Criterion : Identifies an Individual Criterion according to the amount of formal training attained by the individual.
5. None of the above</preformat>
          </fig>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Experiments</title>
      <p>In this section, we empirically evaluate and compare the performance of all the MDGM and
DMGM methods proposed in Section 4. Specifically, we investigate the following two questions:
a) which of the proposed methods, MDG-MICL (0-shot, 1-shot, 2-shot), MDG-Cl, MDG-MCQA,
DI-Cl, and DI-MCQA, best leverages LLMs in solving the metadata to glossary matching task,
and b) is it possible to obtain more accurate matching, i.e., higher Hit@5 and Hit@1, using
LLMs than using basic similarity-score-based matching methods? Our preliminary results show
that LLMs are indeed effective in improving glossary matching accuracy.</p>
      <sec id="sec-6-1">
        <title>5.1. Experimental setup</title>
        <p>
          In all our experiments, the metadata consists of the column name of interest and the other
column names in the table. The glossary is a list of tuples where each tuple consists of a label
and a description of the label. Multiple column names may be matched to the same label.
Furthermore, the label of the matched glossary item for each column name may not coincide
with the column name. We evaluate our methods on the Flan-T5-XL, Flan-T5-XXL [
          <xref ref-type="bibr" rid="ref23 ref25">25, 23</xref>
          ], and
a Flan-T5-XL model that we fine-tune on the training dataset using the supervised fine-tuning
method for LLMs [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]. For each method and LLM, we experiment with 4-6 different
prompt templates. However, due to lack of space, we only provide examples of the prompt
templates that achieved the highest Hit@5 rate (Figures 2, 3, 4, 5, and 6).
        </p>
        <p>
          We use the all-mpnet-base-v2 SBERT model from the sentence-transformers library[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]
in all experiments. We evaluate each method based on their Hit@5 and Hit@1, i.e., the
empirical mean of Hit@5 and Hit@1 computed on the test dataset. These measures were
chosen to reflect our goal of having the correct glossary item as the top or within the top
5 glossary items returned to the user on a GUI [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We compare these scores against those
produced by a baseline, which returns the top-k glossary items ranked by the cosine similarity
between the sentence embedding of the column name and those of the descriptions in the glossary.
        </p>
        <p>
          MDE Dataset. The MDE Dataset is an IBM-internal benchmark developed by annotating the
column names of the “Customer Insight” example DB of “IBM InfoSphere Warehouse Pack” [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]
with the glossary terms from the IBM Knowledge Accelerator for Financial Services (IBM KAFS) [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]
glossary. The database consists of 26 tables with 688 columns. The column names contain cryptic
codes and abbreviations to reflect realistic tables commonly seen in client engagements. IBM
KAFS contains 9,137 business terms with their labels and descriptions. Out of 688 columns,
488 have suitable matching terms in the glossary, and the rest are annotated as null mappings
and ignored in the evaluation. We split the mappings into training, test, and demonstration
splits with 208, 212, and 68 columns, respectively. These splits contain tuples of the form
(column_name, other_column_names, glossary_label). The training, test, and demonstration splits
are used, respectively, to fine-tune the LLM, to evaluate the methods, and as a proxy for human feedback.
        </p>
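        <p>
          As a minimal sketch of the evaluation described above (not the authors' implementation), the following Python snippet ranks glossary labels by cosine similarity and computes the Hit@k rate. The three-dimensional vectors, the column names, the glossary labels, and the function names are hypothetical stand-ins chosen for illustration; in the setup above, the embeddings would instead come from the SBERT model.
        </p>
        <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_labels(query_vec, glossary_vecs, k):
    """Return the k glossary labels most similar to the query embedding."""
    ranked = sorted(
        glossary_vecs,
        key=lambda label: cosine(query_vec, glossary_vecs[label]),
        reverse=True,
    )
    return ranked[:k]


def hit_at_k(predictions, gold_labels, k):
    """Fraction of columns whose gold label is among the first k predictions."""
    hits = sum(
        1 for preds, gold in zip(predictions, gold_labels) if gold in preds[:k]
    )
    return hits / len(gold_labels)


# Hypothetical toy embeddings of glossary descriptions (assumption, not real data).
glossary = {
    "Customer Identifier": [1.0, 0.0, 0.0],
    "Account Balance": [0.0, 1.0, 0.0],
    "Branch Code": [0.0, 0.0, 1.0],
}
# Hypothetical column-name embeddings paired with their gold glossary labels.
columns = [
    ("CUST_ID", [0.9, 0.1, 0.0], "Customer Identifier"),
    ("BRNCH_CD", [0.6, 0.0, 0.5], "Branch Code"),
]

predictions = [top_k_labels(vec, glossary, k=3) for _, vec, _ in columns]
gold = [label for _, _, label in columns]

hit_at_1 = hit_at_k(predictions, gold, k=1)
hit_at_2 = hit_at_k(predictions, gold, k=2)
print(hit_at_1, hit_at_2)  # 0.5 1.0
```
        </preformat>
        <p>
          In this toy example the second column is ranked correctly only within the top two items, which illustrates why Hit@1 can lag behind Hit@5 when several glossary items lie close to the query.
        </p>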
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Experimental Results</title>
        <p>
          Table 1 shows the Hit@5 and Hit@1 rates achieved by different methods on the MDE dataset
with different LLMs. As expected, the Hit@5 rate increases with the number of
demonstrations in the MDG-MICL method. This result suggests that it may be beneficial to use
in-context learning whenever demonstrations are readily available. However, we do not observe
a similar trend for the Hit@1 rate. We believe this is due to the difficulty of matching metadata
to a single glossary item when multiple glossary items have similar descriptions. Overall,
MDG-MICL achieves the highest Hit@5 and Hit@1 scores and significantly outperforms the
baseline method when two demonstrations are provided in the prompt. Meanwhile, we observe
that the DI-Cl and DI-MCQA methods achieve the worst Hit@5 rates. We conjecture that this
may be due to the underlying biases of LLMs towards certain class labels in classification and
question-answering tasks as observed in prior works [
          <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
          ]. These biases can be corrected
using various calibration techniques [
          <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
          ], which we leave for future work. It is also important
to note that the DI-MCQA method selects a single best glossary match and is thus more likely
to fail to select the correct item when multiple glossary items have similar descriptions. These
preliminary results indicate that LLMs alone may not improve the matching accuracy. Finally,
although MDG-Cl and MDG-MCQA use the same classification prompt and Multiple Choice
Question Answer prompts as DI-Cl and DI-MCQA, they achieve higher Hit@5 and Hit@1
than the latter methods. We believe that this is because these methods select glossary items
whose descriptions are similar to the closest glossary match and, thus, tend to perform better.
        </p>
        <p>[Table 1: Hit@1 and Hit@5 on the MDE dataset. Rows (methods): Baseline Method, MDG-MICL (0-shot), MDG-MICL (1-shot), MDG-MICL (2-shots), MDG-Cl, MDG-MCQA, DI-Cl, DI-MCQA. Columns: Flan-T5-XL, Flan-T5-XXL, and Tuned Flan-T5-XL, each reporting Hit@1 and Hit@5.]</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Discussion and Future work</title>
      <p>
        This paper proposes two different classes of methods, MDGM and DMGM, that leverage
LLMs to solve the metadata-to-glossary matching problem. MDGM methods use LLMs to
generate good metadata descriptions, which we couple with similarity-based metrics for more
refined matches. This class of methods is necessary when the glossary is likely to change
frequently at test time and repeated inference is expensive. The second class of methods
(DMGM) uses LLMs to infer directly which glossary items are potential matches. These
methods are helpful when the metadata is too complex and there is a significant difference
between the description of the metadata and that of its closest glossary match. Although we have
shown that many of these methods can potentially obtain more accurate glossary matching, we
can further improve them in several ways. Our experiments show that DMGM methods perform
poorly compared to MDGM methods. This may be due to the undesirable biases towards specific
class labels that LLMs learned during training. One approach to mitigating these biases is using
various calibration techniques [
        <xref ref-type="bibr" rid="ref29 ref30">29, 30</xref>
        ]. Providing several positive and negative demonstrations
in the classification and multiple-choice question-answer prompts may also help mitigate LLMs’
default biases [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. We can further improve the MDGM methods that generate descriptions of
metadata by constraining LLMs to sample words mainly from the glossary, or by providing LLMs
with the top-k glossary items and prompting them to generate descriptions similar to those of
the glossary items. The constraint can be imposed using the Constrained Beam Search algorithm [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ]
or by simply manipulating the output distribution of LLMs before sampling so that it assigns
higher weights to words from the glossary. It may also be helpful to use various prompt-tuning
and prompt-editing methods [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] to further improve the efficiency of the prompts used with
LLMs. Although these directions remain intriguing, they warrant more in-depth empirical
study, which we leave for future work.
      </p>
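      <p>
        To make the idea of reweighting the output distribution concrete, the following Python sketch adds a fixed bonus to the logits of tokens that occur in the glossary before applying a softmax. It is an illustration under stated assumptions rather than the method proposed here: the token-to-logit dictionary, the glossary vocabulary, the bonus value, and the function names are all hypothetical, and a real LLM would expose logits over its full tokenizer vocabulary.
      </p>
      <preformat>
```python
import math


def softmax(logits):
    """Convert a token-to-logit mapping into a token-to-probability mapping."""
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def boost_glossary_tokens(logits, glossary_vocab, bonus=2.0):
    """Add a fixed bonus to the logit of every token found in the glossary."""
    return {
        tok: val + (bonus if tok in glossary_vocab else 0.0)
        for tok, val in logits.items()
    }


# Hypothetical next-token logits and glossary vocabulary (assumptions).
logits = {"customer": 1.0, "client": 1.2, "banana": 0.5}
glossary_vocab = {"customer", "account"}

before = softmax(logits)
after = softmax(boost_glossary_tokens(logits, glossary_vocab))

# The most probable token shifts from a non-glossary word to a glossary word.
print(max(before, key=before.get), max(after, key=after.get))  # client customer
```
      </preformat>
      <p>
        A larger bonus pushes generation more strongly toward glossary wording; tokens outside the glossary remain possible, so this is a soft preference rather than a hard constraint such as Constrained Beam Search.
      </p>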
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          , Business Glossary for DaaS,
          <year>2015</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Efthymiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Srinivas</surname>
          </string-name>
          ,
          <article-title>SemTab 2019: Resources to benchmark tabular data to knowledge graph matching systems</article-title>
          ,
          <source>in: European Semantic Web Conference</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>514</fpage>
          -
          <lpage>530</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bagchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F. M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farkash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Glass</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gokhman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rossiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Yehoshua Sagron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tateishi</surname>
          </string-name>
          , L. Vu,
          <article-title>Unleashing the Potential of Data Lakes with Semantic Enrichment Using Foundation Models</article-title>
          ,
          <source>in: Proceedings of the ISWC 2023 Posters, Demos and Industry Tracks</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. K. I.</given-names>
            <surname>Weidele</surname>
          </string-name>
          , G. Rossiello,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bramble</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Valente</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bagchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F. M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ananthakrishnan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Strobelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Cornec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kesarwani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Samulowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          , L. Amini,
          <source>Conversational GUI for Semantic Automation Layer</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Koudas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sadoghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <article-title>Benchmarking declarative approximate selection predicates</article-title>
          ,
          <source>in: SIGMOD</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>353</fpage>
          -
          <lpage>364</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>O.</given-names>
            <surname>Hassanzadeh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sadoghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>Accuracy of approximate string joins using grams</article-title>
          ,
          <source>in: QDB</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cheatham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hitzler</surname>
          </string-name>
          ,
          <article-title>String similarity metrics for ontology alignment</article-title>
          ,
          <source>in: The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference</source>
          , Sydney, NSW, Australia, October 21-25,
          <year>2013</year>
          , Proceedings, Part II, volume
          <volume>8219</volume>
          , Springer,
          <year>2013</year>
          , pp.
          <fpage>294</fpage>
          -
          <lpage>309</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pujara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Szekely</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>From tables to knowledge: Recent advances in table understanding</article-title>
          ,
          <source>in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &amp; Data Mining</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4060</fpage>
          -
          <lpage>4061</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , I. Yamada,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kertkeidkachorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ichise</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Takeda</surname>
          </string-name>
          ,
          <article-title>SemTab 2021: Tabular data annotation with MTab tool</article-title>
          , in: SemTab at ISWC 2021,
          <year>2021</year>
          , pp.
          <fpage>92</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Abdelmageed</surname>
          </string-name>
          , S. Schindler,
          <article-title>JenTab meets SemTab 2021's new challenges</article-title>
          , in: SemTab@ISWC,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>V.-P.</given-names>
            <surname>Huynh</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chabot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Deuzé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Labbé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Monnin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Troncy</surname>
          </string-name>
          ,
          <article-title>DAGOBAH: Table and graph contexts for efficient semantic annotation of tabular data</article-title>
          , in: SemTab@ISWC,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W. E.</given-names>
            <surname>Winkler</surname>
          </string-name>
          ,
          <article-title>String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage</article-title>
          ,
          <source>in: Proceedings of the Section on Survey Research</source>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Sandkuhl</surname>
          </string-name>
          ,
          <article-title>A survey of exploiting wordnet in ontology matching</article-title>
          ,
          <source>in: Artificial Intelligence in Theory and Practice II</source>
          ,
          Springer US
          ,
          <year>2008</year>
          , pp.
          <fpage>341</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Matching ontologies with word2vec model based on cosine similarity</article-title>
          ,
          <source>in: Proceedings of the International Conference on Artificial Intelligence and Computer Vision (AICV2021)</source>
          , Springer International Publishing,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wilhelm</surname>
          </string-name>
          , W. Ammar,
          <article-title>Ontology alignment in the biomedical domain using entity definitions and context</article-title>
          ,
          <source>in: Proceedings of the BioNLP 2018 workshop</source>
          , Association for Computational Linguistics, Melbourne, Australia,
          <year>2018</year>
          , pp.
          <fpage>47</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jiménez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Antonyrajah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Augmenting ontology alignment by semantic embedding and distant supervision</article-title>
          ,
          <source>in: The Semantic Web</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>392</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Agarwal,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>ArXiv</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Muresanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Paster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pitis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Large language models are human-level prompt engineers</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seo</surname>
          </string-name>
          ,
          <article-title>Can large language models truly follow your instructions?</article-title>
          ,
          <source>in: NeurIPS ML Safety Workshop</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
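          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>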
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          , et al.,
          <article-title>The Flan collection: Designing data and methods for effective instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2301.13688</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Rethinking the role of demonstrations: What makes in-context learning work?</article-title>
          ,
          <source>in: EMNLP</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          , et al.,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <source>arXiv preprint arXiv:2210.11416</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <collab>IBM U.S. Announcement</collab>
          ,
          <source>IBM InfoSphere Warehouse Pack for Customer Insight v8.2</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <article-title>IBM knowledge accelerator for financial services</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title><italic>k</italic>NN prompting: Beyond-context learning with calibration-free nearest neighbor inference</article-title>
          ,
          <source>in: ICLR</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Poon</surname>
          </string-name>
          ,
          <article-title>LLM calibration and automatic hallucination detection via Pareto optimal self-supervision</article-title>
          ,
          <source>arXiv preprint arXiv:2306.16564</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>C.</given-names>
            <surname>Hokamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Lexically constrained decoding for sequence generation using grid beam search</article-title>
          ,
          <source>in: ACL</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1535</fpage>
          -
          <lpage>1546</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <article-title>TEMPERA: Test-time prompt editing via reinforcement learning</article-title>
          ,
          <source>in: The Eleventh International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>