<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating Data Augmentation Queries Using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher Buss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jasmin Mosavi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikhail Tokarev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arash Termehchy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Maier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oregon State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Portland State University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Users often want to augment entities in their datasets with relevant information from external data sources. As many external sources are accessible only via keyword-search interfaces, a user usually has to manually formulate a keyword query that extracts relevant information for each entity. This is challenging as many data sources contain numerous tuples, only a small fraction of which may be relevant. Moreover, different datasets may represent the same information in distinct forms and under different terms. In such cases, it is difficult to formulate a query that precisely retrieves information relevant to a specific entity. Current methods for information enrichment mainly rely on resource-intensive manual effort to formulate queries to discover relevant information. However, it is often important for users to get initial answers quickly and without substantial investment in resources (such as human attention). Thus, as an alternative to manually writing mappings from entities to queries, one can learn these mappings progressively by leveraging end users' feedback. We evaluate the use of parameter-efficient techniques for leveraging a pretrained large language model (LLM) for this task of online query policy learning.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Integration</kwd>
        <kwd>Pre-trained Large Language Models</kwd>
        <kwd>Online Learning</kwd>
        <kwd>Query Learning</kwd>
        <kwd>Machine Learning, AI, and Databases</kwd>
        <kwd>Applied ML and AI for data management</kwd>
        <kwd>Heterogeneous and federated DBMS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW’23) — Workshop on LLMs and Databases (LLMDB’23), August 28 - September 1, 2023, Vancouver, Canada. bussch@oregonstate.edu (C. Buss); mousavij@oregonstate.edu (J. Mosavi); tokarevm@oregonstate.edu (M. Tokarev); termehca@oregonstate.edu (A. Termehchy); maier@pdx.edu (D. Maier); leestef@oregonstate.edu (S. Lee). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>Figure 1: A single interaction of online query policy learning. The user specifies a local entity (e.g., the drug Zoloft from an example local data source, FDA-Approved Uses, with attributes Brand, Drug Class, and Approved Use). The mediator featurizes the entity (Φ) and its query policy generates a keyword query (e.g., “serotonin depression panic”). The external query interface (here an example external data source, Off-Label Uses, with attributes Generic, M. Formula, Off-Label Use, and How Works, containing entries such as Sertraline, Quetiapine, and Paroxetine) returns ranked results for the generated query to the user. The user provides relevance feedback, which is used as a reward to update the query policy.</p>
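      <p>The interaction loop sketched in Figure 1 can be written compactly; the function below is an illustrative stand-in, and the policy, search, and feedback interfaces are assumptions rather than the system’s actual API:</p>
      <preformat>
```python
# One Figure-1 interaction (illustrative sketch; `policy`, `search`, and
# `feedback` are hypothetical stand-ins for the mediator's real components).
def run_interaction(entity, policy, search, feedback):
    query = policy.generate(entity)       # mediator maps local entity -> keyword query
    results = search(query)               # external interface returns ranked tuples
    reward = feedback(entity, results)    # user relevance feedback, e.g. reciprocal rank
    policy.update(entity, query, results, reward)  # refine the query policy
    return query, results, reward
```
      </preformat>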
      <p>If users add entities to their local dataset, they will need to repeat the process. Moreover, if they need information from multiple external data sources, then the work required to query for each drug is exacerbated. Furthermore, other researchers with similar information needs must repeat the same work themselves.</p>
      <p>To alleviate the burden, one can use a shared system that automates query formulation. This mediator system acts as a go-between for users and external data sources: a user specifies a local entity (e.g., Zoloft), perhaps through a query or a graphical user interface, and the mediator maps the local entity to queries that retrieve the relevant external entities (e.g., Sertraline) from their respective external sources.</p>
      <p>To the best of our knowledge, such mediators are currently created by manually writing programs that generate queries for specific external sources. These programs consist of rules that cannot necessarily be reused across data sources. Thus, they require a significant amount of labor and expert attention to build and maintain.</p>
      <p>In this paper, we learn the mediator’s query policy online through user interaction. As illustrated in Figure 1, after the user specifies a local entity, the mediator formulates a query to retrieve records from an external source according to its query policy and shows the returned external records to the user. The user then provides feedback on the relevance of the returned records to the local entity. The mediator then uses this feedback to improve its query policy.</p>
      <p>Of course, online learning of query policies has its own set of challenges. First, the mediator must learn a sufficiently effective policy in the short run so users will continue providing feedback. This challenge is easier to meet when the users’ only alternative is tiresome (i.e., manually submitting queries for many local entities) or there are many users providing feedback. Second, the mediator should continue leveraging user feedback to find increasingly effective policies in the long run (i.e., it should not be prone to under-fitting to local entities). To help overcome these challenges, we use a pretrained LLM to extract features from local entities and terms. Through pretraining, LLMs encode linguistic knowledge within the rich representations of their outputs. However, to get the most out of an LLM, its output representations should be adjusted to suit the specific task and domain. This is commonly done through finetuning, where the weights of the LLM are trained jointly with the task-specific model. However, finetuning is resource-intensive and may overwrite the LLM’s knowledge [6]. Thus, in this paper, we evaluate more parameter-efficient techniques for our online setting.</p>
      <p>Due to the widespread use of keyword query interfaces over external sources, we use an online learning method for formulating keyword queries. We evaluate prefix tuning and attribute encoding as parameter-efficient techniques for boosting the performance of an LLM-based query policy learner. We evaluate the techniques using Longformer [7] over four pairs of real-world datasets. We find the techniques may be highly effective for select datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Framework</title>
      <p>We briefly outline the problem of learning a query policy online. A more detailed discussion of the framework, challenges, and related work can be found in [8]. The mediator wraps the local dataset and the query interface over the external data source. We assume the local dataset is a single table where each tuple stores information about a distinct entity. We denote the set of local dataset entities as ℰ. Given a local entity e and external dataset D, r(e) ∈ D represents the external entity that is relevant to the local one, where the definition of relevance is domain-dependent. For example, Figure 1 shows excerpts of a local (left) and an external (right) dataset. ℰ consists of all drugs in FDA-Approved Uses. If e is Zoloft, then the relevant tuple r(e) in Off-Label Uses is Sertraline.</p>
      <p>Given a user-specified local entity e ∈ ℰ, the mediator must devise and submit a query to the interface to extract r(e). The set of queries accepted by the given interface is Q. In this work, we consider keyword queries. A keyword query q is a string comprised of terms. The number of terms in a query is its length ℓ.
 A querying policy is a mapping from local entities to queries, π : ℰ → Q. Ideally, the policy should produce queries that effectively extract external entities relevant to e. One such measure of effectiveness is reciprocal rank (RR), 1/k, where k is the position of the first relevant answer in the results. Continuing our example, given e = Zoloft, the mediator must devise a keyword query to extract r(e) = Sertraline. One can use the content of the input entity within the output query. However, terms in Brand are likely unique to the local dataset. Given this, assume the policy ignores those terms and produces the keyword query q = “serotonin depression panic”. It submits q to the query interface over the external dataset in Figure 1, which returns the ranked results (Paroxetine, Sertraline). The RR of this query would thus be 1/2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. LLM-Based Query Learning</title>
      <p>Figure 1 illustrates a single interaction of online query policy learning. The mediator’s policy is refined progressively over many interactions with the objective of maximizing the mean reciprocal rank (MRR) of its queries. As discussed in Section 1, an optimal method would overcome two major obstacles. First, it would maintain user engagement by producing effective queries in the short run. Second, it would have the capacity to improve its policy in the long run.</p>
      <p>We use a pretrained LLM to help meet the aforementioned challenges. The model may benefit from the LLM’s rich representations of tuples and terms, boosting the model’s early performance while also allowing it to fit to the diversity of local entities over time.</p>
      <p>Encoding Tuples and Scoring Terms. Given an entity e, we concatenate its terms into a single string x and pass it through an LLM after standard byte-pair-encoding tokenization. The LLM produces a sentence-contextualized representation h for each input token. Note that byte-pair encoding may break terms into multiple inputs, or terms may appear multiple times in the entity, so to produce the feature h_t corresponding to term t, the output encodings of all these instances are averaged. For convenience, we write this process as h_1, ..., h_n = LM(x).</p>
      <p>As discussed in Section 1, we desire parameter-efficient methods for adjusting the output of the LLM to our specific task and data. We consider two such methods: prefix tuning and attribute embeddings.</p>
      <p>Prefix Tuning. We use prefix tuning as an alternative to updating all weights of the LLM [9]. Before passing the base encoding of entity e (i.e., x) through the LLM, we prepend a prompt consisting of k vectors onto x. This contextualizes the output of all tokens in x on this continuous prompt. Feedback is propagated back to these k vectors, resulting in downstream representations that are aligned with our objective.</p>
      <p>Attribute Embeddings. To inject the structural information of local entity e within its downstream representation, we adjust the base encoding of e prior to passing it through the LLM [10]. Each attribute (column) within the local dataset is encoded as a vector. These vectors are then added to tokens to provide attribute information. These encodings are updated based on feedback.</p>
      <p>Selecting Queries and Updating. To encourage exploration, we apply an ε-greedy approach to query formulation [11], selecting either the next-highest-scoring term or, with probability ε, a random term, until the desired query length is achieved. User feedback (RR) is used as a prediction target for all query terms appearing in the returned external matches. Unobserved terms are assigned targets of 0. These term-entity-RR tuples are added to a first-in-first-out buffer of examples from the last 30 observed queries. We train the model by stochastic gradient descent with batches of 8 samples from the buffer at each interaction.</p>
      <p>We use a pretrained Longformer model from the Hugging Face Transformers library. Parameters are trained using PyTorch’s implementation of Adam with default hyperparameters.</p>
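      <p>The ε-greedy query formulation described above can be sketched as follows; the per-term scores stand in for the model’s reciprocal-rank predictions, and the buffer mirrors the 30-query FIFO described in the text (all interfaces are illustrative):</p>
      <preformat>
```python
import random
from collections import deque

# Epsilon-greedy query formulation (illustrative sketch): pick the
# next-highest-scoring term, or a random unused term with probability epsilon.
def build_query(term_scores, length, epsilon=0.05, rng=random):
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    query = []
    for _ in range(min(length, len(ranked))):
        unused = [t for t in ranked if t not in query]
        if rng.random() < epsilon:
            query.append(rng.choice(unused))      # explore
        else:
            query.append(unused[0])               # exploit: next-highest score
    return " ".join(query)

# First-in-first-out buffer holding examples from the last 30 observed queries.
replay_buffer = deque(maxlen=30)
```
      </preformat>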
      <p>These representations capture information about each term given the context of all terms within the entity. However, they lack contextual information about the local data source. Thus, we add this information post-encoding.</p>
      <p>We define a feature vector φ(e, t), which contains distributional and schematic features of terms relative to the local source. One such feature is inverse document frequency (IDF). Let the dataset frequency (DF) of a term denote the fraction of entities in the local dataset in which the term appears. The IDF of a term is the inverse of its DF, and it quantifies how well that term identifies the entity within the dataset. φ(e, t) is concatenated onto each corresponding representation, forming v = [φ(e, t), h_t], where [·, ·] denotes concatenation. Vector v is then passed through a small fully connected layer to predict the reciprocal rank for each term.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Evaluation</title>
      <p>Our datasets are listed in Table 1. Each one contains a local and an external source. We include the entity count and the average number of terms per entity. Each local entity has at least one relevant external entity, but some external sources have additional irrelevant entities that can appear in results. Thus, we also specify the number of relevant external entities. ChEBI is derived from sources used in the NIH project discussed in Section 1. The local source uses DrugBank data, which contains molecular information about drugs [12]. The external source uses ChEBI data, which contains molecular entities used to intervene in the processes of organisms [13]. WDC is derived from the English WDC Product corpus, containing products scraped from many sites [14]. CORD-19 contains research records related to COVID-19 [15]. We split CORD-19 into two sources: one containing abstracts (local) and one containing the remaining attributes (external). Drugs contains reviews from Drugs.com (local) [16] and descriptions of the same drugs in Wikipedia (external).</p>
      <p>Interactions. We simulate a series of interactions. Each interaction is initiated by sampling a local entity. Given the entity, the mediator generates a query of length ℓ and submits it to the external source, which returns its top-20 results using BM25. The query is then scored based on simulated feedback (i.e., ground truth).</p>
      <p>Sampling. Entity preference tends to follow a Zipf distribution, where the popularity of the k'th most popular entity is approximately proportional to 1/k^s with s ≈ 1 [17]; the most popular entity, for example, is requested approximately twice as often as the second most popular. We simulate user preference by sampling local entities from a Zipf distribution (s = 1). We randomly assign popularity, which is held constant across methods.</p>
      <p>Evaluation Metric. We compute MRR as a sliding average over the previous 500 interactions. We report the average of three runs, each comprising 2000 interactions. We plot this average against the current interaction. We include error bands around each line to show a 95% interval for standard error across runs.</p>
      <p>Hyperparameters. We treat query length as a hyperparameter and use ℓ ∈ {4, 16}. We use k = 5 prefix tokens for prefix tuning along with a moderate amount of exploration (ε = 0.05).</p>
      <p>Static IDF. To help contextualize performance, we include a naive policy for comparison. Static IDF produces queries using the top-ℓ terms in the content of e based on their IDF. As explained in Section 3, IDF quantifies term specificity within a dataset.</p>
      <p>Figure 2: Longformer and IDF comparison for queries of length 4 and 16. LLM+ uses both prefix tuning and attribute encoding, whereas LLM uses neither. (Each panel, including (c) ChEBI and (d) CORD-19, plots MRR over 2000 interactions for IDF, LLM, and LLM+ with ℓ = 4 and ℓ = 16.)</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Results</title>
      <p>We seek to understand whether prefix tuning and attribute encoding lead to more effective query policies. Figure 2 compares the LLM-based model with and without prefix tuning and attribute encoding, along with Static IDF.</p>
      <p>Our results indicate that these techniques may drastically help the model for some datasets and keyword lengths. For example, in Figure 2c, we observe LLM+ ℓ = 4 exceeding the performance of LLM ℓ = 4 by a large margin. Since ChEBI has 21 attributes in total, it may specifically benefit from the use of attribute encodings. On the other hand, we observe these techniques producing worse results on CORD-19 and Drugs. In contrast to ChEBI, the local sources for both Drugs and CORD-19 contain one long textual field with few to no other attributes. Besides the review text field, Drugs also contains drugName and condition. Terms from drugName tend to be effective, and since drugName always appears before all other attributes, the positional encodings learned by the pretrained model may be enough to help LLM identify terms originating from drugName.
</p>
      <p>Since CORD-19 contains a single abstract text field, attribute encodings should have little to no effect on performance. Thus, prefix tuning likely degraded the initial performance of LLM+ ℓ = 4 in Figure 2d. It is possible that prefix tuning requires more feedback to be effective. If this is true, then it may be possible to balance short-run and long-run performance by adjusting the number of parameters within the prefix.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] National Science Foundation, National Institutes of Health, Smart health and biomedical research in the era of artificial intelligence and advanced data science (SCH), 2021. URL: https://www.nsf.gov/pubs/2021/nsf21530/nsf21530.htm.</p>
      <p>[2] E. C. Wood, A. K. Glen, L. G. Kvarfordt, F. Womack, L. Acevedo, T. S. Yoon, C. Ma, V. Flores, M. Sinha, Y. Chodpathumwan, A. Termehchy, J. C. Roach, L. Mendoza, A. S. Hoffman, E. W. Deutsch, D. Koslicki, S. A. Ramsey, RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine, bioRxiv (2021). URL: https://www.biorxiv.org/content/early/2021/11/01/2021.10.17.464747.</p>
      <p>[3] T. T. Ashburn, K. B. Thor, Drug repositioning: identifying and developing new uses for existing drugs, Nature Reviews Drug Discovery 3 (2004) 673–683.</p>
      <p>[4] P. Wang, R. Shea, J. Wang, E. Wu, Progressive deep web crawling through keyword queries for data enrichment, in: SIGMOD, 2019, pp. 229–246.</p>
      <p>[5] X. L. Dong, D. Srivastava, Big data integration, PVLDB 6 (2013).</p>
      <p>[6] M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of Learning and Motivation, volume 24, Elsevier, 1989, pp. 109–165.</p>
      <p>[7] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).</p>
      <p>[8] C. Buss, J. Mosavi, M. Tokarev, A. Termehchy, D. Maier, S. Lee, Effective Entity Augmentation By Querying External Data Sources, Technical Report, 2023. URL: https://web.engr.oregonstate.edu/~termehca/papers/entityarg.pdf.</p>
      <p>[9] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597.</p>
      <p>[10] P. Dufter, M. Schmitt, H. Schütze, Position information in transformers: An overview, Computational Linguistics 48 (2022) 733–763.</p>
      <p>[11] A. Slivkins, Introduction to multi-armed bandits, Found. Trends Mach. Learn. 12 (2019).</p>
      <p>[12] D. Wishart, Y. Feunang, A. Guo, E. Lo, A. Marcu, J. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda, et al., DrugBank 5.0: a major update to the DrugBank database for 2018, Nucleic Acids Research, 2017.</p>
      <p>[13] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes, C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic Acids Research 44 (2016) D1214–D1219.</p>
      <p>[14] A. Primpeli, R. Peeters, C. Bizer, The WDC training dataset and gold standard for large-scale product matching, in: Companion Proceedings of The 2019 World Wide Web Conference, 2019, pp. 381–386.</p>
      <p>[15] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 Open Research Dataset, ArXiv (2020).</p>
      <p>[16] F. Gräßer, S. Kallumadi, H. Malberg, S. Zaunseder, Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning, in: International Conference on Digital Health, 2018, pp. 121–125.</p>
      <p>[17] C. Cunha, A. Bestavros, M. Crovella, Characteristics of WWW client-based traces, Technical Report, 1995.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>