<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Generating Data Augmentation Queries Using Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christopher Buss</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jasmin Mosavi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mikhail Tokarev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arash Termehchy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Maier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oregon State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Portland State University</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Users often want to augment entities in their datasets with relevant information from external data sources. As many external sources are accessible only via keyword-search interfaces, a user usually has to manually formulate a keyword query that extracts relevant information for each entity. This is challenging as many data sources contain numerous tuples, only a small fraction of which may be relevant. Moreover, different datasets may represent the same information in distinct forms and under different terms. In such cases, it is difficult to formulate a query that precisely retrieves information relevant to a specific entity. Current methods for information enrichment mainly rely on resource-intensive manual effort to formulate queries to discover relevant information. However, it is often important for users to get initial answers quickly and without substantial investment in resources (such as human attention). Thus, as an alternative to manually writing mappings from entities to queries, one can learn these mappings progressively by leveraging end users' feedback. We evaluate the use of parameter-efficient techniques for leveraging a pretrained large language model (LLM) for this task of online query policy learning.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Integration</kwd>
        <kwd>Pre-trained Large Language Models</kwd>
        <kwd>Online Learning</kwd>
        <kwd>Query Learning</kwd>
        <kwd>Machine Learning, AI, and Databases</kwd>
        <kwd>Applied ML and AI for data management</kwd>
        <kwd>Heterogeneous and federated DBMS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Joint Workshops at 49th International Conference on Very Large Data Bases (VLDBW’23) — Workshop on LLMs and Databases (LLMDB’23), August 28 - September 1, 2023, Vancouver, Canada. bussch@oregonstate.edu (C. Buss); mousavij@oregonstate.edu (J. Mosavi); tokarevm@oregonstate.edu (M. Tokarev); termehca@oregonstate.edu (A. Termehchy); maier@pdx.edu (D. Maier); leestef@oregonstate.edu (S. Lee). © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org).</p>
      <p>Figure 1: A single interaction of online query policy learning. The user specifies a local entity (e.g., the drug Zoloft from an example local data source, FDA-Approved Uses, with attributes Brand, Drug Class, and Approved Use). The mediator featurizes the entity (Φ) and its query policy generates a keyword query (e.g., “serotonin depression panic”). The external query interface (here an example external data source, Off-Label Uses, with attributes Generic, M. Formula, Off-Label Use, and How Works, containing entries such as Sertraline, Quetiapine, and Paroxetine) returns ranked results for the generated query to the user. The user provides relevance feedback, which is used as a reward to update the query policy.</p>
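      <p>The interaction loop sketched in Figure 1 can be written compactly; the function below is an illustrative stand-in, and the policy, search, and feedback interfaces are assumptions rather than the system’s actual API:</p>
      <preformat>
```python
# One Figure-1 interaction (illustrative sketch; `policy`, `search`, and
# `feedback` are hypothetical stand-ins for the mediator's real components).
def run_interaction(entity, policy, search, feedback):
    query = policy.generate(entity)       # mediator maps local entity -> keyword query
    results = search(query)               # external interface returns ranked tuples
    reward = feedback(entity, results)    # user relevance feedback, e.g. reciprocal rank
    policy.update(entity, query, results, reward)  # refine the query policy
    return query, results, reward
```
      </preformat>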
      <p>If users add entities to their local dataset, they will need to repeat the process. Moreover, if they need information from multiple external data sources, then the work required to query for each drug is exacerbated. Furthermore, other researchers with similar information needs must repeat the same work themselves.</p>
      <p>To alleviate the burden, one can use a shared system that automates query formulation. This mediator system acts as a go-between for users and external data sources: a user specifies a local entity (e.g., Zoloft), perhaps through a query or a graphical user interface, and the mediator maps the local entity to queries that retrieve the relevant external entities (e.g., Sertraline) from their respective external sources.</p>
      <p>To the best of our knowledge, such mediators are currently created by manually writing programs that generate queries for specific external sources. These programs consist of rules that cannot necessarily be reused across data sources. Thus, they require a significant amount of labor and expert attention to build and maintain.</p>
      <p>In this paper, we learn the mediator’s query policy online through user interaction. As illustrated in Figure 1, after the user specifies a local entity, the mediator formulates a query to retrieve records from an external source according to its query policy and shows the returned external records to the user. The user then provides feedback on the relevance of the returned records to the local entity. The mediator then uses this feedback to improve its query policy.</p>
      <p>Of course, online learning of query policies has its own set of challenges. First, the mediator must learn a sufficiently effective policy in the short run so users will continue providing feedback. This challenge is easier to meet when the users’ only alternative is tiresome (i.e., manually submitting queries for many local entities) or there are many users providing feedback. Second, the mediator should continue leveraging user feedback to find increasingly effective policies in the long run (i.e., it should not be prone to under-fitting to local entities). To help overcome these challenges, we use a pretrained LLM to extract features from local entities and terms. Through pretraining, LLMs encode linguistic knowledge within the rich representations of their outputs. However, to get the most out of an LLM, its output representations should be adjusted to suit the specific task and domain. This is commonly done through finetuning, where the weights of the LLM are trained jointly with the task-specific model. However, finetuning is resource-intensive and may overwrite the LLM’s knowledge [6]. Thus, in this paper, we evaluate more parameter-efficient techniques for our online setting.</p>
      <p>Due to the widespread use of keyword query interfaces over external sources, we use an online learning method for formulating keyword queries. We evaluate prefix tuning and attribute encoding as parameter-efficient techniques for boosting the performance of an LLM-based query policy learner. We evaluate the techniques using Longformer [7] over four pairs of real-world datasets. We find the techniques may be highly effective for select datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Framework</title>
      <p>We briefly outline the problem of learning a query policy online. A more detailed discussion of the framework, challenges, and related work can be found in [8]. The mediator wraps the local dataset and the query interface over the external data source. We assume the local dataset is a single table where each tuple stores information about a distinct entity. We denote the set of local dataset entities as ℰ. Given a local entity e and external dataset D, r(e) ∈ D represents the external entity that is relevant to the local one, where the definition of relevance is domain-dependent. For example, Figure 1 shows excerpts of a local (left) and an external (right) dataset. ℰ consists of all drugs in FDA-Approved Uses. If e is Zoloft, then the relevant tuple r(e) in Off-Label Uses is Sertraline.</p>
      <p>Given a user-specified local entity e ∈ ℰ, the mediator must devise and submit a query to the interface to extract r(e). The set of queries accepted by the given interface is Q. In this work, we consider keyword queries. A keyword query q is a string comprised of terms. The number of terms in a query is its length ℓ.
 A querying policy is a mapping from local entities to queries, π : ℰ → Q. Ideally, the policy should produce queries that effectively extract external entities relevant to e. One such measure of effectiveness is reciprocal rank (RR), 1/k, where k is the position of the first relevant answer in the results. Continuing our example, given e = Zoloft, the mediator must devise a keyword query to extract r(e) = Sertraline. One can use the content of the input entity within the output query. However, terms in Brand are likely unique to the local dataset. Given this, assume the policy ignores those terms and produces the keyword query q = “serotonin depression panic”. It submits q to the query interface over the external dataset in Figure 1, which returns the ranked results (Paroxetine, Sertraline). The RR of this query would thus be 1/2.</p>
    </sec>
    <sec id="sec-3">
      <title>3. LLM-Based Query Learning</title>
      <p>Figure 1 illustrates a single interaction of online query policy learning. The mediator’s policy is refined progressively over many interactions with the objective of maximizing the mean reciprocal rank (MRR) of its queries. As discussed in Section 1, an optimal method would overcome two major obstacles. First, it would maintain user engagement by producing effective queries in the short run. Second, it would have the capacity to improve its policy in the long run.</p>
      <p>We use a pretrained LLM to help meet the aforementioned challenges. The model may benefit from the LLM’s rich representations of tuples and terms, boosting the model’s early performance while also allowing it to fit to the diversity of local entities over time.</p>
      <p>Encoding Tuples and Scoring Terms. Given an entity e, we concatenate its terms into a single string x and pass it through an LLM after standard byte-pair-encoding tokenization. The LLM produces a sentence-contextualized representation h for each input token. Note that byte-pair encoding may break terms into multiple inputs, or terms may appear multiple times in the entity, so to produce the feature h_t corresponding to term t, the output encodings of all these instances are averaged. For convenience, we write this process as h_1, ..., h_n = LM(x).</p>
      <p>As discussed in Section 1, we desire parameter-efficient methods for adjusting the output of the LLM to our specific task and data. We consider two such methods: prefix tuning and attribute embeddings.</p>
      <p>Prefix Tuning. We use prefix tuning as an alternative to updating all weights of the LLM [9]. Before passing the base encoding of entity e (i.e., x) through the LLM, we prepend a prompt consisting of k vectors onto x. This contextualizes the output of all tokens in x on this continuous prompt. Feedback is propagated back to these k vectors, resulting in downstream representations that are aligned with our objective.</p>
      <p>Attribute Embeddings. To inject the structural information of local entity e within its downstream representation, we adjust the base encoding of e prior to passing it through the LLM [10]. Each attribute (column) within the local dataset is encoded as a vector. These vectors are then added to tokens to provide attribute information. These encodings are updated based on feedback.</p>
      <p>Selecting Queries and Updating. To encourage exploration, we apply an ε-greedy approach to query formulation [11], selecting either the next-highest-scoring term or, with probability ε, a random term, until the desired query length is achieved. User feedback (RR) is used as a prediction target for all query terms appearing in the returned external matches. Unobserved terms are assigned targets of 0. These term-entity-RR tuples are added to a first-in-first-out buffer of examples from the last 30 observed queries. We train the model by stochastic gradient descent with batches of 8 samples from the buffer at each interaction.</p>
      <p>We use a pretrained Longformer model from the Hugging Face Transformers library. Parameters are trained using PyTorch’s implementation of Adam with default hyperparameters.</p>
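      <p>The ε-greedy query formulation described above can be sketched as follows; the per-term scores stand in for the model’s reciprocal-rank predictions, and the buffer mirrors the 30-query FIFO described in the text (all interfaces are illustrative):</p>
      <preformat>
```python
import random
from collections import deque

# Epsilon-greedy query formulation (illustrative sketch): pick the
# next-highest-scoring term, or a random unused term with probability epsilon.
def build_query(term_scores, length, epsilon=0.05, rng=random):
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    query = []
    for _ in range(min(length, len(ranked))):
        unused = [t for t in ranked if t not in query]
        if rng.random() < epsilon:
            query.append(rng.choice(unused))      # explore
        else:
            query.append(unused[0])               # exploit: next-highest score
    return " ".join(query)

# First-in-first-out buffer holding examples from the last 30 observed queries.
replay_buffer = deque(maxlen=30)
```
      </preformat>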
      <p>These representations capture information about each term given the context of all terms within the entity. However, they lack contextual information about the local data source. Thus, we add this information post-encoding.</p>
      <p>We define a feature vector φ(e, t), which contains distributional and schematic features of terms relative to the local source. One such feature is inverse document frequency (IDF). Let the dataset frequency (DF) of a term denote the fraction of entities in the local dataset in which the term appears. The IDF of a term is the inverse of its DF, and it quantifies how well that term identifies the entity within the dataset. φ(e, t) is concatenated onto each corresponding representation, forming v = [φ(e, t), h_t], where [·, ·] denotes concatenation. Vector v is then passed through a small fully connected layer to predict the reciprocal rank for each term.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical Evaluation</title>
      <p>Our datasets are listed in Table 1. Each one contains a local and an external source. We include the entity count and the average number of terms per entity. Each local entity has at least one relevant external entity, but some external sources have additional irrelevant entities that can appear in results. Thus, we also specify the number of relevant external entities. ChEBI is derived from sources used in the NIH project discussed in Section 1. The local source uses DrugBank data, which contains molecular information about drugs [12]. The external source uses ChEBI data, which contains molecular entities used to intervene in the processes of organisms [13]. WDC is derived from the English WDC Product corpus, containing products scraped from many sites [14]. CORD-19 contains research records related to COVID-19 [15]. We split CORD-19 into two sources: one containing abstracts (local) and one containing the remaining attributes (external). Drugs contains reviews from Drugs.com (local) [16] and descriptions of the same drugs in Wikipedia (external).</p>
      <p>Interactions. We simulate a series of interactions. Each interaction is initiated by sampling a local entity. Given the entity, the mediator generates a query of length ℓ and submits it to the external source, which returns its top-20 results using BM25. The query is then scored based on simulated feedback (i.e., ground truth).</p>
      <p>Sampling. Entity preference tends to follow a Zipf distribution, where the popularity of the k'th most popular entity is approximately proportional to 1/k^s with s ≈ 1 [17]; the most popular entity, for example, is requested approximately twice as often as the second most popular. We simulate user preference by sampling local entities from a Zipf distribution (s = 1). We randomly assign popularity, which is held constant across methods.</p>
      <p>Evaluation Metric. We compute MRR as a sliding average over the previous 500 interactions. We report the average of three runs, each comprising 2000 interactions. We plot this average against the current interaction. We include error bands around each line to show a 95% interval for standard error across runs.</p>
      <p>Hyperparameters. We treat query length as a hyperparameter and use ℓ ∈ {4, 16}. We use k = 5 prefix tokens for prefix tuning along with a moderate amount of exploration (ε = 0.05).</p>
      <p>Static IDF. To help contextualize performance, we include a naive policy for comparison. Static IDF produces queries using the top-ℓ terms in the content of e based on their IDF. As explained in Section 3, IDF quantifies term specificity within a dataset.</p>
      <p>Figure 2: Longformer and IDF comparison for queries of length 4 and 16. LLM+ uses both prefix tuning and attribute encoding, whereas LLM uses neither. (Each panel, including (c) ChEBI and (d) CORD-19, plots MRR over 2000 interactions for IDF, LLM, and LLM+ with ℓ = 4 and ℓ = 16.)</p>
    </sec>
    <sec id="sec-5">
      <title>4.1. Results</title>
      <p>We seek to understand whether prefix tuning and attribute encoding lead to more effective query policies. Figure 2 compares the LLM-based model with and without prefix tuning and attribute encoding, along with Static IDF.</p>
      <p>Our results indicate that these techniques may drastically help the model for some datasets and keyword lengths. For example, in Figure 2c, we observe LLM+ ℓ = 4 exceeding the performance of LLM ℓ = 4 by a large margin. Since ChEBI has 21 attributes in total, it may specifically benefit from the use of attribute encodings. On the other hand, we observe these techniques producing worse results on CORD-19 and Drugs. In contrast to ChEBI, the local sources for both Drugs and CORD-19 contain one long textual field with few to no other attributes. Besides the review text field, Drugs also contains drugName and condition. Terms from drugName tend to be effective, and since drugName always appears before all other attributes, the positional encodings learned by the pretrained model may be enough to help LLM identify terms originating from drugName.
</p>
      <p>Since CORD-19 contains a single abstract text field, attribute encodings should have little to no effect on performance. Thus, prefix tuning likely degraded the initial performance of LLM+ ℓ = 4 in Figure 2d. It is possible that prefix tuning requires more feedback to be effective. If this is true, then it may be possible to balance short-run and long-run performance by adjusting the number of parameters within the prefix.</p>
    </sec>
    <sec id="sec-6">
      <title>References</title>
      <p>[1] National Science Foundation, National Institutes of Health, Smart health and biomedical research in the era of artificial intelligence and advanced data science (SCH), 2021. URL: https://www.nsf.gov/pubs/2021/nsf21530/nsf21530.htm.</p>
      <p>[2] E. C. Wood, A. K. Glen, L. G. Kvarfordt, F. Womack, L. Acevedo, T. S. Yoon, C. Ma, V. Flores, M. Sinha, Y. Chodpathumwan, A. Termehchy, J. C. Roach, L. Mendoza, A. S. Hoffman, E. W. Deutsch, D. Koslicki, S. A. Ramsey, RTX-KG2: a system for building a semantically standardized knowledge graph for translational biomedicine, bioRxiv (2021). URL: https://www.biorxiv.org/content/early/2021/11/01/2021.10.17.464747.</p>
      <p>[3] T. T. Ashburn, K. B. Thor, Drug repositioning: identifying and developing new uses for existing drugs, Nature Reviews Drug Discovery 3 (2004) 673–683.</p>
      <p>[4] P. Wang, R. Shea, J. Wang, E. Wu, Progressive deep web crawling through keyword queries for data enrichment, in: SIGMOD, 2019, pp. 229–246.</p>
      <p>[5] X. L. Dong, D. Srivastava, Big data integration, PVLDB 6 (2013).</p>
      <p>[6] M. McCloskey, N. J. Cohen, Catastrophic interference in connectionist networks: The sequential learning problem, in: Psychology of Learning and Motivation, volume 24, Elsevier, 1989, pp. 109–165.</p>
      <p>[7] I. Beltagy, M. E. Peters, A. Cohan, Longformer: The long-document transformer, arXiv:2004.05150 (2020).</p>
      <p>[8] C. Buss, J. Mosavi, M. Tokarev, A. Termehchy, D. Maier, S. Lee, Effective Entity Augmentation By Querying External Data Sources, Technical Report, 2023. URL: https://web.engr.oregonstate.edu/~termehca/papers/entityarg.pdf.</p>
      <p>[9] X. L. Li, P. Liang, Prefix-tuning: Optimizing continuous prompts for generation, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 4582–4597.</p>
      <p>[10] P. Dufter, M. Schmitt, H. Schütze, Position information in transformers: An overview, Computational Linguistics 48 (2022) 733–763.</p>
      <p>[11] A. Slivkins, Introduction to multi-armed bandits, Found. Trends Mach. Learn. 12 (2019).</p>
      <p>[12] D. Wishart, Y. Feunang, A. Guo, E. Lo, A. Marcu, J. Grant, T. Sajed, D. Johnson, C. Li, Z. Sayeeda, et al., DrugBank 5.0: a major update to the DrugBank database for 2018, Nucleic Acids Research, 2017.</p>
      <p>[13] J. Hastings, G. Owen, A. Dekker, M. Ennis, N. Kale, V. Muthukrishnan, S. Turner, N. Swainston, P. Mendes, C. Steinbeck, ChEBI in 2016: Improved services and an expanding collection of metabolites, Nucleic Acids Research 44 (2016) D1214–D1219.</p>
      <p>[14] A. Primpeli, R. Peeters, C. Bizer, The WDC training dataset and gold standard for large-scale product matching, in: Companion Proceedings of The 2019 World Wide Web Conference, 2019, pp. 381–386.</p>
      <p>[15] L. L. Wang, K. Lo, Y. Chandrasekhar, R. Reas, J. Yang, D. Eide, K. Funk, R. M. Kinney, Z. Liu, W. Merrill, P. Mooney, D. Murdick, D. Rishi, J. Sheehan, Z. Shen, B. Stilson, A. D. Wade, K. Wang, C. Wilhelm, B. Xie, D. A. Raymond, D. S. Weld, O. Etzioni, S. Kohlmeier, CORD-19: The COVID-19 Open Research Dataset, ArXiv (2020).</p>
      <p>[16] F. Gräßer, S. Kallumadi, H. Malberg, S. Zaunseder, Aspect-based sentiment analysis of drug reviews applying cross-domain and cross-data learning, in: International Conference on Digital Health, 2018, pp. 121–125.</p>
      <p>[17] C. Cunha, A. Bestavros, M. Crovella, Characteristics of WWW client-based traces, Technical Report, 1995.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>