=Paper=
{{Paper
|id=Vol-3931/short3
|storemode=property
|title=The Case for Instance-Optimized LLMs in OLAP Databases
|pdfUrl=https://ceur-ws.org/Vol-3931/short3.pdf
|volume=Vol-3931
|authors=Bardia Mohammadi,Laurent Bindschaedler
|dblpUrl=https://dblp.org/rec/conf/dolap/MohammadiB25
}}
==The Case for Instance-Optimized LLMs in OLAP Databases==
Bardia Mohammadi¹,∗, Laurent Bindschaedler¹
¹Max Planck Institute for Software Systems, Saarbrücken, Germany
Abstract
Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation
capabilities. However, deploying LLMs at scale – processing millions to billions of rows – remains prohibitively expensive in computation
and memory. We present IOLM-DB, a novel system that makes LLM-enhanced database queries practical through query-specific model
optimization. Instead of using general-purpose LLMs, IOLM-DB generates lightweight, specialized models tailored to each query’s
specific needs using representative data samples. IOLM-DB reduces model footprints by up to 76% and increases throughput by up to
3.31× while maintaining accuracy through aggressive compression techniques, including quantization, sparsification, and structural
pruning. We further show how our approach enables higher parallelism on existing hardware and seamlessly supports caching and
batching strategies to reduce overheads. Our prototype demonstrates that leveraging LLM queries inside analytics systems is feasible at
scale, opening new possibilities for future OLAP applications.
Keywords
analytics, cube, OLAP, LLM, instance-optimization, scalability, quantization, sparsification, pruning
DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain.
∗ Corresponding author.
Email: bmohamma@mpi-sws.org (B. Mohammadi); bindsch@mpi-sws.org (L. Bindschaedler)
Web: https://bardia-mhd.github.io/ (B. Mohammadi); https://binds.ch (L. Bindschaedler)
ORCID: 0009-0001-3658-7291 (B. Mohammadi); 0000-0003-0559-631X (L. Bindschaedler)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

LLMs have demonstrated exceptional capabilities in natural language understanding and generation [1, 2, 3, 4]. One promising application in data management is integrating LLM prompting into database queries. This approach is particularly useful in analytical database systems, enabling users to harness the power of LLMs directly in their queries [5, 6]. For instance, given a table of unstructured product reviews, a user could write a query like:

    SELECT product_id, user_id,
           prompt('summarize in 5 words: ' || review) AS review_summary
    FROM product_reviews;

This approach opens up new possibilities for generating, summarizing, cleaning, and transforming structured and unstructured data directly within the database.

However, applying LLM prompts to each row of data presents significant challenges. A typical query, such as a transformation, classification, or schema extraction, requires a separate LLM invocation involving tokenization, context encoding, and autoregressive decoding. This per-row inference results in high computational overhead, as even simple queries can trigger millions or billions of LLM calls, leading to excessive latency and resource consumption, especially for large tables. While running local models [3, 7] instead of cloud-based models [8, 9, 2] may reduce costs and latency, the overheads typically remain significant relative to conventional database operations, and running large-scale distributed models may prove challenging due to the lack of co-located accelerators and large memory requirements. These limitations highlight the importance of developing efficient strategies to integrate LLMs into analytics workflows without compromising performance or scalability.

We propose a practical approach to mitigate these challenges: instance-optimized LLMs for databases (IOLM-DB). By tailoring models to the specific workloads and data distributions of a given query and database instance, we reduce the cost of LLM inference, making it more practical to use at scale. We find that the OLAP setup is an ideal environment for creating optimized models because it operates in a controlled setting where the workload and data are predictable. IOLM-DB combines multiple model compression techniques, including quantization (reducing numerical precision) [10], sparsification (introducing zero elements) [11], and structural pruning (removing non-essential components) [12]. The resulting models preserve task-relevant capabilities while being significantly smaller and cheaper to execute with higher parallelism, helping to narrow the performance gap for row-by-row LLM execution.

We have developed an initial prototype of IOLM-DB that targets Python's pandas library [13] for rapid iteration. By working in this simplified environment, we can experiment with various optimization strategies and gather initial performance insights before integrating these strategies into a production environment. The main objective of this paper is to provide a proof-of-concept for our approach, allowing us to evaluate potential performance gains and identify challenges that may arise when implementing and deploying instance-optimized models. Our preliminary results indicate that IOLM-DB can generate compressed models on the fly that are up to 3.28× smaller than the base model with similar or better accuracy, achieve higher parallelism on the same hardware, and increase throughput between 2.52× and 3.31× on three representative workloads.

The paper makes the following contributions:

• We propose an end-to-end system for prototyping LLM prompting in OLAP scenarios and a series of workloads to assess its performance.
• We introduce the first method for generating instance-optimized LLMs in database environments.
• We evaluate the efficiency of our method on the proposed workloads, showing significant performance improvements. These results indicate a promising first step toward demonstrating the feasibility of LLM compression for such applications.

Artifact Availability The source code and datasets used in this paper are publicly available at https://github.com/mpi-dsg/IOLM-DB.
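In the pandas prototype described later, the example query above corresponds to a per-row apply: each row triggers one model invocation. The following minimal sketch illustrates this shape; `llm_prompt` is a hypothetical stand-in that simply truncates text to five words so the example runs without a model, it is not the actual IOLM-DB API.

```python
import pandas as pd

def llm_prompt(prompt: str) -> str:
    # Stand-in for a real LLM call. A real implementation would send the
    # prompt to a model; here we keep the first five words of the payload
    # so the sketch is self-contained.
    _, _, text = prompt.partition(": ")
    return " ".join(text.split()[:5])

reviews = pd.DataFrame({
    "product_id": [1, 2],
    "user_id": [10, 20],
    "review": [
        "Battery life is excellent and the screen is gorgeous",
        "Arrived broken and support never answered my emails at all",
    ],
})

# Equivalent of: prompt('summarize in 5 words: ' || review) AS review_summary
reviews["review_summary"] = reviews["review"].apply(
    lambda r: llm_prompt("summarize in 5 words: " + r)
)
print(reviews["review_summary"].tolist())
```

Even this toy version makes the cost structure visible: one call per row, which is exactly the overhead that instance-optimization targets.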
2. Background and Motivation

This section briefly overviews key concepts and related work that motivate our approach.

LLM Prompting in Databases Augmenting databases with the ability to prompt LLMs can facilitate data wrangling and analysis by enabling natural language queries and transformations [6]. Rather than exporting data for external processing or implementing complex client-side integrations, bringing LLM capabilities directly into the database execution environment offers a more streamlined approach. Recent non-peer-reviewed work, such as LOTUS [14], has demonstrated the feasibility of extending relational models with LLM-powered semantic operators, enabling AI-based operations like natural language-driven sorting and aggregation. This paper considers a system that integrates LLM prompts as first-class operations within the query processing pipeline, allowing them to be composed with traditional database operations. While our focus is on OLAP systems, the principles extend to other database architectures.

New LLM-Based Capabilities LLM integration enables powerful new capabilities for database systems, especially when working with unstructured or semi-structured data. These capabilities include summarization, sentiment analysis, data extraction, error correction, and semantic transformations. LLMs can also enable more flexible and intuitive operations, such as fuzzy matching and semantic joins beyond exact string matching.

Other Approaches Existing approaches to adding these capabilities have notable limitations. Code generation techniques [15, 5], where LLMs generate executable database code, can be brittle and struggle with complex transformations that require deep semantic understanding. Alternative approaches using simpler models trained on input-output pairs [16, 17, 18] face challenges handling diverse scenarios and require extensive training data curation. While these methods can work for specific use cases, they often fail to provide the flexibility and generality needed for broad adoption in database systems.

The Need for Instance-Optimization We argue that anything short of directly invoking the LLM for such tasks is inherently limiting, as it would restrict the system's expressiveness. Therefore, efficiency is paramount to making LLMs practical at scale, especially for large-scale analytics. Many anticipated use cases require invoking the LLM at a fine-grained level, such as once per row or more frequently [6]. This granularity requires a new approach that reduces inference costs while maintaining accuracy, which motivates our approach of specializing LLMs for specific prompts and data patterns.

OLAP environments are particularly well-suited for such optimizations, as their controlled setting often allows query patterns and data characteristics to be inferred or extracted in advance, enabling model optimization tailored to these patterns. Similarly, interactive queries frequently exhibit recurring or predictable patterns that can be leveraged to create instance-optimized models, improving efficiency and accuracy as these patterns evolve over time. Although our current method may introduce too much overhead for some use cases, we expect it to be highly effective for long-running queries, where significant performance gains outweigh the upfront cost of optimization during execution.

LLM Compression Techniques Recent advances in LLM optimization provide a foundation for our approach. We leverage three key techniques from the literature: quantization, which reduces numerical precision to decrease memory requirements [19]; sparsification, which introduces strategic zero elements to minimize computational overhead [11]; and structural pruning, which removes non-essential model components [12]. While these techniques have proven effective for domain-specific optimization, our scenario presents unique challenges. We need to optimize at a much finer granularity – for individual queries or prompts rather than broad domains – and ensure consistent, predictable behavior when working with structured data.

3. System Design and Architecture

IOLM-DB is an OLAP system that integrates LLM invocation directly into its execution pipeline while ensuring that this integration is efficient, scalable, and cost-effective.

3.1. Overview

We assume a setup where the user executes queries over one or more database tables (in our prototype, pandas DataFrames). These queries can include calls to an LLM, for example, to summarize free-form text columns, transform semi-structured fields, or provide semantic annotations. Rather than invoking a general-purpose LLM repeatedly and incurring high computational and memory overhead at runtime, IOLM-DB creates a specialized, instance-optimized LLM explicitly tailored for the query and dataset. For example, a query performing text summarization on product reviews requires a different optimization than a data correction query. To achieve this specialization, the model is compressed and pruned based on the query type and data characteristics. This approach is necessary because a one-size-fits-all model is inefficient; different queries demand varying levels of understanding, precision, and computational cost.

IOLM-DB works in tandem with the underlying analytics engine. While the engine handles standard relational operations efficiently, we introduce custom operators that intercept LLM prompts in the query. These operators trigger a workflow that generates a specialized LLM for that particular query and data distribution. This process leverages a suite of techniques – quantization, sparsification, and pruning – to minimize both the memory footprint of the LLM and the inference costs, ensuring that at runtime the optimized LLM can be invoked with minimal latency and resource usage.

We drastically reduce per-row inference overhead by producing a specialized LLM for each query instance. This approach enables running LLM-based transformations and analyses at scale, supporting massive parallelism and efficient resource utilization. Ultimately, the goal is to handle large workloads and large numbers of concurrent queries while providing near-interactive response times.

3.2. Techniques to Generate Instance-Optimized Models

Our system combines multiple compression techniques to reduce memory consumption and compute costs at the LLM level. By carefully integrating these methods, we maintain model accuracy while achieving significant resource savings. We combine the following techniques in IOLM-DB.
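The overall shape of this per-query specialization, sample calibration rows, try progressively more aggressive compression pipelines, and keep the smallest variant that still meets an accuracy threshold on the sample, can be sketched as follows. Every function here is a hypothetical stand-in, and the compression factors and accuracy penalties are illustrative numbers, not measurements from IOLM-DB.

```python
import random

# Hypothetical stand-ins: a "model" is a dict tracking its size and an
# accuracy value degraded by each compression step (illustrative numbers).
def quantize(model):
    return {"size_gb": model["size_gb"] * 0.6, "acc": model["acc"] - 0.01}

def sparsify(model):
    return {"size_gb": model["size_gb"] * 0.85, "acc": model["acc"] - 0.02}

def prune(model):
    return {"size_gb": model["size_gb"] * 0.9, "acc": model["acc"] - 0.03}

def evaluate(model, calibration_rows):
    # Placeholder: a real system would run the candidate on the calibration
    # rows and compare its outputs against the base model's.
    return model["acc"]

def specialize(base_model, rows, acc_threshold=0.95, sample_size=64):
    """Return the smallest compressed variant whose accuracy on a small
    calibration sample stays above the threshold."""
    calibration = random.sample(rows, min(sample_size, len(rows)))
    # Candidate pipelines, from mildest to most aggressive.
    pipelines = [
        [quantize],
        [quantize, sparsify],
        [quantize, sparsify, prune],
    ]
    best = base_model
    for steps in pipelines:
        candidate = base_model
        for step in steps:
            candidate = step(candidate)
        if (evaluate(candidate, calibration) >= acc_threshold
                and candidate["size_gb"] < best["size_gb"]):
            best = candidate
    return best

base = {"size_gb": 14.98, "acc": 1.0}
rows = [f"row {i}" for i in range(1000)]
model = specialize(base, rows)
print(round(model["size_gb"], 2), round(model["acc"], 2))
```

With these illustrative numbers, the most aggressive pipeline falls below the accuracy threshold, so the quantize-plus-sparsify variant wins; the same trade-off drives the choice between the IOLM-DB-Perf and IOLM-DB-Acc variants evaluated later.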
Quantization By lowering numerical precision (e.g., using 8-bit weights and sometimes 8-bit activations), we reduce both the GPU memory footprint and the computational overhead. Quantization retains the model's core capabilities while significantly increasing inference throughput.

Sparsification Imposing structured or unstructured sparsity patterns on model weights reduces the number of active parameters, leading to fewer operations during inference and, in some cases, enabling hardware-level support for sparse computations.

Structural Pruning Removing entire components, such as attention heads or entire layers that contribute little to the task at hand, reduces model depth and complexity. Full structural pruning, aided by tools such as LLM-Pruner [20], enables the construction of ultra-compact models specialized for the given query patterns.

These techniques are not applied in isolation. Methods such as GPTQ [21] and SmoothQuant [22] combine pruning and quantization steps to preserve accuracy while aggressively reducing size. SparseGPT [11] applies pruning strategies suitable for large models in a one-shot manner, maintaining accuracy even at high sparsity levels. The result is a highly compressed model tailored to the query's distribution and the dataset's characteristics. We use calibration data – small, unlabeled samples representing the query's input domain – to fine-tune quantization parameters and pruning thresholds. This process ensures that the specialized LLM efficiently handles the target data, reducing resource requirements while minimizing accuracy loss.

3.3. Runtime and Invocation Optimizations

At runtime, the optimized LLM is invoked by custom operators within the OLAP engine's execution pipeline. IOLM-DB uses the following optimizations to further reduce overhead.

Caching Intermediate results and repeated inputs are cached so that identical LLM queries need not be recomputed, which is especially valuable when data contains frequent duplicates or recurring patterns.

Batching We batch multiple requests to the LLM together to amortize invocation overhead. By grouping multiple rows or operations, we minimize switching costs and achieve higher throughput.

Cascading (Future Work) We plan to explore cascading strategies in the future, where an initial coarse-grained LLM invocation feeds into more specialized or higher-precision models only where needed. This approach could further refine trade-offs between speed, cost, and accuracy.

4. Workloads and Use Cases

We evaluate our approach on three representative workloads, each designed to highlight a different aspect of integrating LLM invocations into OLAP queries. All these workloads operate on real-world datasets (e.g., Amazon Reviews [23], GitHub Typos [24]) and are designed to stress different points along the performance-accuracy spectrum.

4.1. Summarization (Text Reduction)

This workload involves condensing verbose free-text fields into short summaries. An example use case is taking product reviews, extracting their essential meaning, and outputting a concise summary – e.g., summarizing each product review into five words. Such summarization helps analysts quickly gain insights from large volumes of textual data without manually sifting through lengthy entries. By operating on the unstructured Amazon Reviews dataset [23], we test the system's ability to scale and maintain accuracy when transforming large amounts of unstructured text.

4.2. Data Correction

Data correction enhances data quality by addressing errors or inconsistencies. For instance, we can provide a specific data type or category and task the system with correcting typos or mismatches. By applying a per-row invocation of the language model, we can correct misspelled code or text records in GitHub for subsequent analysis [24].

4.3. Fuzzy Joins (Semantic Mapping)

Fuzzy joins address the problem of integrating data from multiple tables by understanding semantic similarity rather than relying on exact string matches. For example, the system can determine whether two entries from different datasets refer to the same entity despite slight differences in wording or formatting, even when textual fields do not perfectly align, improving data integration and discovery.

5. Evaluation

We conduct a preliminary evaluation of IOLM-DB through experiments on the three workloads and associated datasets described in Section 4. Our key objective is to verify the viability of our approach for scaling LLM invocations per row in a realistic setting.

Metrics Our evaluation centers on the following metrics:

• Throughput: The number of rows processed per second by the system, which reflects its overall efficiency and scalability.
• Model Size: The size of the model, which serves as a proxy for GPU memory usage and its capacity to parallelize execution effectively. Smaller models typically reduce memory pressure and improve resource utilization.
• Accuracy: The proportion of rows where the system produces correct results. For this evaluation, we assume the baseline model achieves perfect accuracy (accuracy = 1), and we compare the optimized models by normalizing against this standard.

Models We conduct our evaluation using Meta's instruction-tuned Llama 3.1 model with 8 billion parameters (Llama-3.1-Instruct-8B) [3], which strikes a balance between size and performance for the workloads under consideration. The model is compact enough to fit on a single modern GPU yet powerful enough to deliver strong performance across all tasks. We verified our Llama model's suitability for these tasks by comparing it against OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet and manually verifying the results.

Baseline and Configurations We execute IOLM-DB using Llama-3.1-Instruct-8B, which serves as our baseline for all metrics. Building upon this foundation, we evaluate two optimized variants:
Workload          Model          Model Size   Accuracy Score   Throughput
Summarization     Baseline       14.98 GB     1                4.67 rows/s
                  IOLM-DB-Perf   8.48 GB      0.91             15.50 rows/s
                  IOLM-DB-Acc    8.48 GB      1                11.97 rows/s
Data Correction   Baseline       14.98 GB     1                2.73 rows/s
                  IOLM-DB-Perf   8.48 GB      1                7.60 rows/s
                  IOLM-DB-Acc    8.48 GB      1                7.60 rows/s
Fuzzy Join        Baseline       14.98 GB     1                14.92 rows/s
                  IOLM-DB-Perf   8.48 GB      1                37.72 rows/s
                  IOLM-DB-Acc    8.48 GB      1                37.72 rows/s

Table 1: Throughput, memory usage, and accuracy score (normalized to the baseline) for each of the three workloads considered in this paper. IOLM-DB-Perf corresponds to the compressed model with the highest throughput, while IOLM-DB-Acc is the compressed model with the highest accuracy score. We report the overall throughput in rows per second.
• IOLM-DB-Perf: the instance-optimized version of Llama-3.1-Instruct-8B with the best throughput.
• IOLM-DB-Acc: the instance-optimized version of Llama-3.1-Instruct-8B with the best accuracy.

These two variants highlight an interesting trade-off between performance and accuracy enabled by IOLM-DB, which we aim to explore in future work, allowing the user to select whether to spend more time for better results or to produce slightly less accurate results faster.

Hardware and Configuration We run these experiments on a machine equipped with an NVIDIA Hopper H100 80 GB SXM GPU, two AMD EPYC 9654 CPUs, and 2 TB of DDR5 memory. The operating system is 64-bit Debian Linux (kernel version 5.15). We use CUDA 11.8, vLLM 0.6.3, and HuggingFace Transformers 4.46.1.

5.1. Performance, Model Size, and Accuracy

Table 1 compares the baseline with our IOLM-DB instance-optimized models. We show the overall model size as measured on the GPU, the resulting accuracy on our datasets (normalized to the baseline), and the throughput achieved in each case (in rows processed per second).

These results underscore the benefits of instance-optimization for LLMs across all workloads considered. In all cases except summarization, both IOLM-DB-Perf and IOLM-DB-Acc achieve an accuracy score of 1, demonstrating that our approach maintains baseline-level accuracy. This outcome highlights that our optimizations preserve model quality while reducing resource requirements.

Our method achieves significant reductions in model size, with compression factors of up to 76%, through quantization, sparsification, and pruning. These reductions lower memory requirements and improve computational efficiency. The throughput improvements are substantial: 2.78× for data correction and 2.52× for fuzzy join workloads. For summarization, despite a slight accuracy decrease in IOLM-DB-Perf, we observe a 3.31× throughput improvement. These performance gains stem from the compressed models and reduced model size, enabling higher parallelism.

The results demonstrate that instance-optimized LLMs can effectively balance accuracy and efficiency, making them suitable for performance-critical applications.

5.2. Discussion

Although these results are preliminary, they highlight the potential of instance-optimization to enable efficient LLM querying in analytics workloads. Our evaluation also revealed several areas for improvement that could further enhance the performance and scalability of our solution. One key bottleneck identified in our initial prototype is the space consumed by the vLLM cache. Optimizing the cache design could further reduce memory usage and enable increased parallelism, particularly for workloads with high concurrency demands. Also, the interface between the pandas library and our instance-optimized operators presents another area for improvement. Reengineering this interface to reduce overhead and improve integration could significantly boost overall performance. We expect to gain one or two orders of magnitude in performance from these optimizations.

An interesting observation is that the resulting quality, particularly for IOLM-DB-Acc, often improves compared to the baseline. However, the normalization of the accuracy score prevents this improvement from being reflected in Table 1. These findings suggest that our instance-optimized approach not only preserves accuracy but can, in some cases, enhance it. Further investigation is needed to understand the underlying factors driving these improvements and to explore how they can be consistently leveraged.

Finally, we plan to explore additional techniques in the LLM compression space, such as advanced quantization methods and knowledge distillation. The time to generate an instance-optimized LLM is on the order of single-digit minutes, which we believe is acceptable for most table sizes. However, reducing it further could unlock additional use cases outside the OLAP space.

6. Conclusion

This paper presented IOLM-DB, a proof-of-concept system for integrating LLM operations directly into OLAP query pipelines. The core innovation lies in instance-optimizing LLMs based on the specific query and dataset at hand. By leveraging quantization, sparsification, and pruning techniques, IOLM-DB can generate specialized models that drastically reduce inference costs without sacrificing accuracy. Our preliminary results suggest that this approach makes row-by-row invocation of LLMs at scale both practical and efficient, bridging the gap between powerful linguistic transformations and large-scale analytics.

This line of work opens up promising research directions, encouraging the community to refine these techniques further, explore new instance-level optimizations, and ultimately bring high-performance, query-aware LLM capabilities into mainstream analytical workflows.
References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[2] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al., Gemini: a family of highly capable multimodal models, arXiv preprint arXiv:2312.11805 (2023).
[3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., Llama: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[4] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al., Sparks of artificial general intelligence: Early experiments with GPT-4, arXiv preprint arXiv:2303.12712 (2023).
[5] X. Li, T. Döhmen, Towards efficient data wrangling with LLMs using code generation, in: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, 2024, pp. 62–66.
[6] A. Narayan, I. Chami, L. Orr, S. Arora, C. Ré, Can foundation models wrangle your data?, arXiv preprint arXiv:2205.09911 (2022).
[7] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).
[8] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[9] Anthropic, Claude, 2024. URL: https://www.anthropic.com/claude, large language model.
[10] K. Egashira, M. Vero, R. Staab, J. He, M. Vechev, Exploiting LLM quantization, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL: https://openreview.net/forum?id=ISa7mMe7Vg.
[11] E. Frantar, D. Alistarh, SparseGPT: Massive language models can be accurately pruned in one-shot, arXiv abs/2301.00774 (2023). URL: https://api.semanticscholar.org/CorpusID:255372747.
[12] X. Ma, G. Fang, X. Wang, LLM-Pruner: On the structural pruning of large language models, arXiv abs/2305.11627 (2023). URL: https://api.semanticscholar.org/CorpusID:258823276.
[13] W. McKinney, et al., pandas: a foundational Python library for data analysis and statistics, Python for High Performance and Scientific Computing 14 (2011) 1–9.
[14] L. Patel, S. Jha, C. Guestrin, M. Zaharia, LOTUS: Enabling semantic queries with LLMs over tables of unstructured and structured data, arXiv preprint arXiv:2407.11418 (2024).
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021).
[16] J. Cambronero, S. Gulwani, V. Le, D. Perelman, A. Radhakrishna, C. Simon, A. Tiwari, FlashFill++: Scaling programming by example by cutting to the chase, Proceedings of the ACM on Programming Languages 7 (2023) 952–981.
[17] Q. Chen, A. Banerjee, Ç. Demiralp, G. Durrett, I. Dillig, Data extraction via semantic regular expression synthesis, Proceedings of the ACM on Programming Languages 7 (2023) 1848–1877.
[18] Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, S. Chaudhuri, Transform-data-by-example (TDE): an extensible search engine for data transformations, Proceedings of the VLDB Endowment 11 (2018) 1165–1177.
[19] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, GPTQ: Accurate post-training quantization for generative pre-trained transformers, arXiv abs/2210.17323 (2022). URL: https://api.semanticscholar.org/CorpusID:253237200.
[20] X. Ma, G. Fang, X. Wang, LLM-Pruner: On the structural pruning of large language models, Advances in Neural Information Processing Systems 36 (2023) 21702–21720.
[21] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, GPTQ: Accurate post-training quantization for generative pre-trained transformers, in: The Eleventh International Conference on Learning Representations, 2023.
[22] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, SmoothQuant: Accurate and efficient post-training quantization for large language models, in: International Conference on Machine Learning, PMLR, 2023, pp. 38087–38099.
[23] J. Ni, J. Li, J. McAuley, Justifying recommendations using distantly-labeled reviews and fine-grained aspects, in: Conference on Empirical Methods in Natural Language Processing, 2019. URL: https://api.semanticscholar.org/CorpusID:202621357.
[24] M. Hagiwara, M. Mita, GitHub typo corpus: A large-scale multilingual dataset of misspellings and grammatical errors, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6761–6768. URL: https://aclanthology.org/2020.lrec-1.835.