The Case for Instance-Optimized LLMs in OLAP Databases

Bardia Mohammadi¹,*, Laurent Bindschaedler¹
¹ Max Planck Institute for Software Systems, Saarbrücken, Germany


Abstract
Large Language Models (LLMs) can enhance analytics systems with powerful data summarization, cleaning, and semantic transformation capabilities. However, deploying LLMs at scale – processing millions to billions of rows – remains prohibitively expensive in computation and memory. We present IOLM-DB, a novel system that makes LLM-enhanced database queries practical through query-specific model optimization. Instead of using general-purpose LLMs, IOLM-DB generates lightweight, specialized models tailored to each query's specific needs using representative data samples. Through aggressive compression techniques, including quantization, sparsification, and structural pruning, IOLM-DB reduces model footprints by up to 76% and increases throughput by up to 3.31× while maintaining accuracy. We further show how our approach enables higher parallelism on existing hardware and seamlessly supports caching and batching strategies to reduce overheads. Our prototype demonstrates that leveraging LLM queries inside analytics systems is feasible at scale, opening new possibilities for future OLAP applications.

Keywords
analytics, cube, OLAP, LLM, instance-optimization, scalability, quantization, sparsification, pruning



1. Introduction

LLMs have demonstrated exceptional capabilities in natural language understanding and generation [1, 2, 3, 4]. One promising application in data management is integrating LLM prompting into database queries. This approach is particularly useful in analytical database systems, enabling users to harness the power of LLMs directly in their queries [5, 6]. For instance, given a table of unstructured product reviews, a user could write a query like:

    SELECT product_id, user_id,
           prompt('summarize in 5 words: ' || review) AS review_summary
    FROM product_reviews;

This approach opens up new possibilities for generating, summarizing, cleaning, and transforming structured and unstructured data directly within the database.

However, applying LLM prompts to each row of data presents significant challenges. A typical query, such as a transformation, classification, or schema extraction, requires a separate LLM invocation involving tokenization, context encoding, and autoregressive decoding. This per-row inference results in high computational overhead: even simple queries can trigger millions or billions of LLM calls, leading to excessive latency and resource consumption, especially for large tables. While running local models [3, 7] instead of cloud-based models [8, 9, 2] may reduce costs and latency, the overheads typically remain significant relative to conventional database operations, and running large-scale distributed models may prove challenging due to the lack of co-located accelerators and large memory requirements. These limitations highlight the importance of developing efficient strategies to integrate LLMs into analytics workflows without compromising performance or scalability.

We propose a practical approach to mitigate these challenges: instance-optimized LLMs for databases (IOLM-DB). By tailoring models to the specific workloads and data distributions of a given query and database instance, we reduce the cost of LLM inference, making it more practical to use at scale. We find that the OLAP setup is an ideal environment for creating optimized models because it operates in a controlled setting where the workload and data are predictable. IOLM-DB combines multiple model compression techniques, including quantization (reducing numerical precision) [10], sparsification (introducing zero elements) [11], and structural pruning (removing non-essential components) [12]. The resulting models preserve task-relevant capabilities while being significantly smaller and cheaper to execute with higher parallelism, helping to narrow the performance gap for row-by-row LLM execution.

We have developed an initial prototype of IOLM-DB that targets Python's pandas library [13] for rapid iteration. By working in this simplified environment, we can experiment with various optimization strategies and gather initial performance insights before integrating these strategies into a production environment. The main objective of this paper is to provide a proof-of-concept for our approach, allowing us to evaluate potential performance gains and identify challenges that may arise when implementing and deploying instance-optimized models. Our preliminary results indicate that IOLM-DB can generate compressed models on the fly that are up to 3.28× smaller than the base model with similar or better accuracy, achieve higher parallelism on the same hardware, and increase throughput between 2.52× and 3.31× on three representative workloads.

The paper makes the following contributions:

• We propose an end-to-end system for prototyping LLM prompting in OLAP scenarios and a series of workloads to assess its performance.
• We introduce the first method for generating instance-optimized LLMs in database environments.
• We evaluate the efficiency of our method on the proposed workloads, showing significant performance improvements. These results indicate a promising first step toward demonstrating the feasibility of LLM compression for such applications.

Artifact Availability The source code and datasets used in this paper are publicly available at https://github.com/mpi-dsg/IOLM-DB.

DOLAP 2025: 27th International Workshop on Design, Optimization, Languages and Analytical Processing of Big Data, co-located with EDBT/ICDT 2025, March 25, 2025, Barcelona, Spain
* Corresponding author.
Email: bmohamma@mpi-sws.org (B. Mohammadi); bindsch@mpi-sws.org (L. Bindschaedler)
Web: https://bardia-mhd.github.io/ (B. Mohammadi); https://binds.ch (L. Bindschaedler)
ORCID: 0009-0001-3658-7291 (B. Mohammadi); 0000-0003-0559-631X (L. Bindschaedler)
© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Background and Motivation

This section briefly overviews key concepts and related work that motivate our approach.

LLM Prompting in Databases Augmenting databases with the ability to prompt LLMs can facilitate data wrangling and analysis by enabling natural language queries and transformations [6]. Rather than exporting data for external processing or implementing complex client-side integrations, bringing LLM capabilities directly into the database execution environment offers a more streamlined approach. Recent non-peer-reviewed work, such as LOTUS [14], has demonstrated the feasibility of extending relational models with LLM-powered semantic operators, enabling AI-based operations like natural language-driven sorting and aggregation. This paper considers a system that integrates LLM prompts as first-class operations within the query processing pipeline, allowing them to be composed with traditional database operations. While our focus is on OLAP systems, the principles extend to other database architectures.

New LLM-Based Capabilities LLM integration enables powerful new capabilities for database systems, especially when working with unstructured or semi-structured data. These capabilities include summarization, sentiment analysis, data extraction, error correction, and semantic transformations. LLMs can also enable more flexible and intuitive operations, such as fuzzy matching and semantic joins beyond exact string matching.

Other Approaches Existing approaches to adding these capabilities have notable limitations. Code generation techniques [15, 5], where LLMs generate executable database code, can be brittle and struggle with complex transformations that require deep semantic understanding. Alternative approaches using simpler models trained on input-output pairs [16, 17, 18] face challenges handling diverse scenarios and require extensive training data curation. While these methods can work for specific use cases, they often fail to provide the flexibility and generality needed for broad adoption in database systems.

The Need for Instance-Optimization We argue that anything short of directly invoking the LLM for such tasks is inherently limiting, as it would restrict the system's expressiveness. Therefore, efficiency is paramount to making LLMs practical at scale, especially for large-scale analytics. Many anticipated use cases require invoking the LLM at a fine-grained level, such as once per row or more frequently [6]. This granularity requires a new approach that reduces inference costs while maintaining accuracy, which motivates our approach of specializing LLMs for specific prompts and data patterns.

OLAP environments are particularly well-suited for such optimizations, as their controlled setting often allows query patterns and data characteristics to be inferred or extracted in advance, enabling model optimization tailored to these patterns. Similarly, interactive queries frequently exhibit recurring or predictable patterns that can be leveraged to create instance-optimized models, improving efficiency and accuracy as these patterns evolve over time. Although our current method may introduce too much overhead for some use cases, we expect it to be highly effective for long-running queries, where significant performance gains outweigh the upfront cost of optimization during execution.

LLM Compression Techniques Recent advances in LLM optimization provide a foundation for our approach. We leverage three key techniques from the literature: quantization, which reduces numerical precision to decrease memory requirements [19]; sparsification, which introduces strategic zero elements to minimize computational overhead [11]; and structural pruning, which removes non-essential model components [12]. While these techniques have proven effective for domain-specific optimization, our scenario presents unique challenges. We need to optimize at a much finer granularity – for individual queries or prompts rather than broad domains – and ensure consistent, predictable behavior when working with structured data.

3. System Design and Architecture

IOLM-DB is an OLAP system that integrates LLM invocation directly into its execution pipeline while ensuring that this integration is efficient, scalable, and cost-effective.

3.1. Overview

We assume a setup where the user executes queries over one or more database tables (in our prototype, pandas DataFrames). These queries can include calls to an LLM, for example, to summarize free-form text columns, transform semi-structured fields, or provide semantic annotations. Rather than invoking a general-purpose LLM repeatedly and incurring high computational and memory overhead at runtime, IOLM-DB creates a specialized, instance-optimized LLM explicitly tailored to the query and dataset. For example, a query performing text summarization on product reviews requires a different optimization than a data correction query. To achieve this specialization, the model is compressed and pruned based on the query type and data characteristics. This approach is necessary because a one-size-fits-all model is inefficient; different queries demand varying levels of understanding, precision, and computational cost.

IOLM-DB works in tandem with the underlying analytics engine. While the engine handles standard relational operations efficiently, we introduce custom operators that intercept LLM prompts in the query. These operators trigger a workflow that generates a specialized LLM for that particular query and data distribution. This process leverages a suite of techniques – quantization, sparsification, and pruning – to minimize both the memory footprint of the LLM and the inference costs, ensuring that at runtime the optimized LLM can be invoked with minimal latency and resource usage.

We drastically reduce per-row inference overhead by producing a specialized LLM for each query instance. This approach enables running LLM-based transformations and analyses at scale, supporting massive parallelism and efficient resource utilization. Ultimately, the goal is to handle large workloads and large numbers of concurrent queries while providing near-interactive response times.
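To make this concrete, below is a minimal sketch of what a per-row prompt operator over a pandas DataFrame could look like in the spirit of our prototype. The helper name prompt_column, the model identifier, and the column names are illustrative assumptions; only the vLLM generation API is taken as-is.

    # Minimal sketch of a per-row prompt operator over a pandas DataFrame.
    # The vLLM calls (LLM, SamplingParams, generate) are real APIs; the
    # helper name, model identifier, and column names are assumptions.
    import pandas as pd
    from vllm import LLM, SamplingParams

    def prompt_column(df: pd.DataFrame, column: str, instruction: str,
                      llm: LLM) -> pd.Series:
        """Apply `instruction` to each value of `column`, one prompt per row."""
        prompts = [instruction + str(text) for text in df[column]]
        params = SamplingParams(temperature=0.0, max_tokens=32)
        outputs = llm.generate(prompts, params)  # returned in input order
        return pd.Series([o.outputs[0].text.strip() for o in outputs],
                         index=df.index)

    # Usage, mirroring the SQL example from the introduction:
    reviews = pd.DataFrame({"product_id": [17], "user_id": [42],
                            "review": ["Great battery life, mediocre screen."]})
    llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
    reviews["review_summary"] = prompt_column(
        reviews, "review", "summarize in 5 words: ", llm)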
3.2. Techniques to Generate Instance-Optimized Models

Our system combines multiple compression techniques to reduce memory consumption and compute costs at the LLM level. By carefully integrating these methods, we maintain model accuracy while achieving significant resource savings. We combine the following techniques in IOLM-DB.
Quantization By lowering numerical precision (e.g., using 8-bit weights and sometimes 8-bit activations), we reduce both the GPU memory footprint and computational overhead. Quantization retains the model's core capabilities while significantly increasing inference throughput.

Sparsification Imposing structured or unstructured sparsity patterns on model weights reduces the number of active parameters, leading to fewer operations during inference and, in some cases, enabling hardware-level support for sparse computations.
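To illustrate, the sketch below shows the two techniques in isolation: loading a model with 8-bit weights through the HuggingFace transformers/bitsandbytes integration, and a naive unstructured magnitude-sparsification pass over a full-precision copy. It is a simplified stand-in for the stronger one-shot methods IOLM-DB builds on (discussed below); the model name and the 50% sparsity target are assumptions.

    # Sketch of quantization and sparsification in isolation; IOLM-DB relies
    # on stronger one-shot methods (GPTQ, SparseGPT). The model name and the
    # 50% sparsity target are illustrative assumptions.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    MODEL = "meta-llama/Llama-3.1-8B-Instruct"

    # (1) Quantization: load weights in 8-bit instead of 16-bit, roughly
    # halving the GPU memory footprint (requires a CUDA device).
    quantized = AutoModelForCausalLM.from_pretrained(
        MODEL,
        quantization_config=BitsAndBytesConfig(load_in_8bit=True),
        device_map="auto",
    )

    # (2) Sparsification: zero the smallest-magnitude fraction of each
    # linear layer's weights (unstructured sparsity) in a full-precision copy.
    @torch.no_grad()
    def magnitude_sparsify(model: torch.nn.Module, sparsity: float = 0.5) -> None:
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                w = module.weight.data
                k = max(1, int(w.numel() * sparsity))
                threshold = w.abs().flatten().float().kthvalue(k).values
                w[w.abs() <= threshold] = 0.0

    dense = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)
    magnitude_sparsify(dense, sparsity=0.5)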
Structural Pruning Removing entire components, such as attention heads or entire layers, that contribute little to the task at hand reduces model depth and complexity. Full structural pruning, aided by tools such as LLM-Pruner [20], enables the construction of ultra-compact models specialized for the given query patterns.

These techniques are not applied in isolation. Methods such as GPTQ [21] and SmoothQuant [22] combine pruning and quantization steps to preserve accuracy while aggressively reducing size. SparseGPT [11] applies pruning strategies suitable for large models in a one-shot manner, maintaining accuracy even at high sparsity levels. The result is a highly compressed model tailored to the query's distribution and the dataset's characteristics. We use calibration data – small, unlabeled samples representing the query's input domain – to fine-tune quantization parameters and pruning thresholds. This process ensures that the specialized LLM efficiently handles the target data, reducing resource requirements while minimizing accuracy loss.
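In code, the calibration step can be as simple as sampling the queried column and rendering the samples exactly as the query will prompt the model. In the sketch below, compress is a hypothetical placeholder for a GPTQ-/SparseGPT-style one-shot routine; the sample size of 128 and the prompt template are assumptions.

    # Sketch: building query-specific calibration data from the target table.
    # `compress` is a hypothetical placeholder for a GPTQ-/SparseGPT-style
    # one-shot compression routine; the sample size of 128 is an assumption.
    import pandas as pd

    def build_calibration_prompts(df: pd.DataFrame, column: str,
                                  instruction: str, n: int = 128) -> list[str]:
        """Sample rows and render them exactly as the query prompts the
        model, so compression is tuned on the query's real input domain."""
        series = df[column].dropna()
        sample = series.sample(n=min(n, len(series)), random_state=0)
        return [instruction + str(text) for text in sample]

    reviews = pd.DataFrame({"review": ["Sturdy case, flimsy hinge.",
                                       "Arrived late but works perfectly."]})
    calibration = build_calibration_prompts(
        reviews, "review", "summarize in 5 words: ")
    # specialized = compress(base_model, calibration)  # hypothetical entry point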
3.3. Runtime and Invocation Optimizations

At runtime, the optimized LLM is invoked by custom operators within the OLAP engine's execution pipeline. IOLM-DB uses the following optimizations to further reduce overhead.

Caching Intermediate results and repeated inputs are cached so that identical LLM queries need not be recomputed, which is especially valuable when data contains frequent duplicates or recurring patterns.

Batching We batch multiple requests to the LLM together to amortize invocation overhead. By grouping multiple rows or operations, we minimize switching costs and achieve higher throughput.
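A minimal way to combine these two optimizations is to deduplicate the column before invocation and let the inference engine batch the distinct prompts in a single call, as in the sketch below. The helper name and prompt template are illustrative; only vLLM's batched generate call is assumed as-is.

    # Sketch: caching by deduplication, batching via a single generate()
    # call. vLLM batches a list of prompts internally and returns outputs in
    # input order; the helper name and template are illustrative assumptions.
    import pandas as pd
    from vllm import LLM, SamplingParams

    def prompt_column_cached(df: pd.DataFrame, column: str, instruction: str,
                             llm: LLM) -> pd.Series:
        values = df[column].astype(str)
        distinct = values.drop_duplicates()      # cache keys: unique inputs
        prompts = [instruction + text for text in distinct]
        params = SamplingParams(temperature=0.0, max_tokens=32)
        outputs = llm.generate(prompts, params)  # one batched invocation
        cache = {text: out.outputs[0].text.strip()
                 for text, out in zip(distinct, outputs)}
        return values.map(cache)                 # duplicate rows hit the cache

On columns with many repeated values (e.g., categorical text), this turns per-row invocation into per-distinct-value invocation at no accuracy cost.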
Cascading (Future Work) We plan to explore cascading strategies in the future, where an initial coarse-grained LLM invocation feeds into more specialized or higher-precision models only where needed. This approach could further refine the trade-offs between speed, cost, and accuracy.

4. Workloads and Use Cases

We evaluate our approach on three representative workloads, each designed to highlight a different aspect of integrating LLM invocations into OLAP queries. All these workloads operate on real-world datasets (e.g., Amazon Reviews [23], GitHub Typos [24]) and are designed to stress different points along the performance-accuracy spectrum.

4.1. Summarization (Text Reduction)

This workload involves condensing verbose free-text fields into short summaries. An example use case is taking product reviews, extracting their essential meaning, and outputting a concise summary – e.g., summarizing each product review into five words. Such summarization helps analysts quickly gain insights from large volumes of textual data without manually sifting through lengthy entries. By operating on the unstructured Amazon Reviews dataset [23], we test the system's ability to scale and maintain accuracy when transforming large amounts of unstructured text.

4.2. Data Correction

Data correction enhances data quality by addressing errors or inconsistencies. For instance, we can provide a specific data type or category and task the system with correcting typos or mismatches. By applying a per-row invocation of the language model, we correct misspelled code or text records from GitHub for subsequent analysis [24].

4.3. Fuzzy Joins (Semantic Mapping)

Fuzzy joins address the problem of integrating data from multiple tables by understanding semantic similarity rather than relying on exact string matches. For example, the system can determine whether two entries from different datasets refer to the same entity despite slight differences in wording or formatting, even when textual fields do not perfectly align, improving data integration and discovery.
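For concreteness, per-row prompts for the three workloads could take the following shape; the wording is illustrative rather than the exact templates used in our experiments.

    # Illustrative per-row prompt templates for the three workloads; the
    # wording is an assumption, not the templates from our experiments.
    PROMPTS = {
        "summarization": "summarize in 5 words: {review}",
        "data_correction": "Correct any typos in this {dtype} value and "
                           "return only the corrected value: {value}",
        "fuzzy_join": "Do these two records refer to the same entity? "
                      "Answer yes or no.\nA: {left}\nB: {right}",
    }

    row = {"review": "Battery lasts two days but the screen scratches easily."}
    prompt = PROMPTS["summarization"].format(**row)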

5. Evaluation

We conduct a preliminary evaluation of IOLM-DB through experiments on the three workloads and associated datasets described in Section 4. Our key objective is to verify the viability of our approach for scaling LLM invocations per row in a realistic setting.

Metrics Our evaluation centers on the following metrics:

• Throughput: The number of rows processed per second by the system, which reflects its overall efficiency and scalability.
• Model Size: The size of the model, which serves as a proxy for GPU memory usage and its capacity to parallelize execution effectively. Smaller models typically reduce memory pressure and improve resource utilization.
• Accuracy: The proportion of rows where the system produces correct results. For this evaluation, we assume the baseline model achieves perfect accuracy (accuracy = 1), and we compare the optimized models by normalizing against this standard.

Models We conduct our evaluation using Meta's Llama 3.1 instruction-tuned model with 8 billion parameters (Llama-3.1-Instruct-8B) [3], which strikes a balance between size and performance for the workloads under consideration. The model is compact enough to fit on a single modern GPU yet powerful enough to deliver strong performance across all tasks. We verified our Llama model's suitability for these tasks by comparing it against OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet and manually verifying the results.

Baseline and Configurations We execute IOLM-DB using Llama-3.1-Instruct-8B, which serves as our baseline for all metrics. Building upon this foundation, we evaluate two optimized variants (Table 1):

• IOLM-DB-Perf: the instance-optimized version of Llama-3.1-Instruct-8B with the best throughput.
• IOLM-DB-Acc: the instance-optimized version of Llama-3.1-Instruct-8B with the best accuracy.
                      Workload          Model             Model Size     Accuracy Score      Throughput
                                        Baseline           14.98 GB            1              4.67 rows/s
                      Summarization     IOLM-DB-Perf        8.48 GB           0.91           15.50 rows/s
                                        IOLM-DB-Acc         8.48 GB            1             11.97 rows/s
                                        Baseline           14.98 GB            1              2.73 rows/s
                      Data Correction   IOLM-DB-Perf        8.48 GB            1              7.60 rows/s
                                        IOLM-DB-Acc         8.48 GB            1              7.60 rows/s
                                        Baseline           14.98 GB            1             14.92 rows/s
                      Fuzzy Join        IOLM-DB-Perf        8.48 GB            1             37.72 rows/s
                                        IOLM-DB-Acc         8.48 GB            1             37.72 rows/s
    Table 1
    Throughput, memory usage, and accuracy score (normalized to the baseline) for each of the three workloads considered in
    this paper. IOLM-DB-Perf corresponds to the compressed model with the highest throughput, while IOLM-DB-Acc is the
    compressed model with the highest accuracy score. We report the overall throughput in rows per second.



These two variants highlight an interesting trade-off between performance and accuracy enabled by IOLM-DB, which we aim to explore in future work: users can choose to spend more time for better results or to obtain slightly less accurate results faster.

Hardware and Configuration We run these experiments on a machine equipped with an NVIDIA Hopper H100 80 GB SXM GPU, two AMD EPYC 9654 CPUs, and 2 TB of DDR5 memory. The operating system is 64-bit Debian Linux (kernel version 5.15). We use CUDA 11.8, vLLM 0.6.3, and HuggingFace Transformers 4.46.1.

5.1. Performance, Model Size, and Accuracy

Table 1 compares the baseline with our IOLM-DB instance-optimized models. We show the overall model size as measured on the GPU, the resulting accuracy on our datasets (normalized to the baseline), and the throughput achieved in each case (in rows processed per second).

These results underscore the benefits of instance-optimization for LLMs across all workloads considered. In all cases except summarization, both IOLM-DB-Perf and IOLM-DB-Acc achieve an accuracy score of 1, demonstrating that our approach maintains baseline-level accuracy. This outcome highlights that our optimizations preserve model quality while reducing resource requirements.

Our method achieves significant reductions in model size, with compression factors of up to 76%, through quantization, sparsification, and pruning. These reductions lower memory requirements and improve computational efficiency. The throughput improvements are substantial: 2.78× for data correction and 2.52× for fuzzy join workloads. For summarization, despite a slight accuracy decrease in IOLM-DB-Perf, we observe a 3.31× throughput improvement. These performance gains stem from the compressed models and reduced model size, enabling higher parallelism.

The results demonstrate that instance-optimized LLMs can effectively balance accuracy and efficiency, making them suitable for performance-critical applications.

5.2. Discussion

Although these results are preliminary, they highlight the potential of instance-optimization to enable efficient LLM querying in analytics workloads. Our evaluation also revealed several areas for improvement that could further enhance the performance and scalability of our solution.

One key bottleneck identified in our initial prototype is the space consumed by the vLLM cache. Optimizing the cache design could further reduce memory usage and enable increased parallelism, particularly for workloads with high concurrency demands. The interface between the pandas library and our instance-optimized operators presents another area for improvement: reengineering it to reduce overhead and improve integration could significantly boost overall performance. We expect one to two orders of magnitude of performance improvement from these optimizations.

An interesting observation is that the resulting quality, particularly for IOLM-DB-Acc, often improves compared to the baseline. However, the normalization of the accuracy score prevents this improvement from being reflected in Table 1. These findings suggest that our instance-optimized approach not only preserves accuracy but can, in some cases, enhance it. Further investigation is needed to understand the underlying factors driving these improvements and to explore how they can be consistently leveraged.

Finally, we plan to explore additional techniques in the LLM compression space, such as advanced quantization methods and knowledge distillation. The time to generate an instance-optimized LLM is on the order of single-digit minutes, which we believe is acceptable for most table sizes. However, reducing it further could unlock additional use cases outside the OLAP space.

6. Conclusion

This paper presented IOLM-DB, a proof-of-concept system for integrating LLM operations directly into OLAP query pipelines. The core innovation lies in instance-optimizing LLMs based on the specific query and dataset at hand. By leveraging quantization, sparsification, and pruning techniques, IOLM-DB can generate specialized models that drastically reduce inference costs without sacrificing accuracy. Our preliminary results suggest that this approach makes row-by-row invocation of LLMs at scale both practical and efficient, bridging the gap between powerful linguistic transformations and large-scale analytics.

This line of work opens up promising research directions, encouraging the community to refine these techniques further, explore new instance-level optimizations, and ultimately bring high-performance, query-aware LLM capabilities into mainstream analytical workflows.
References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[2] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al., Gemini: A family of highly capable multimodal models, arXiv preprint arXiv:2312.11805 (2023).
[3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[4] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, et al., Sparks of artificial general intelligence: Early experiments with GPT-4, arXiv preprint arXiv:2303.12712 (2023).
[5] X. Li, T. Döhmen, Towards efficient data wrangling with LLMs using code generation, in: Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning, 2024, pp. 62–66.
[6] A. Narayan, I. Chami, L. Orr, S. Arora, C. Ré, Can foundation models wrangle your data?, arXiv preprint arXiv:2205.09911 (2022).
[7] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).
[8] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al., GPT-4 technical report, arXiv preprint arXiv:2303.08774 (2023).
[9] Anthropic, Claude, 2024. URL: https://www.anthropic.com/claude, large language model.
[10] K. Egashira, M. Vero, R. Staab, J. He, M. Vechev, Exploiting LLM quantization, in: The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL: https://openreview.net/forum?id=ISa7mMe7Vg.
[11] E. Frantar, D. Alistarh, SparseGPT: Massive language models can be accurately pruned in one-shot, arXiv abs/2301.00774 (2023). URL: https://api.semanticscholar.org/CorpusID:255372747.
[12] X. Ma, G. Fang, X. Wang, LLM-Pruner: On the structural pruning of large language models, arXiv abs/2305.11627 (2023). URL: https://api.semanticscholar.org/CorpusID:258823276.
[13] W. McKinney, et al., pandas: A foundational Python library for data analysis and statistics, Python for High Performance and Scientific Computing 14 (2011) 1–9.
[14] L. Patel, S. Jha, C. Guestrin, M. Zaharia, LOTUS: Enabling semantic queries with LLMs over tables of unstructured and structured data, arXiv preprint arXiv:2407.11418 (2024).
[15] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., Evaluating large language models trained on code, arXiv preprint arXiv:2107.03374 (2021).
[16] J. Cambronero, S. Gulwani, V. Le, D. Perelman, A. Radhakrishna, C. Simon, A. Tiwari, FlashFill++: Scaling programming by example by cutting to the chase, Proceedings of the ACM on Programming Languages 7 (2023) 952–981.
[17] Q. Chen, A. Banerjee, Ç. Demiralp, G. Durrett, I. Dillig, Data extraction via semantic regular expression synthesis, Proceedings of the ACM on Programming Languages 7 (2023) 1848–1877.
[18] Y. He, X. Chu, K. Ganjam, Y. Zheng, V. Narasayya, S. Chaudhuri, Transform-Data-by-Example (TDE): An extensible search engine for data transformations, Proceedings of the VLDB Endowment 11 (2018) 1165–1177.
[19] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, GPTQ: Accurate post-training quantization for generative pre-trained transformers, arXiv abs/2210.17323 (2022). URL: https://api.semanticscholar.org/CorpusID:253237200.
[20] X. Ma, G. Fang, X. Wang, LLM-Pruner: On the structural pruning of large language models, Advances in Neural Information Processing Systems 36 (2023) 21702–21720.
[21] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, GPTQ: Accurate post-training quantization for generative pre-trained transformers, in: The Eleventh International Conference on Learning Representations, 2023.
[22] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, S. Han, SmoothQuant: Accurate and efficient post-training quantization for large language models, in: International Conference on Machine Learning, PMLR, 2023, pp. 38087–38099.
[23] J. Ni, J. Li, J. McAuley, Justifying recommendations using distantly-labeled reviews and fine-grained aspects, in: Conference on Empirical Methods in Natural Language Processing, 2019. URL: https://api.semanticscholar.org/CorpusID:202621357.
[24] M. Hagiwara, M. Mita, GitHub Typo Corpus: A large-scale multilingual dataset of misspellings and grammatical errors, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6761–6768. URL: https://aclanthology.org/2020.lrec-1.835.