The Impact of Quantization on Retrieval-Augmented Generation: An Analysis of Small LLMs

Mert Yazan¹,²,*, Suzan Verberne² and Frederik Situmeang¹
¹ Amsterdam University of Applied Sciences, Fraijlemaborg 133, 1102 CV Amsterdam, Netherlands
² Leiden University, Einsteinweg 55, 2333 CC Leiden, Netherlands

IR-RAG@SIGIR'24: The 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 14–18, 2024, Washington D.C., USA
* Corresponding author.
m.yazan@hva.nl (M. Yazan); s.verberne@liacs.leidenuniv.nl (S. Verberne); f.b.i.situmeang@uva.nl (F. Situmeang)
ORCID: 0009-0004-3866-597X (M. Yazan); 0000-0002-9609-9505 (S. Verberne); 0000-0002-2156-2083 (F. Situmeang)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
Post-training quantization reduces the computational demand of Large Language Models (LLMs) but can weaken some of their capabilities. Since LLM abilities emerge with scale, smaller LLMs are more sensitive to quantization. In this paper, we explore how quantization affects smaller LLMs' ability to perform retrieval-augmented generation (RAG), specifically in longer contexts. We chose personalization for evaluation because it is a challenging domain for RAG, requiring long-context reasoning over multiple documents. We compare the original FP16 and the quantized INT4 performance of multiple 7B and 8B LLMs on two tasks while progressively increasing the number of retrieved documents to test how quantized models fare against longer contexts. To better understand the effect of retrieval, we evaluate three retrieval models in our experiments. Our findings reveal that if a 7B LLM performs the task well, quantization does not impair its performance or long-context reasoning capabilities. We conclude that it is possible to utilize RAG with quantized smaller LLMs.

Keywords
Retrieval-Augmented Generation, Quantization, Efficiency, Large Language Models, Personalization

1. Introduction

Large Language Model (LLM) outputs can be enhanced by fetching relevant documents via a retriever and adding them as context for the prompt. With the added context, the LLM can generate an output grounded in relevant information. This process is called Retrieval-Augmented Generation (RAG). RAG has many benefits, such as improving effectiveness in downstream tasks [1, 2, 3, 4], reducing hallucinations [5], increasing factuality [6], bypassing knowledge cut-offs, and presenting proprietary data that is not available to the LLMs.

The performance of RAG depends on the number, quality, and relevance of the retrieved documents [7]. To perform RAG, many tasks demand a large number of passages extracted from multiple, unstructured documents. For question-answering tasks, the answer might be scattered across many documents because of ambiguity or the time-series nature of the question (e.g., the price change of a stock). For more open-ended tasks like personalization, many documents from different sources might be needed to capture the characteristics of the individual. Therefore, to handle RAG in these tasks, an LLM needs to look at multiple sources, identify the relevant parts, and compose the most plausible answer [7]. LLMs do not pay the same attention to their whole context window, meaning the placement of documents in the prompt directly affects the final output [8]. On top of that, some of the retrieved documents may be unrelated to the task, or they may contain information that contradicts the parametric knowledge of the LLM [9]. An LLM has to overcome these challenges to leverage RAG to its advantage. Xu et al. [4] have shown that an open-source 70B LLM [10] equipped with RAG can beat proprietary models, meaning it is not necessary to use an LLM of the caliber of GPT-4 [11] to implement RAG. Still, for many use cases, it might not be feasible to deploy a 70B LLM as it is computationally demanding. To decrease the computational demand of LLMs, post-training quantization can be used. Quantization drastically reduces the amount of RAM required to load a model and can increase inference speed by more than 3 times [12, 13]. Despite the benefits, quantization affects LLMs differently depending on their size [14]. For capabilities that are important to RAG, such as long-context reasoning, smaller LLMs (<13B) are found to be more sensitive to quantization [14].

In this paper, we investigate the effect of quantization on RAG-enhanced 7B and 8B LLMs. We evaluate the full (FP16) and quantized (INT4) versions of multiple LLMs on two personalization tasks taken from the LaMP benchmark [15]. To better study how quantized LLMs perform in longer contexts, we compare the performance gap between FP16 and INT4 models with an increasing number of retrieved documents. We chose personalization because it is a challenging task to perform with RAG, as it demands long-context reasoning over many documents. Contrary to question answering, where the LLM has to find the correct answer in a couple of documents, personalization requires the LLM to carefully study a person's style from all the provided documents. Our findings show that the effect of quantization depends on the model and the task: we find almost no drop in performance for OpenChat, while LLaMA2 seems to be more sensitive. Our experiments show that quantized smaller LLMs can be good candidates for RAG pipelines, especially if efficiency is essential.
2. Approach

2.1. LLMs

Starting with LLaMA2-7B (Chat-hf) [10] as a baseline, we experiment with the following LLMs: LLaMA3-8B [16], Zephyr (Beta) [17], OpenChat (3.5) [18], and Starling (LM-alpha) [19]. These models were chosen because they were the highest-ranked 7B and 8B LLMs in the Chatbot Arena Leaderboard [20] according to the Elo ratings at the time of writing. Since all models except LLaMA are finetuned variants of Mistral-7B [21], we also add Mistral-7B (Instruct-v0.1)¹ to our experiments. We use Activation-aware Weight Quantization (AWQ) as it outperforms other quantization methods [22].

¹ Although there is an updated v0.2 version of Mistral-7B, we used v0.1 to match the other LLMs that are finetuned on it.
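For illustration, the sketch below shows how an FP16 checkpoint and an AWQ-quantized INT4 export can be loaded with the Hugging Face transformers library. This is a minimal sketch, not the exact code used in the experiments; the checkpoint names are examples, and loading AWQ checkpoints this way assumes a recent transformers version with the autoawq package installed.

```python
# Illustrative sketch (not the exact experiment code): loading an FP16 model
# and an AWQ-quantized INT4 counterpart with Hugging Face transformers.
# The checkpoint names below are examples, not necessarily those used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

fp16_id = "openchat/openchat_3.5"        # example FP16 checkpoint
awq_id = "TheBloke/openchat_3.5-AWQ"     # example AWQ (INT4) export

tokenizer = AutoTokenizer.from_pretrained(fp16_id)

# Full-precision baseline (FP16).
model_fp16 = AutoModelForCausalLM.from_pretrained(
    fp16_id, torch_dtype=torch.float16, device_map="auto"
)

# AWQ checkpoints store quantized INT4 weights; recent transformers versions
# detect the quantization config in the checkpoint (requires the autoawq package).
model_int4 = AutoModelForCausalLM.from_pretrained(awq_id, device_map="auto")
```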
LaMP-3U, k = 0:
"Give a score between [1, 2, 3, 4, 5] to the following review. Only output the score and nothing else. Review: I purchased this elsewhere but wanted to leave my review here. I LOVE this product. Score: 5"

LaMP-3U, k > 0:
"Here are a couple of review-rating pairs of a user: {examples}. With the given examples, give a score between [1, 2, 3, 4, 5] to the following review by the same user. Only output the score and nothing else. Review: I purchased this elsewhere but wanted to leave my review here. I LOVE this product. Score: 5"

LaMP-5U, k = 0:
"Your task is to generate a title for the given abstract. You will only output the title and nothing else. Abstract: Code diversification is an effective mitigation against return-oriented programming attacks […] Title: Instruction Displacement: A Code Diversification Technique for Improved ROP Mitigation"

LaMP-5U, k > 0:
"Here are a couple of abstract-title pairs of an author: {examples}. With the given examples, generate a title for the given abstract by the same author. Only output the title and nothing else: Abstract: Code diversification is an effective mitigation against return-oriented programming attacks […] Title: Protecting Third-Party Binaries from Code Reuse Attacks through Instruction Displacement"

Figure 1: Prompts used for both datasets. The first prompt for each dataset represents k = 0 (zero-shot, no retrieved documents) and the second is for k > 0 settings (RAG). The text following "Score:" and "Title:" is the model output. Line endings are not shown for space reasons.

2.2. Tasks and Datasets

We use the LaMP benchmark, which offers 7 personalization datasets with either a classification or a generation task [15]. To represent both types of tasks, we chose one dataset from each: LaMP-3 ("Personalized Product Rating") and LaMP-5 ("Personalized Scholarly Title Generation"). LaMP-3 is composed of product reviews and their corresponding scores. For each user, one of the review–score pairs is chosen as the target and the other pairs become the user profile. The LLM's task, in this case, is to predict the score of a review using the other review–score pairs of the same user. LaMP-5 aims to generate a title for an academic paper based on the abstract. In this case, the user profile consists of abstract–title pairs that demonstrate the writing style of the user (a scholar). The task of the LLM is to generate a title for the given abstract by incorporating the writing style of the scholar. These datasets were chosen because, compared to the other ones, they had on average more samples in their user profiles, and the samples were longer. Therefore, they represented a better opportunity to evaluate RAG effectiveness, as the retrieval part would be trickier.

We work with the user-based splits (LaMP-3U, LaMP-5U), where each user appears in only one of the data splits [15]. The labels for the test sets are not publicly available (results can be obtained by submitting predictions to the leaderboard), and since we did not fine-tune our models, we chose to use the validation sets for evaluation. For both datasets, we noticed that some samples do not fit in the context windows. After analyzing the overall length of the samples, we concluded that these cases represent only a tiny minority and removed data points that are not within the 0.995th percentile. For LaMP-5U, we also removed abstracts that consisted only of the text "no abstract available". There are 2500 samples in the validation sets, and we have 2487 samples left after the preprocessing steps for both datasets.

2.2.1. Evaluation

We used mean absolute error (MAE) for LaMP-3 and Rouge-L [23] for LaMP-5, following the LaMP paper [15]. Their experiments also include root mean square error (RMSE) and Rouge-1 scores, but we found that the correlation between MAE and RMSE is 0.94, and between Rouge-1 and Rouge-L is 0.99. Therefore, we do not include those metrics in our results. The prompts we use are shown in Figure 1. Even though the LLMs are instructed to output only the score or the title, we notice that some are prone to give lengthy answers such as "Sure, here is the title for the given abstract, Title: (generated title)". We apply a post-processing step on the LLM outputs to extract only the score or the title before evaluation.
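As an illustration of this post-processing and scoring step, a minimal sketch is shown below. It is not the exact code used in the experiments: the regular expression, the title-stripping heuristic, and the use of the rouge-score package are assumptions.

```python
# Minimal sketch of output post-processing and scoring (illustrative only).
# Assumes the rouge-score package for Rouge-L; the extraction heuristics are
# examples, not the exact ones used in the experiments.
import re
from rouge_score import rouge_scorer

def extract_score(output: str) -> int | None:
    """Pull the first rating in [1, 5] out of a possibly verbose LLM answer."""
    match = re.search(r"[1-5]", output)
    return int(match.group()) if match else None

def extract_title(output: str) -> str:
    """Strip a leading 'Title:' (and similar chatter) from a generated title."""
    return output.split("Title:")[-1].strip().strip('"')

def mae(predictions: list[int], labels: list[int]) -> float:
    """Mean absolute error for LaMP-3U."""
    return sum(abs(p - l) for p, l in zip(predictions, labels)) / len(labels)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(prediction: str, reference: str) -> float:
    """Rouge-L F-measure for LaMP-5U."""
    return scorer.score(reference, prediction)["rougeL"].fmeasure
```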
2.3. Retrieval

We conduct the experiments with the following numbers of retrieved documents: k ∈ {0, 1, 3, 5, max_4K, max_8K}. Here, 0 refers to the zero-shot setting without any retrieval, and max is the maximum number of documents that can be put into the prompt, given the context window of the LLM. LLaMA2 has a context window of 4096 tokens, while the other models have 8192 tokens. To make the comparison fair, we include two options for the max setting: 4K and 8K. For max_4K, we assume that all models have a 4096-token context window; for max_8K, we use the original 8192-token context windows. Consequently, LLaMA2-7B is not included in the max_8K experiments. To put this into perspective, the number of retrieved documents varies between 15 and 18 for max_4K and between 25 and 28 for max_8K in LaMP-5U, depending on the average length of the documents in the user profile. As retrievers, we evaluate BM25 [24] (BM25 Okapi)², Contriever [25] (finetuned on MS MARCO), and DPR [26] (finetuned on Natural Questions)³. Since we focus on efficiency by reducing the computational load, the retrievers are not finetuned on the datasets.

² https://pypi.org/project/rank-bm25/
³ https://huggingface.co/facebook/dpr-question_encoder-single-nq-base
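The sketch below illustrates the per-user retrieval and the max_4K / max_8K packing described above. It is a simplified illustration under assumptions: the rank_bm25 BM25Okapi class (footnote ²) with whitespace tokenization, a Hugging Face tokenizer for token counting, and a flat token budget; the exact prompt template and token accounting in the experiments may differ.

```python
# Illustrative sketch of per-user BM25 retrieval and context-window packing
# (assumptions: rank_bm25 for BM25 Okapi, whitespace tokenization, a simplified
# token budget; not the exact experiment code).
from rank_bm25 import BM25Okapi
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openchat/openchat_3.5")  # example model

def retrieve(profile_docs: list[str], query: str, k: int) -> list[str]:
    """Rank a user's profile documents against the query with BM25 Okapi."""
    bm25 = BM25Okapi([doc.split() for doc in profile_docs])
    scores = bm25.get_scores(query.split())
    ranked = sorted(zip(scores, profile_docs), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:k]]

def pack_max_k(ranked_docs: list[str], budget_tokens: int) -> list[str]:
    """Emulate the max_4K / max_8K settings: add ranked documents until the
    token budget (context window minus room for instructions and output) is hit."""
    packed, used = [], 0
    for doc in ranked_docs:
        n_tokens = len(tokenizer.encode(doc))
        if used + n_tokens > budget_tokens:
            break
        packed.append(doc)
        used += n_tokens
    return packed
```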
3. Results

3.1. LLMs

The results are shown in Table 1. We see the dominance of OpenChat in both datasets. Zephyr performs very close to OpenChat in LaMP-3U but falls far behind in LaMP-5U. The same can be said for Starling, but reversed. Mistral-7B performs stably in both datasets, albeit slightly behind OpenChat. Overall, LLaMA2 performs the worst, as it is below average in both datasets. Despite being the dominant small LLM at the time of writing, LLaMA3 is not the best model for either task, although it performs reasonably well in LaMP-3U. Interestingly, LLaMA3 has the best zero-shot score in LaMP-5U, but it struggles to improve with retrieval.

Table 1: FP16 scores and the percentage change between FP16 and INT4 scores, using Contriever. More than a 5% drop in performance is highlighted in red in the original table. For MAE, lower is better, while the inverse is true for Rouge-L.

LaMP-3U (MAE ↓)
  k        LLaMA2          OpenChat        Starling        Zephyr          Mistral         LLaMA3
           FP16   INT4     FP16   INT4     FP16   INT4     FP16   INT4     FP16   INT4     FP16   INT4
  0        0.684  +2.9%    0.440  -7.8%    1.603  +45%     0.435  -14.7%   0.569  -2.5%    0.481  -5.9%
  1        0.453  -1.1%    0.312  +5.5%    0.800  +7.1%    0.300  +1.9%    0.461  -9.3%    0.364  -10.8%
  3        0.637  -7.6%    0.256  +2.8%    0.718  -30.0%   0.273  +2.6%    0.404  -8.0%    0.320  -9.2%
  5        0.724  -23.3%   0.238  +1.8%    0.797  -32.0%   0.266  +0.8%    0.380  -8.1%    0.305  -13.0%
  max_4K   0.508  -80.2%   0.224  +1.8%    0.985  -57.1%   0.237  -4.4%    0.346  -14.7%   0.285  -23.1%
  max_8K   -      -        0.257  -3.9%    1.352  -1.1%    0.392  -6.3%    0.368  -16.1%   0.288  -23.4%

LaMP-5U (Rouge-L ↑)
  k        LLaMA2          OpenChat        Starling        Zephyr          Mistral         LLaMA3
           FP16   INT4     FP16   INT4     FP16   INT4     FP16   INT4     FP16   INT4     FP16   INT4
  0        0.338  -0.6%    0.361  -0.5%    0.359  -2.3%    0.335  -0.4%    0.361  +1.2%    0.384  -2.5%
  1        0.380  -9.7%    0.404  -1.0%    0.397  -1.0%    0.360  +0.9%    0.400  -0.9%    0.402  -5.6%
  3        0.385  -11.6%   0.415  0.0%     0.412  -0.2%    0.360  -0.8%    0.410  -0.5%    0.404  -5.2%
  5        0.374  -10.3%   0.422  -0.7%    0.419  -1.4%    0.365  -2.1%    0.415  -0.7%    0.397  -4.5%
  max_4K   0.337  -16.8%   0.419  -1.1%    0.402  -0.5%    0.357  -7.7%    0.410  -1.1%    0.376  -2.6%
  max_8K   -      -        0.395  -1.0%    0.379  -1.9%    0.326  -18.7%   0.387  -0.8%    0.384  -7.2%

Figure 2: Results for both datasets. (a) MAE results for LaMP-3U (lower is better); (b) Rouge-L results for LaMP-5U (higher is better). The upper and lower borders of each colored area represent the quantized and non-quantized performances of the models, and the corresponding lines are the mean of both.

3.2. Quantization

How much an LLM is affected by quantization seems to be related to how well it performs the task. OpenChat suffers almost no performance degradation from quantization. On the contrary, LLaMA2 seems very sensitive, especially when the number of retrieved documents is increased. Starling suffers no significant consequence from quantization in LaMP-5U, where it performs well, but it does suffer in LaMP-3U. There also seems to be a disparity between the tasks, as quantized LLMs perform much worse in LaMP-3U than in LaMP-5U.

3.3. Number of retrieved documents

Figure 2 shows that LLM performance saturates with a couple of documents, and the improvement obtained from more is marginal. In LaMP-5U, adding more than 5 documents starts to hurt performance: Figure 2b shows an inverse-U-shaped distribution for all LLMs except LLaMA3. For some models, performance even drops below the zero-shot setting when all of the available context window is filled with retrieved documents. The LaMP-3U results (Figure 2a) continue to improve after adding more than 5 documents, as can be observed from the max_4K scores, but MAE also starts to get worse for max_8K.

We analyze whether a longer context window hurts the quantized variants more and find a peculiar relationship. INT4 LLaMA2 suffers from longer contexts, while INT4 OpenChat performs well and acts almost the same as its FP16 counterpart. INT4 Mistral and LLaMA3 act very similarly to their FP16 counterparts in LaMP-5U, but in LaMP-3U they get progressively worse with more documents. Overall, quantization can increase the risk of worsened long-context capabilities, but there is no direct relationship, as it is highly task- and context-dependent.

3.4. Retrievers

Figure 2 shows that the three retrievers gave almost identical results, with BM25 marginally behind the others. Also, in LaMP-5U, the gap between FP16 and INT4 LLaMA2 varies slightly across retrievers. Other than that, the retriever model does not have a noticeable impact on the personalization tasks we experimented with. The patterns we found regarding LLMs, quantization, and the number of retrieved documents are the same for all the retrievers.

3.5. Benchmark comparison

Finally, we compared our findings with the RAG results from LaMP [15]. Table 2 shows that OpenChat (Contriever, k = 5) can beat FlanT5-XXL in LaMP-3U and performs very close to it in LaMP-5U. More importantly, its quantized counterpart has very similar results. Since we do not have the per-sample scores for the baseline models from LaMP, we perform a one-sample t-test on the Rouge-L scores. The corresponding p value of 0.29 shows a non-significant difference between the results. Moreover, the results from the LaMP paper were obtained with finetuned retrievers, while ours are obtained with non-finetuned retrievers. This indicates that a quantized 7B LLM can compete with and even outperform a bigger model on personalization with RAG.

Table 2: Our results compared with LaMP. Results indicated with * are not significantly lower than the reported best result (FlanT5-XXL). A quantized 7B LLM can perform on par with a larger model while being much more efficient.

                     LaMP-3U (MAE) ↓   LaMP-5U (Rouge-L) ↑   Required VRAM ↓
  ChatGPT [15]       0.658             0.336                 ?
  FlanT5-XXL [15]    0.282             0.424                 43 GB
  OpenChat (FP16)    0.238             0.423*                28 GB
  OpenChat (INT4)    0.234             0.419*                4.2 GB

Table 2 also shows how much GPU VRAM is needed to deploy each model. With this comparison, the benefit of quantization becomes more pronounced: multiple high-end consumer GPUs or an A100 are necessary to run even a 7B LLM, while a mid-range consumer GPU (e.g., an RTX 3060) would be enough to run it with quantization. According to the scores taken from LaMP, both FlanT5-XXL and OpenChat decisively beat ChatGPT, but the authors warn that the prompts used for ChatGPT may not be ideal and may contribute to sub-optimal performance. Therefore, our results should not be used to make a comparison with ChatGPT.
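The significance test above can be sketched as follows. This is a minimal illustration assuming SciPy: the per-sample Rouge-L values below are placeholders, not the actual scores; only the reported FlanT5-XXL mean (0.424) comes from Table 2.

```python
# Minimal sketch of the one-sample t-test in Section 3.5 (illustrative values).
# We compare per-sample Rouge-L scores against the single reported FlanT5-XXL
# mean, since per-sample baseline scores are not available.
from scipy import stats

per_sample_rouge_l = [0.41, 0.38, 0.45, 0.44, 0.39]  # placeholder per-sample scores
reported_baseline_mean = 0.424                        # FlanT5-XXL Rouge-L from LaMP

t_stat, p_value = stats.ttest_1samp(per_sample_rouge_l, popmean=reported_baseline_mean)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")  # p > 0.05 -> no significant difference
```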
4. Discussion

Our results show that some LLMs (in particular OpenChat) can be successful in RAG pipelines, even after quantization, but the performance is LLM- and task-dependent. The method of quantization affects LLMs differently [14]. Thus, the relationship between quantization and RAG performance is not straightforward and can be studied more extensively. Still, our results indicate that when a small LLM performs the task well, its AWQ-quantized counterpart performs on par.

The differing performance of some LLMs between the datasets may be partly due to prompting. LLMs are sensitive to prompts, and a prompt that works for one LLM may not work for another [27]. The most peculiar result is the lackluster performance of LLaMA3 in LaMP-5U. LLaMA3 is a recently released model trained on an extensive pretraining corpus [16]. It therefore has a higher chance of having seen the abstracts presented in LaMP-5U during pretraining, which may explain its superior zero-shot performance. LLMs suffer from a knowledge conflict between their parametric information and the contextual information presented through retrieval [9]. If LLaMA3 had already memorized some of the titles of the abstracts in LaMP-5U, presenting similar abstract–title pairs of the same author might result in a knowledge conflict. This may explain the reduced improvement in its performance with retrieval.

LLMs have been shown to struggle with too many retrieved documents [8], and our findings are in accordance. Our results indicate that more than 5 documents do not help and can even hurt performance. From prior studies, we know that LLMs focus more on the bottom and the top of their context window [8]. We progressively place the most relevant documents starting from the bottom towards the top; as a consequence, especially in the k = max settings, the less relevant documents end up at the top. This might hurt LLM performance, as the model would then focus on both the most and the least related information. That being said, state-of-the-art LLMs with more than 7B parameters also suffer from the same phenomenon, even when not quantized [8]. Although quantization increases the risk of worsened long-context performance, we cannot conclude that it is the sole perpetrator, as this is an inherent problem for all LLMs.
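To make the document ordering discussed above concrete, a minimal sketch of how the retrieved examples could be placed in the prompt is shown below, with the most relevant example closest to the bottom. The template is simplified; the exact prompt wording used in the experiments is the one shown in Figure 1.

```python
# Illustrative sketch of the document ordering discussed above: the retriever
# returns documents from most to least relevant, and we place them so the most
# relevant one sits at the bottom of the context, right above the query.
# Simplified template, not the exact prompt from Figure 1.
def build_prompt(ranked_examples: list[str], query: str) -> str:
    # Reverse the relevance ranking so the most relevant example comes last
    # (closest to the query); the least relevant ends up at the top.
    ordered = list(reversed(ranked_examples))
    examples_block = "\n".join(ordered)
    return (
        "Here are a couple of review-rating pairs of a user:\n"
        f"{examples_block}\n"
        "With the given examples, give a score between [1, 2, 3, 4, 5] "
        "to the following review by the same user. Only output the score "
        f"and nothing else. Review: {query} Score:"
    )
```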
5. Conclusion

We have shown that quantized smaller LLMs can use RAG to perform complex tasks such as personalization. Even though quantization might decrease the ability of LLMs to analyze long contexts, this effect is task- and LLM-dependent. An LLM that performs well on a task does not lose much of its long-context ability when quantized. Thus, we conclude that quantized 7B LLMs can be the backbones of RAG with long contexts. The reduced computational load obtained from quantization makes it possible to run RAG applications on more affordable and accessible hardware. For future work, more quantization methods can be included in the experiments to see whether the findings replicate across methods. We can also extend the set of k values, especially between k = 5 and k = max_4K, and change the order of the documents to better understand how quantized LLMs use their context windows.

Acknowledgments

This research is part of the project LESSEN with project number NWA.1389.20.183 of the research program NWA-ORC 2020/21, which is (partly) funded by the Dutch Research Council (NWO).

References

[1] J. Huang, W. Ping, P. Xu, M. Shoeybi, K. C.-C. Chang, B. Catanzaro, RAVEN: In-context learning with retrieval-augmented encoder-decoder language models, 2023. arXiv:2308.07922.
[2] X. Ma, Y. Gong, P. He, H. Zhao, N. Duan, Query rewriting for retrieval-augmented large language models, 2023. arXiv:2305.14283.
[3] W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, W.-t. Yih, REPLUG: Retrieval-augmented black-box language models, 2023. arXiv:2301.12652.
[4] P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian, E. Bakhturina, M. Shoeybi, B. Catanzaro, Retrieval meets long context large language models, 2023. arXiv:2310.03025.
[5] Z. Proser, Retrieval augmented generation (RAG): Reducing hallucinations in GenAI applications, 2023. URL: https://www.pinecone.io/learn/retrieval-augmented-generation/.
[6] R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, J. Schulman, WebGPT: Browser-assisted question-answering with human feedback, 2022. arXiv:2112.09332.
[7] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-augmented generation for large language models: A survey, 2024. arXiv:2312.10997.
[8] N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, P. Liang, Lost in the middle: How language models use long contexts, 2023. arXiv:2307.03172.
[9] R. Xu, Z. Qi, C. Wang, H. Wang, Y. Zhang, W. Xu, Knowledge conflicts for LLMs: A survey, 2024. arXiv:2403.08319.
[10] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open foundation and fine-tuned chat models, 2023. arXiv:2307.09288.
[11] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[12] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient finetuning of quantized LLMs, 2023. arXiv:2305.14314.
[13] E. Frantar, S. Ashkboos, T. Hoefler, D. Alistarh, GPTQ: Accurate post-training quantization for generative pre-trained transformers, 2023. arXiv:2210.17323.
[14] S. Li, X. Ning, L. Wang, T. Liu, X. Shi, S. Yan, G. Dai, H. Yang, Y. Wang, Evaluating quantized large language models, 2024. arXiv:2402.18158.
[15] A. Salemi, S. Mysore, M. Bendersky, H. Zamani, LaMP: When large language models meet personalization, 2023. arXiv:2304.11406.
[16] Meta, Introducing Meta Llama 3: The most capable openly available LLM to date, 2024. URL: https://ai.meta.com/blog/meta-llama-3/.
[17] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, T. Wolf, Zephyr: Direct distillation of LM alignment, 2023. arXiv:2310.16944.
[18] G. Wang, S. Cheng, X. Zhan, X. Li, S. Song, Y. Liu, OpenChat: Advancing open-source language models with mixed-quality data, 2023. arXiv:2309.11235.
[19] B. Zhu, E. Frick, T. Wu, H. Zhu, J. Jiao, Starling-7B: Improving LLM helpfulness & harmlessness with RLAIF, 2023.
[20] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, I. Stoica, Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023. arXiv:2306.05685.
[21] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[22] J. Lin, J. Tang, H. Tang, S. Yang, X. Dang, C. Gan, S. Han, AWQ: Activation-aware weight quantization for LLM compression and acceleration, 2023. arXiv:2306.00978.
[23] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.
[24] S. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-3, 1994.
[25] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave, Unsupervised dense information retrieval with contrastive learning, 2022. arXiv:2112.09118.
[26] V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, 2020. arXiv:2004.04906.
[27] M. Sclar, Y. Choi, Y. Tsvetkov, A. Suhr, Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting, 2023. arXiv:2310.11324.