                         Evaluating Poro-34B-Chat and Mistral-7B-Instruct-v0.1:
                         LLM System Description for ELOQUENT at CLEF 2024
                         Vasumathi Neralla1 , Sander Bijl de Vroe1
                         1
                             Silo AI


                                          Abstract
In this paper we describe the application of the multilingual Large Language Model (LLM) Poro-34B-Chat and of Mistral-7B-Instruct-v0.1 to tasks in the ELOQUENT test suite. Poro 34B is currently the leading LLM for Finnish and has been further fine-tuned on a collection of English instruction datasets and automatically translated Finnish instructions. We describe how the user and system prompts are constructed for each task, focusing on the topical competence, robustness, and Voight-Kampff tasks. Furthermore, we document the inference engine responsible for generation, including all relevant sampling parameters, to ensure reproducibility of our results.

                                          Keywords
                                          Large Language Models, Evaluation, Multilingual, Low-Resource




                         1. Introduction
LLMs have recently made rapid advances, achieving state-of-the-art results on virtually every NLP task and proving useful in a wide range of practical applications. However, the evaluation of LLMs still leaves much to be desired, given the challenge of assessing free-form generated output compared with the closed-form outputs common to traditional NLP classification tasks. To support the development of more strategically targeted generative evaluation efforts, we submit Poro-34B-Chat to the ELOQUENT evaluation shared task.
   At the time of publication, Poro 34B was the state-of-the-art Finnish LLM, even at a small fraction of its training tokens, constituting clear evidence that multilingual LLM training can provide a substantial benefit over monolingual training for under-resourced languages.
   As far as we are aware, this is the first evaluation of this model in a shared task, and the first evaluation of its interaction with other LLM outputs. We also evaluate Mistral-7B-Instruct-v0.1 in the same shared task, providing a comparison with another open-source model. In the following section we provide background on Poro 34B, Poro-34B-Chat, and Mistral-7B-Instruct-v0.1, before sharing details of our model inference setup per task.


                         2. Background
                         2.1. Poro 34B
The base model Poro 34B uses a decoder-only architecture closely matching BLOOM [1] and FinGPT [2]. The model uses 56 attention heads, 54 layers, and a hidden dimension of 7168. It uses ALiBi [3] as its positional encoding method, and applies an additional layer normalization after the input embedding layer for more robust training.
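   For concreteness, the architecture hyperparameters above can be summarized in the following Python sketch; the field names are ours and purely illustrative, not those of the released model configuration.

  # Illustrative summary of the Poro 34B architecture described above;
  # field names are ours, not those of the released configuration file.
  poro_34b_architecture = {
      "type": "decoder-only, BLOOM-like",
      "num_attention_heads": 56,
      "num_layers": 54,
      "hidden_size": 7168,
      "head_dim": 7168 // 56,       # = 128
      "positional_encoding": "ALiBi",
      "embedding_layernorm": True,  # extra LayerNorm after the input embedding layer
  }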
   Poro 34B is pre-trained on 1T tokens. The majority of its English tokens are derived from SlimPajama [4], while the Finnish tokens were collected from Finnish portions of ParseBank, Wikipedia and Reddit, and from the Finnish media company YLE, among other sources. The StarCoder [5] dataset provides training tokens for code. A sequence length of 2048 and a batch size of 2048 were used. A decaying cosine learning rate scheduler was employed, with a 10B-token linear warmup, a maximum of 1.5e-4, and a minimum of 1e-5 for the last 10B training tokens.

                          CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
                          $ vasumathi.neralla@silo.ai (V. Neralla); sander.bijldevroe@silo.ai (S. Bijl de Vroe)
                                       © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


Training was performed on the LUMI supercomputer using the FinGPT [2] fork of Megatron-DeepSpeed, compatible with the cluster's AMD MI250X GPUs. A 128-node configuration was chosen, matching a data parallel degree of 128. LUMI runs on fully renewable electricity, so the total carbon emissions for GPU usage amounted to 0 tCO2eq.
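  As an illustration of the schedule described above, the following Python sketch gives one plausible reading of it; the exact decay horizon and the handling of the final constant phase are our assumptions and may differ from the released training configuration.

  import math

  # Approximate Poro 34B learning rate schedule (one reading of the description above).
  WARMUP_TOKENS = 10e9   # 10B-token linear warmup
  TOTAL_TOKENS = 1e12    # 1T training tokens in total
  FLOOR_TOKENS = 10e9    # minimum LR assumed to be held for the last 10B tokens
  LR_MAX, LR_MIN = 1.5e-4, 1e-5

  def learning_rate(tokens_seen: float) -> float:
      if tokens_seen < WARMUP_TOKENS:
          return LR_MAX * tokens_seen / WARMUP_TOKENS
      if tokens_seen > TOTAL_TOKENS - FLOOR_TOKENS:
          return LR_MIN
      # Cosine decay from LR_MAX to LR_MIN between the warmup and the final phase.
      progress = (tokens_seen - WARMUP_TOKENS) / (TOTAL_TOKENS - WARMUP_TOKENS - FLOOR_TOKENS)
      return LR_MIN + 0.5 * (LR_MAX - LR_MIN) * (1.0 + math.cos(math.pi * progress))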
  At the time of publication, Poro 34B was the strongest Finnish language model of its size, outperforming FinGPT on FIN-Bench [2] after just 10% of its training tokens. It also shows strong performance on English despite the emphasis on Finnish in its training, and, owing to its relatively large proportion of code training tokens, exhibits favorable performance on programming tasks.

2.2. Poro-34B-Chat
The base model alone does not consistently manage to follow the various task instructions without
further training, so we evaluate the fine-tuned version of Poro 34B, Poro-34B-Chat [6, 7] (unpublished).
Poro-34B-Chat was instruction fine-tuned [8] on both Finnish and English instruction-response pairs,
using code built on the Alignment Handbook [9]. Full-parameter supervised fine-tuning (SFT) was
employed.
   Since Finnish is under-resourced when it comes to instruction data, a machine translation approach
was used, with the Poro 34B base model itself translating instruction datasets from English to Finnish.
The English instruction data was compiled from OASST2 [10], the Dolly dataset [11], and an Argilla
SFT dataset [12].
   The Finnish and cross-lingual data [13] were constructed using a variety of methods. The instruction
data was obtained by automatically translating a curated set of the English instructions with Poro 34B
(alongside a number of heuristics and filtering steps). Furthermore, Poro 34B was used to generate
English-Finnish translation data and language identification data, and the collection was supplemented
with Finnish paraphrase data; see [13] for details. Combined, these sources result in a data mixture of 40% English, 40% Finnish, and 20% cross-lingual data.

2.3. Mistral-7B-Instruct-v0.1
We also assessed Mistral-7B-Instruct-v0.1 [14], a fine-tuned version of the Mistral-7B base model, on
the same tasks. Mistral does not publicly share the specific training data composition for either model.
However, they state that Mistral-7B-Instruct-v0.1 was fine-tuned without proprietary data, using only
instruction datasets publicly available on HuggingFace.


3. Methods and Tasks
To produce model responses we use an internal, OpenAI-compatible model service. The service uses vLLM (v0.2.4) as the inference engine.
  Most of the vLLM sampling parameters can be kept constant between tasks:

 SamplingParams(
     n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0,
     repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1,
     min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0,
     early_stopping=False, stop=[], stop_token_ids=[],
     include_stop_str_in_output=False, ignore_eos=False, max_tokens=1630,
     logprobs=None, prompt_logprobs=None, skip_special_tokens=True,
     spaces_between_special_tokens=True,
 )
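   For readers reproducing the setup without access to the internal service, a minimal local equivalent using vLLM directly could look as follows; the model identifier, prompt, and reduced parameter set are illustrative, and in practice all requests were routed through the OpenAI-compatible service.

 from vllm import LLM, SamplingParams

 # Minimal local sketch of the inference setup; the actual runs went through an
 # internal OpenAI-compatible service backed by vLLM v0.2.4.
 llm = LLM(model="LumiOpen/Poro-34B-chat")
 sampling_params = SamplingParams(temperature=0.7, top_p=1.0, top_k=-1, max_tokens=1630)

 outputs = llm.generate(["Title: risks and benefits of keeping a pet cat"], sampling_params)
 print(outputs[0].outputs[0].text)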

   The main parameters of interest per task are then the user_prompt and the system_prompt. Below we describe the handling of these input parameters for each task. Note that we chose not to participate in the HalluciGen task.
3.1. Topical Competence
There were two steps in the Topical Competence task:

   1. Generate questions on the specified topic

   2. Generate answers for the questions generated in Step 1

  When generating questions (Step 1), we found that the system benefited from using both the title and description fields. Our final user_prompt was defined as the following string:
  user_prompt = "Title: " + topic_title + "\n" + "Description: " + topic_description

  We include an example of the task below:
  Step 1
  system_prompt:
  “Create a set of questions that can be used to assess if someone knows about the following topic:”
  user_prompt:
  “Title: risks and benefits of keeping a pet cat
  Description: A set of questions to help decide if one should get a pet cat or not.”
  model_output:
  “Sure, here are some questions that could be used to assess someone’s knowledge about the risks and
benefits of keeping a pet cat: list_of_output_questions”

  Step 2
  system_prompt:
  “Answer the following questions:”
  user_prompt:
  model_output_from_step_1
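   The two steps can be chained through any OpenAI-compatible client. The sketch below is illustrative only: the endpoint, API key, and model identifier are placeholders for our internal deployment, and the chat-template handling of the deployed service is simplified.

 from openai import OpenAI

 # Placeholders: base_url, api_key and model identifier depend on the deployment.
 client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
 MODEL = "LumiOpen/Poro-34B-chat"

 def chat(system_prompt: str, user_prompt: str) -> str:
     # Omit the system message entirely when it is empty.
     messages = [{"role": "system", "content": system_prompt}] if system_prompt else []
     messages.append({"role": "user", "content": user_prompt})
     response = client.chat.completions.create(model=MODEL, messages=messages, temperature=0.7)
     return response.choices[0].message.content

 # Example topic taken from the worked example above.
 topic_title = "risks and benefits of keeping a pet cat"
 topic_description = "A set of questions to help decide if one should get a pet cat or not."

 # Step 1: generate questions from the topic title and description.
 user_prompt = "Title: " + topic_title + "\n" + "Description: " + topic_description
 questions = chat("Create a set of questions that can be used to assess "
                  "if someone knows about the following topic:", user_prompt)

 # Step 2: answer the questions generated in Step 1.
 answers = chat("Answer the following questions:", questions)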


3.2. Robustness
For the robustness user_prompt we utilized the prompt provided with each entry. Since Poro-34B-Chat was trained on English, Finnish, and code tokens, we evaluated the model only on the English variants of the entries. We found that the model was able to provide sensible outputs with an empty system prompt.
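   In terms of the illustrative client sketched in Section 3.1, this amounts to sending the entry's prompt as the sole user message, with no system message; the entry text below is a placeholder.

 # Robustness: the provided (English) entry prompt is passed through unchanged,
 # with no system message, reusing the illustrative chat() helper from Section 3.1.
 entry_prompt = "<prompt provided with the robustness entry>"
 output = chat("", entry_prompt)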

3.3. Voight-Kampff
The output of this task will serve as source material for assessing the capability to distinguish between text authored by humans and text generated by machines. About 500 words are generated based on the given summary or topic. As for the robustness task, we use the provided prompt and do not apply a system prompt.


4. Conclusion
We have presented here a straightforward application of Poro-34B-Chat and Mistral-7B-Instruct-v0.1 to
the tasks in the ELOQUENT test suite. Poro 34B is the state-of-the-art multilingual LLM for Finnish,
and its fine-tuned version Poro-34B-Chat was created using a mixture of English instruction datasets
and automatically translated Finnish instructions. Mistral-7B-Instruct-v0.1 provides an open-source
model for comparison. With a simple prompting approach, Poro-34B-Chat is able to follow instructions
on the chosen tasks.
Acknowledgments
We would like to thank Jonathan Burdge, Kai Hakala, Mark van Heeswijk, Andrey Ivannikov, Jussi
Karlgren, Aku Rouhe, Mittul Singh and Elaine Zosa for their insightful comments.


References
 [1] T. Le Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon,
     M. Gallé, et al., Bloom: A 176b-parameter open-access multilingual language model (2023).
 [2] R. Luukkonen, V. Komulainen, J. Luoma, A. Eskelinen, J. Kanerva, H.-M. Kupari, F. Ginter, V. Laip-
     pala, N. Muennighoff, A. Piktus, T. Wang, N. Tazi, T. Scao, T. Wolf, O. Suominen, S. Sairanen,
     M. Merioksa, J. Heinonen, A. Vahtola, S. Antao, S. Pyysalo, FinGPT: Large generative models
     for a small language, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Confer-
     ence on Empirical Methods in Natural Language Processing, Association for Computational
     Linguistics, Singapore, 2023, pp. 2710–2726. URL: https://aclanthology.org/2023.emnlp-main.164.
     doi:10.18653/v1/2023.emnlp-main.164.
 [3] O. Press, N. A. Smith, M. Lewis, Train short, test long: Attention with linear biases enables input
     length extrapolation, arXiv preprint arXiv:2108.12409 (2021).
 [4] D. Soboleva, F. Al-Khateeb, R. Myers, J. R. Steeves, J. Hestness, N. Dey, Slimpajama: A 627b token
     cleaned and deduplicated version of redpajama, 2023.
 [5] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim,
     et al., Starcoder: may the source be with you!, arXiv preprint arXiv:2305.06161 (2023).
 [6] Silogen, Poro 34b chat is here, 2024. URL: https://www.silo.ai/blog/poro-34b-chat-is-here, accessed
     on May 28th 2024.
 [7] LumiOpen, Poro-34b-chat, 2024. URL: https://huggingface.co/LumiOpen/Poro-34B-chat, accessed
     on May 28th 2024.
 [8] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
     K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welin-
     der, P. F. Christiano, J. Leike, R. Lowe, Training language models to follow instructions
     with human feedback, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh
     (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates,
     Inc., 2022, pp. 27730–27744. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/
     b1efde53be364a73914f58805a001731-Paper-Conference.pdf.
 [9] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, S. Huang, K. Rasul, A. M. Rush, T. Wolf, The
     alignment handbook, https://github.com/huggingface/alignment-handbook, 2023.
[10] A. Köpf, Y. Kilcher, D. von Rütte, S. Anagnostidis, Z. R. Tam, K. Stevens, A. Barhoum, D. Nguyen,
     O. Stanley, R. Nagyfi, et al., Openassistant conversations-democratizing large language model
     alignment, Advances in Neural Information Processing Systems 36 (2024).
[11] M. Conover, M. Hayes, A. Mathur, J. Xie, J. Wan, S. Shah, A. Ghodsi, P. Wendell, M. Zaharia,
     R. Xin, Free dolly: Introducing the world’s first truly open instruction-tuned llm, Company Blog
     of Databricks (2023).
[12] Argilla, 10k_prompts_ranked_mistral_large_responses, 2024. URL: https://huggingface.co/datasets/argilla/10k_
     prompts_ranked_mistral_large_responses, accessed on May 28th 2024.
[13] LumiOpen, instruction-collection-fin, 2024. URL: https://huggingface.co/datasets/LumiOpen/
     instruction-collection-fin, accessed on May 28th 2024.
[14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand,
     G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7b, arXiv preprint arXiv:2310.06825 (2023).