<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NCU-IISR: Prompt Engineering on GPT-4 to Stove Biological Problems in BioASQ 11b Phase B</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chun-Yu Hsueh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yu-Wei Lu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jen-Chieh Han</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wilailack Meesawad</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Tzong-Han Tsai</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Information Engineering, National Central University</institution>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Research Center for Humanities and Social Sciences</institution>
          ,
          <addr-line>Academia Sinica</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our system applied in BioASQ 11b phase b. We showcase prompt engineering strategies and outline our experimental steps. Building upon the success of ChatGPT/GPT-4 in answer generation and the field of biology, we developed a system that utilizes GPT-4 to answer biomedical questions. The system leverages OpenAI's ChatCompletions API and combines Prompt Engineering methods to explore various prompts. In addition, we also attempted to incorporate GPT-4 into our system from last year, which combines a BERT-based model and BERTScore. However, the standalone GPT-4 method outperformed this approach by a large margin. Ultimately, in our submission, we adopted what we believe to be the optimal prompts and achieved the highest scores in the second batch.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Biomedical Question Answer</kwd>
        <kwd>Large language models (LLMs)</kwd>
        <kwd>Generative Pre-trained Transformer</kwd>
        <kwd>Zeroshot</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        question. Thus, we framed the task as a query-based multi-document extraction (for the exact
answer) and summarization (for the ideal answer). In the previous year, we achieved the highest
result in generating ideal answers by using the BioBERT model in combination with linear
regression [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>This year, we observed GPT-4’s comprehension capabilities in the field of biology and its
advantages in answer generation. We therefore used GPT-4 for answer generation. Specifically,
in each batch we developed three or more systems. Particularly in System 1 and System 2, we
investigated the impact of prompts on answer generation. We employed Prompt Engineering
techniques to select the most suitable prompt. Both systems shared the same prompt, with
the only diference being that System 1 utilized GPT-3.5 while System 2 utilized GPT-4. As
for System 3, based on the results from the previous year’s competition, we found that our
research from last year performed well in generating ideal answers. Therefore, we improved
upon the previous year’s model and utilized its ideal answer for response generation. We relied
on System 2’s answers for exact answers.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Biomedical knowledge is often acquired by reading academic papers. This process is
timeconsuming and labor-intensive, and it requires a high level of professional expertise.
Biomedical professionals cannot quickly obtain the required knowledge in a short period of time. The
general public is also unable to acquire biomedical knowledge without expert assistance. QA
in natural language processing tasks has the potential to solve these problems by providing
direct answers to users’ questions. This tests machine learning systems’ ability to semantically
understand, retrieve, and generate answers from existing text. Many QA models based on deep
learning have been developed and applied in the past [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Well-trained Large language models: Well-trained large language models have emerged
as a powerful tool in natural language processing (NLP) tasks, particularly in question
answering (QA). These language models, such as GPT-3 and GPT-4, are trained on vast amounts of
text data and can understand and generate human-like responses.</p>
      <p>In NLP QA tasks, well-trained large language models have shown remarkable performance,
surpassing traditional methods and achieving state-of-the-art results. These models excel in
comprehending complex questions and generating accurate and contextually relevant answers.
They leverage their vast knowledge base to provide detailed explanations, supporting evidence,
and even generate creative responses.</p>
      <p>
        According to the GPT-4 Technical Report[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], GPT-4 demonstrates a high level of
understanding in the medical domain. This is evidenced by its 75% score on the Medical Knowledge
SelfAssessment Program test. Additionally, it obtained an outstanding score of 5 in the AP Biology
Exam, a feat accomplished by only 15% of the test takers. This indicates its strong performance
in biology-related questions. Therefore, we anticipate that GPT-4 will also deliver favorable
results in the BioASQ 11b Phase B task.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <sec id="sec-3-1">
        <title>3.1. Systems</title>
        <p>We use diferent systems in diferent batch. The detailed configuration of each system can be
seen in Table 1</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>This year’s competition provided 4,719 training data samples. Among them, there were 1,130
samples of the summary type, 1,417 samples of the factoid type, 901 samples of the list type,
and 1,271 samples of the yes/no type. On average, each question consisted of 12 snippets, with
an average length of 203 characters per snippet.</p>
        <p>Considering the token limit imposed by OpenAI’s API, we extracted only partial information
from the snippets. Specifically, in the first two batches, we selected the first 5 snippets and
truncated any excessively long sentences to 250 characters. In the subsequent two batches,
we input all the snippets. However, for snippets exceeding 250/300 characters, we utilized the
ChatCompletions API to perform summarization tasks. This ensured that sentence lengths
remained within 250/300 characters.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompting</title>
        <p>OpenAI’s ChatCompletions API adheres to a predefined format, requiring specific fields for
each message. In addition to the ”text” field, the ”role” field must be configured, which can be
categorized as ”system”, ”assistant”, or ”user”.</p>
        <p>• The system message: As an optional component, configures the assistant’s behavior. It
can alter the assistant’s personality or furnish explicit instructions regarding its conduct
throughout the conversation.
• The user message: Convey requests or comments that require responses from the
assistant.
• The assistant message: Retain prior assistant responses, while also allowing developers
to compose them as illustrative instances of desired behavior.</p>
        <p>Snippets: We observed that ChatGPT incorporates past responses, and we aim to
leverage this feature to achieve a similar efect of having ChatGPT read through snippets before
answering questions. Therefore, for the snippets, we adopt the format of assistant messages,
separating snippets from questions, and directly appending them before the questions. We do
not include any additional prompts.</p>
        <p>Questions: We experimented with various prompts to guide ChatGPT in generating the
desired responses. Ultimately, we opted for a direct approach where ChatGPT generates
responses in a fixed JSON format. This decision was driven by our observation that the Exact
Answer and Ideal Answer often have a certain degree of overlap. By combining both
questions in a single prompt, we encouraged ChatGPT to avoid generating completely unrelated
responses. Additionally, imposing a fixed response format greatly improved data processing
efifciency. Across the four batches, we employed similar prompts without significant variations.
Please refer to Table 2 for the details of the relevant prompts.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Strategy</title>
        <p>Prompts
Reply to the answer clearly and easily in less than 3 sentences. The first
question is:{QUESTION_BODY}
You can only use JSON format to answer my questions. The format must be
{”exact_answer”:””, ”ideal_answer”:””}, where exact_answer should be ”yes” or
”no”, and ideal_answer is a short conversational response starting with yes/no
then follow on the explain. The first question is:{QUESTION_BODY}
You can only use JSON format to answer my questions. The format must be
{”exact_answer”:[], ”ideal_answer”:””}, where exact_answer is a list of precise
key entities to answer the question, and ideal_answer is a short conversational
response containing an explanation. The first question is:{QUESTION_BODY}
You can only use JSON format to answer my questions. The format must be
{”exact_answer”:[], ”ideal_answer”:””}. where exact_answer is a list of precise
key entities to answer the question. ideal_answer is a short conversational
response containing an explanation. The first question is:{QUESTION_BODY}
Conclusion and summarize this context in less than {MAX_SNIPPET_LEN}
letters: {SNIPPET_BODY}
In terms of prompt engineering, we can incorporate certain cues or guidelines, in accordance
with the competition rules, to enhance the efectiveness of the responses. The following are
some of the strategies we have employed:
• In yes/no type questions, we restrict ChatGPT to only provide responses of ’yes’ or ’no.’</p>
        <p>
          This approach ensures ChatGPT avoids ambiguous answers.
• We enable ChatGPT to simultaneously respond to both an exact answer and an ideal
answer. This approach prevents situations where there are starkly diferent responses
between the two question-answer pairs. Additionally, simultaneous responses
encourage ChatGPT to explain its exact answer within the ideal answer. While we do not
have explicit experimental evidence, we believe that, similar to the concept of
Chainof-Thought[
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], having the language model explain its own answers can enhance the
accuracy of the responses.
• When presenting JSON format, we use quotation marks and square brackets to represent
strings and lists, respectively. We also provide additional textual descriptions to help
ChatGPT understand the expected type of answer it should provide.
• We have observed that the length of a code snippet can impact the length of the
generated response. Therefore, in summary-type questions, we limit ChatGPT to providing
answers in three sentences. This implicitly avoids excessively long responses without
explicitly specifying a specific word count. This approach helps prevent ChatGPT
artificially elongating short answers to the question or generating extremely long responses.
When formulating prompts, we intentionally avoid defining rules or restrictions in excessive
detail or complexity. Doing so could potentially result in responses lacking diversity.
Therefore, we leave some room for ChatGPT to explore and generate more varied answers, allowing
creativity within certain boundaries.
        </p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Procedure</title>
        <p>Prompt engineering is an experimental and iterative process that requires continuous
experimentation, evaluation, and improvement. Depending on the specific task and dataset, diferent
steps and combinations of methods may be necessary. The key is to adjust and optimize based
on actual circumstances to achieve the best model outputs. In our experiment, we followed the
following steps:
1. Definition: Confirm the specific task objectives and define the model’s input and output.
2. Analysis: Analyze the characteristics and specifications of the dataset.
3. Design: Design a prompt that combines diferent strategies.
4. Evaluation: Examine and analyze the output results.
5. Optimization: Attempt to optimize the strategies and explore combinations of diferent
methods.</p>
        <p>6. Iteration: Repeat steps 3 to 5 continuously until satisfactory output results are achieved.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Result</title>
      <p>The final scores can be obtained from the BioASQ competition results page. These scores are
categorized into Exact Answer (Table 3) and Ideal Answer (Table 4). In the Exact Answer
category, we included an additional FIN Score, which utilizes the same final ranking score
calculation method as the previous year. Although we do not yet have access to the Manual
Scores in the Ideal Answer category, we found that the IISR-2 system in Batch 2 achieved the
highest score in the FIN metric within the Exact Answer category. This suggests that if the final
ranking score calculation remains the same as last year, we would secure the first position in
Batch 2 Exact Answer.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion And Conclusions</title>
      <p>In this year’s competition, we observed widespread use of Generative Transformers. However,
training a Generative Transformer model efectively is often challenging in typical scenarios.
Therefore, we heavily rely on pre-trained large-scale language models that already demonstrate
a certain level of generality. Our results in this competition reflect this observation, as the GPT
model far exceeded our fine-tuned BioBert model.</p>
      <p>When most participants utilize OpenAI’s API to generate results, it becomes crucial to guide
ChatGPT in providing the expected answers. Specifically, the most critical aspect is how to
provide key prompts without exceeding the token limit.</p>
      <p>Our high performance in Batch 2 indirectly indicates our strategies’ efectiveness. While
it is dificult to precisely analyze which strategy contributed the most to the improvement
in performance, the summarized explanation of our strategies includes: 1) Using the Assistant
role to directly incorporate snippets, 2) Simultaneously addressing both Exact Answer and Ideal
Answer, 3) Having ChatGPT respond in a fixed JSON format, and 4) Summarizing excessively
long snippets before processing.</p>
      <p>Despite our eforts, we have observed that some other teams performed better in this
competition. Therefore, we have been reflecting on why there were disparities in performance
despite using the same model. We believe that there are still many areas for improvement.
For example, we can employ more scientific methods to determine which snippets should be
referenced, conduct more rigorous validation and evaluation of experimental results, and even
explore whether to use simple English words or include subject pronouns in the prompts.</p>
      <p>By continuously seeking ways to enhance our approach, we aim to bridge the performance
gap and achieve better results in future iterations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lima-López</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Farré-Maduell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gasco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Krallinger</surname>
          </string-name>
          , G. Paliouras, Overview of bioasq
          <year>2023</year>
          :
          <article-title>The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF</source>
          <year>2023</year>
          ),
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nentidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Katsimpras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krithara</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Paliouras, Overview of bioasq tasks 11b and synergy11 in clef2023</article-title>
          ,
          <source>in: Working Notes of CLEF 2023 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H.-H.</given-names>
            <surname>Ting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J.-C. Han, R. T.
          <string-name>
            <surname>-H. Tsai</surname>
          </string-name>
          ,
          <article-title>Ncu-iisr/as-gis: Using bertscore and snippet score to improve the performance of pretrained language model in bioasq 10b phase b</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          <volume>3180</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Möller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jayakumar</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Pietsch, COVID-QA: A question answering dataset for COVID-19</article-title>
          , in:
          <source>Proceedings of the 1st Workshop on NLP for COVID-19 at ACL</source>
          <year>2020</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .nlpcovid19-acl.18.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] OpenAI, Gpt-4
          <source>technical report</source>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-ofthought prompting elicits reasoning in large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2201</volume>
          .
          <fpage>11903</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>