<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Prompt Engineering Approach to Scientific Text Simplification: CYUT at SimpleText2023 Task3</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shih-Hung Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hong-Yi Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chaoyang University of Technology</institution>
          ,
          <addr-line>Taichung</addr-line>
          ,
          <country country="TW">Taiwan, R.O.C</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper reports our approach to the SimpleText lab. In 2023, we focus on Task 3: Rewrite this! Our system adopts the GPT3.5 and GPT4 generation models to rewrite the original sentences. We used different prompts to guide the models to generate simplified sentences under different guidelines. During system development, we used three metrics to evaluate the results. The official results are also reported.</p>
      </abstract>
      <kwd-group>
        <kwd>Simple Text Generation</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Evaluation of Text Simplification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Understanding scientific texts requires proper background knowledge and familiarity with academic terminology, which makes scientific texts hard to read. How to simplify scientific text automatically is the key point of the SimpleText lab. The CLEF 2023 SimpleText lab is an evaluation campaign that aims to assess the quality and usability of text simplification systems. Text simplification is the task of rewriting a text in a simpler way, while preserving its meaning and information content. The lab consists of three challenges of automatic text simplification:
• TASK 1: What is in (or out)? Given a query, a system has to find passages to include in a simplified summary.
• TASK 2: What is unclear? Given a passage and a query, a system has to rank terms that need to be explained for understanding this passage.
• TASK 3: Rewrite this! Given a passage from scientific abstracts, a system has to rewrite it into a simplified passage.
These tasks provide a common benchmark and dataset for comparing different approaches and measuring their impact on various aspects of text simplification, such as readability, comprehension, and preservation of meaning.</p>
      <p>This year, our team focused on Task 3. We describe our approach, how we evaluated our results, and the official results in the following sections.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Techniques in Our Approach</title>
      <p>
        GPT is a deep neural network that can generate natural language texts on various topics and domains. It is based on the transformer architecture, which uses attention mechanisms to learn the relationships between words and sentences. Transformer-based models have achieved state-of-the-art performance for abstractive summarization [<xref ref-type="bibr" rid="ref1">1</xref>] [<xref ref-type="bibr" rid="ref2">2</xref>] [<xref ref-type="bibr" rid="ref3">3</xref>]. To our knowledge, GPT4 [4] and T5 are the best ones. T5, or Text-to-Text Transfer Transformer [<xref ref-type="bibr" rid="ref1">1</xref>], is a Transformer-based architecture that uses a text-to-text approach. T5 can handle all NLP tasks in a text-to-text way with the help of a prompt, a leading text that specifies the goal the user wants this time. Usually, the prompts correspond to tasks the model was trained on. The latest GPT models provide a more flexible way to give prompts in natural language: the user can give detailed instructions for new tasks. Thus, how to design good prompts has become an important issue in using these models [5] [6]. In SimpleText 2022, our system adopted the T5 model and got promising generation results [7]. The major drawback of the GPT4 model is that it may not always produce factual or ethical texts, as it may inherit biases or errors from the data it was trained on. The GPT4 model may also fail to capture the nuances and contexts of human communication, such as sarcasm, humor, and irony, and it may require a lot of computational resources and energy to run and maintain. However, since it is a remarkable achievement in natural language processing and artificial intelligence, we explore its potential for scientific text simplification.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Model Comparison</title>
        <p>One of the main differences between T5 and GPT4 is their pre-training objectives. T5 is trained
on a large corpus of text using a text-to-text framework, where it learns to map any input text
to any output text. This allows T5 to perform a wide range of natural language tasks, such
as summarization, translation, question answering, and text generation, by simply changing
the output format. GPT4, on the other hand, is trained on a large corpus of text using an
autoregressive language modeling objective, where it learns to predict the next word given the
previous words. This allows GPT4 to generate fluent and coherent texts from a given prompt or
context, but it also limits its ability to perform other natural language tasks that require more
than word-level prediction.</p>
        <p>Another difference between T5 and GPT4 is their model architectures. T5 is based on the
Transformer encoder-decoder architecture, where it has two separate modules for encoding
the input text and decoding the output text. This enables T5 to capture both the semantic and
syntactic information from the input text and use it to generate the output text. GPT4 is based
on the Transformer decoder-only architecture, where it has a single module for generating the
output text from the input text. This simplifies the model design and reduces the computational
cost, but it also makes it harder for GPT4 to incorporate external knowledge or information
from the input text into the output text.</p>
        <p>T5 and GPT4 are both powerful natural language processing models that can generate high-quality texts from various inputs. We tested both models on the dataset with our self-evaluation and found that GPT4 generates better results. Therefore, we used GPT4 in the formal test.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Prompt Engineering</title>
        <p>Prompt engineering is the process of carefully designing prompts that are provided to machine
learning models, especially large language models like GPT-4, in order to guide their responses
or behavior. The prompts go with the inputs to the model and the responses can vary greatly
depending on how the prompts are crafted.</p>
        <p>There are several techniques that can be used in prompt engineering, including but not limited to:
1. In-Context Learning [8]: Learning from the prompt and previous interactions to produce relevant responses.
2. Zero-Shot Prompting [9]: Providing a task unseen during training, testing the model’s ability to use its learnt knowledge to respond.
3. Few-Shot Prompting [5]: Providing a few examples of the task before giving the actual prompt, helping the model understand the task.
4. Chain-of-Thought (CoT) Prompting [10]: Prompting the model to generate intermediate reasoning steps before giving its final answer.</p>
        <p>These techniques enable models to deliver more accurate and contextually relevant responses.</p>
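        <p>To make the zero-shot and few-shot styles above concrete, the following is a minimal Python sketch of how such prompts can be assembled. The helper names and the example sentence pair are illustrative assumptions, not taken from the lab data or our actual runs.</p>
        <preformat>
# A minimal sketch contrasting zero-shot and few-shot prompt construction.
# The example sentences below are illustrative, not from the SimpleText data.

def zero_shot_prompt(text: str) -> str:
    # Zero-shot: the task is described, but no worked examples are given.
    return f"Simplify the following sentence:\n{text}"

def few_shot_prompt(text: str, examples: list[tuple[str, str]]) -> str:
    # Few-shot: a handful of (complex, simple) pairs precede the actual input,
    # helping the model infer the expected task and output format.
    demos = "\n".join(f"Complex: {src}\nSimple: {tgt}" for src, tgt in examples)
    return f"Simplify each sentence.\n{demos}\nComplex: {text}\nSimple:"

examples = [
    ("The committee reached a consensus subsequent to extensive deliberation.",
     "The committee agreed after a long discussion."),
]
print(few_shot_prompt("Utilize the apparatus to facilitate measurement.", examples))
        </preformat>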
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>SimpleText’s data use the Citation Network Dataset: DBLP+Citation, ACM Citation network (12th version) as the source of scientific documents to be simplified. Scientific textual content and authorship on any topic related to computer science can be extracted from this corpus. For a detailed description, please read the overview paper [11]. The test set consists of two parts, which we refer to as the large data set and the small data set. The large data set contains 152,072 records, and the small data set contains 2,234 records (335 KB).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Our Approach</title>
      <p>In 2023, we focused on Task 3. Here we give the details of the prompts used in our runs and the self-evaluation results.</p>
      <sec id="sec-4-1">
        <title>4.1. Prompts Tested During System Development</title>
        <p>Since the generation results can be very different under different prompts, we tried three prompts and two models, listed in Table 1. Prompt design is an essential process for using natural language models effectively and responsibly. We manually asked the GPT4 model to improve the original prompt (“simplify the text”) and got the prompts below. Our system puts every test sentence into the {text} slot and calls the API provided by OpenAI.</p>
        <p>Prompt 1: Your task is to simplify the following sentences to make them easier to understand. Please note that your response should be flexible enough to allow for various relevant and creative simplifications, as long as they accurately convey the intended meaning. Please simplify this sentence:{text}</p>
        <p>Prompt 2: Your task is to simplify the following sentences to make them easier to understand. Please provide a clear and concise response that retains the original meaning of each sentence while removing any unnecessary complexity or jargon. Please note that your response should be flexible enough to allow for various relevant and creative simplifications, as long as they accurately convey the intended meaning. Please simplify this sentence:{text}</p>
        <p>Prompt 3: Simplify these sentences to make them easier to understand while retaining their meaning and avoiding complex language. Be creative in your simplification. Please simplify this sentence:{text}</p>
        <p>As we can see, GPT4 suggests that prompts should give detailed instructions, such as “easier to understand”, “various relevant”, “creative simplification”, “removing . . . jargon”, and “retaining . . . meaning”.</p>
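        <p>As an illustration of this pipeline, the following is a minimal sketch of filling the {text} slot of the third prompt and calling the OpenAI API, assuming the 2023-era openai Python package; the model name, temperature, and example sentence are illustrative assumptions rather than our exact run settings.</p>
        <preformat>
# A minimal sketch of querying the OpenAI chat API with a prompt template.
# Assumptions: the pre-1.0 `openai` package, model "gpt-4", temperature 0.7.
import openai

openai.api_key = "YOUR_API_KEY"  # supplied by the user

PROMPT_TEMPLATE = (
    "Simplify these sentences to make them easier to understand "
    "while retaining their meaning and avoiding complex language. "
    "Be creative in your simplification.\n"
    "Please simplify this sentence:{text}"
)

def simplify(sentence: str) -> str:
    # Fill the {text} slot with the source sentence and query the model.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(text=sentence)}],
        temperature=0.7,  # assumed value
    )
    return response.choices[0].message.content.strip()

print(simplify("The application is to be used by firefighting personnel in Greece."))
        </preformat>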
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Self-Evaluation</title>
        <p>Before sending the generated text to the organizers, we evaluated it with three metrics. Table 2 gives the evaluation results of the source dataset and our runs. Flesch reading ease (FRE) is an index that measures the reading level of sentences; the higher the score, the easier the text is for the reader [12]. In Table 2, we can see that the reading ease index value for run 2 is the highest. The Flesch-Kincaid grade level (FKGL) is an index that measures the corresponding reader level; the lower the score, the younger the reader [13]. It is a grade index, where a score of 10.x means that a tenth-grader would be able to read the text. Run 2 also has the lowest level, which drops from 14.61 to 9.59. The evaluation results in Table 2 show that, with the same model and input data, the prompt affects the results significantly. In the last column of Table 2, we list the percentage of academic words according to the list provided by Coxhead [14]. The percentage of academic words drops to 9.65% in run 2.</p>
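        <p>A minimal sketch of these three metrics follows, assuming the textstat package for FRE and FKGL and a local plain-text copy of Coxhead’s Academic Word List; the file name awl.txt and its one-headword-per-line format are assumptions for illustration.</p>
        <preformat>
# A minimal sketch of the three self-evaluation metrics.
# Assumes: pip install textstat, plus "awl.txt" with one AWL headword per line.
import re
import textstat

def academic_word_ratio(text: str, awl_path: str = "awl.txt") -> float:
    # Percentage of tokens that appear in Coxhead's Academic Word List.
    with open(awl_path, encoding="utf-8") as f:
        awl = {line.strip().lower() for line in f if line.strip()}
    words = re.findall(r"[a-zA-Z]+", text.lower())
    if not words:
        return 0.0
    return 100.0 * sum(w in awl for w in words) / len(words)

def evaluate_sentence(text: str) -> dict:
    return {
        "FRE": textstat.flesch_reading_ease(text),    # higher = easier to read
        "FKGL": textstat.flesch_kincaid_grade(text),  # lower = younger readers
        "academic_pct": academic_word_ratio(text),
    }

print(evaluate_sentence("The app helps Greek firefighters share information."))
        </preformat>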
        <p>Table 3 shows an example of the generation results in our evaluation, where FRE is the Flesch Reading Ease, FKGL is the Flesch-Kincaid Grade Level, and length is the number of characters in the sentence. We can see an increase in FRE and a decrease in FKGL and length in all four runs. The generated sentences preserve the meaning of the source sentence in this case. The country name, Greece, in the source sentence is replaced by the adjective, Greek, in all three GPT4-generated sentences. The last one is a little bit creative: it added “improving wildfire response”, which is not in the source sentence.</p>
        <p>Source: The application is to be used by firefighting personnel in Greece and is potentially expected to contribute towards a more sophisticated transferring of information and knowledge between wildfire confrontation operation centers and firefighting units in the field.</p>
        <p>Run 1 (FRE 52.53): This app will be used by firefighters in Greece to improve the sharing of information and knowledge between wildfire response centers and those working in the field.</p>
        <p>Run 2: The app helps Greek firefighters share information and knowledge between wildfire centers and field teams more easily.</p>
        <p>Run 3: The app helps Greek firefighters share important information and knowledge between operation centers and units fighting wildfires.</p>
        <p>Run 4 (FRE 54.22): The app will help Greek firefighters share information and knowledge between operation centers and field units, improving wildfire response.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Official Results</title>
      <p>We participated in the SimpleText challenge under the name "CYUT". The official evaluation results of our runs in Task 3 (CYUT_task_3_run1 to CYUT_task_3_run4) are listed in the following tables, where Max, Min, and Avg are the maximum, minimum, and average values of all runs in SimpleText2023. The major metrics are FKGL and SARI, two publicly available automatic evaluation metrics that were not used in the evaluation last year. We also used FKGL for self-evaluation while developing our system, but we had never used SARI before.</p>
      <p>SARI is a metric that evaluates how well automatic text simplification systems rewrite sentences to make them easier to read and understand [15]. SARI compares the predicted simplified sentences with the original sentences and the human references. SARI calculates the quality of the words that are added, deleted, or kept by the system, based on how they match the human references. SARI is a general metric that can capture the effects of different simplification operations, such as lexical paraphrasing, syntactic restructuring, or information deletion. This year, our first run achieved the highest SARI score, 47.98, among all participant runs. The corresponding FKGL levels are around grade 9, and the lexical complexity scores are also at the lowest level.</p>
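      <p>For reference, SARI can be computed with the Hugging Face evaluate package as in the minimal sketch below; the sentences are illustrative, and the official evaluation may use a different SARI implementation.</p>
      <preformat>
# A minimal sketch of computing SARI (pip install evaluate sacrebleu).
# The source, prediction, and references below are illustrative examples.
import evaluate

sari = evaluate.load("sari")

sources = ["The application is to be used by firefighting personnel in Greece."]
predictions = ["The app will be used by firefighters in Greece."]
references = [[
    "Firefighters in Greece will use the app.",
    "The app is for Greek firefighters.",
]]

result = sari.compute(sources=sources, predictions=predictions,
                      references=references)
print(result)  # e.g. {'sari': ...}
      </preformat>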
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Discussions</title>
      <p>In terms of generating sentences for Task 3, the results are much improved over last year's results. We observed that excess parts of the sentences were removed and academic terminology was replaced with common words, yielding the simplified sentences. The simplified sentences fully express the meaning of the original sentences. With the help of prompt engineering [5], the GPT4 model gives better results on Task 3 than the T5 model.</p>
      <p>Text simplification is the process of transforming a complex text into a simpler one, while preserving its meaning and information content. Current deep neural network models can give better results than old ones. However, evaluating the quality of simplified texts is getting more difficult.</p>
      <p>There are different aspects of simplicity, such as lexical, syntactic, semantic, and pragmatic. Lexical simplicity refers to the use of common and familiar words, syntactic simplicity refers to the use of short and simple sentences, semantic simplicity refers to the use of clear and unambiguous meanings, and pragmatic simplicity refers to the use of appropriate and relevant information for the intended audience. However, these aspects are not independent and may interact with each other in complex ways. Balancing simplicity with other quality criteria, such as adequacy, coherence, and informativeness, may cause conflicts between metrics.</p>
      <p>Automatic metrics, such as readability formulas, lexical diversity measures, or compression ratios, can provide objective and quantitative scores for some aspects of simplicity and quality, but they may not capture all the nuances and subtleties of human language. Human judgments, such as ratings, rankings, or preferences, can provide subjective and qualitative feedback on various aspects of simplicity and quality; they are costly but still necessary.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This study was supported by the National Science and Technology Council under the grant number NSTC 112-2221-E-324-014.</p>
    </sec>
    <sec id="sec-8">
      <title>References</title>
      <p>
[4] OpenAI, GPT-4 technical report, 2023. arXiv:2303.08774.
[5] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan,
P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in
neural information processing systems 33 (2020) 1877–1901.
[6] J. Huang, S. S. Gu, L. Hou, Y. Wu, X. Wang, H. Yu, J. Han, Large language models can
self-improve, arXiv preprint arXiv:2210.11610 (2022).
[7] S.-H. Wu, H.-Y. Huang, Cyut team2 simpletext shared task report in clef-2022, Proceedings
of the Working Notes of CLEF (2022).
[8] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, Z. Sui, A survey for
in-context learning, arXiv preprint arXiv:2301.00234 (2022).
[9] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021).
[10] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.
[11] L. Ermakova, E. SanJuan, S. Huet, O. Augereau, H. Azarbonyad, J. Kamps, Overview of simpletext - clef-2023 track on automatic simplification of scientific texts, in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2023), 2023.
[12] R. Flesch, How to write plain English: Let's start with the formula, University of Canterbury (1979).
[13] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability
formulas (automated readability index, fog count and flesch reading ease formula) for navy
enlisted personnel (1975).
[14] A. Coxhead, A new academic word list, TESOL quarterly 34 (2000) 213–238.
[15] W. Xu, C. Napoles, E. Pavlick, Q. Chen, C. Callison-Burch, Optimizing statistical machine
translation for text simplification, Transactions of the Association for Computational
Linguistics 4 (2016) 401–415.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>The Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>5485</fpage>
          -
          <lpage>5551</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          , L. Zettlemoyer, Bart:
          <article-title>Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          , arXiv preprint arXiv:
          <year>1910</year>
          .
          <volume>13461</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>