<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Research on Paper Semantic Novelty Measurement Based on Large Language Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xinpeng Qiu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Li</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sun Yat-sen University</institution>, <addr-line>No. 132, Outer Ring East Road</addr-line>, <institution>University Town</institution>, <addr-line>Guangzhou, Guangdong Province</addr-line>, <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>This paper proposes a semantic novelty measurement model for scientific papers that uses a large language model to generate question and method words in a semi-supervised manner. LoRA fine-tuning and prompt words improve keyword generation accuracy and structural measurement. The model achieves 66.0% recall, 63.6% precision, and a 65.9% combined score, improving as the number of training samples grows. At 3,000 samples, the training set is cost-effective. The proposed method, which leverages fine-tuned large language models, is effective and robust.</p>
      </abstract>
      <kwd-group>
        <kwd>paper evaluation</kwd>
        <kwd>semantic novelty</kwd>
        <kwd>large language model</kwd>
        <kwd>natural language generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Prompt template design</title>
      <p>Instruct denotes the description of the keyword generation task, Example denotes the instance, Input denotes the input text, and Output denotes the output result requirement. The prompt template in this paper is designed so that "Instruct" describes the task objectives and contents clearly and specifically, while "Example" gives a concrete paper abstract together with the expected keyword generation result.</p>
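The four-part template described above can be sketched as a simple prompt builder. This is an illustrative reconstruction only: the paper does not give its exact template wording, so the strings and the `build_prompt` helper below are assumptions.

```python
def build_prompt(abstract: str) -> str:
    """Assemble a keyword-generation prompt from the four template parts:
    Instruct / Example / Input / Output. The wording is a placeholder,
    not the template actually used in the paper."""
    instruct = ("Instruct: Extract the question words and method words "
                "that summarize the following paper abstract.")
    example = ("Example: Abstract: 'We propose a graph neural network for "
               "traffic forecasting.' -> Keywords: traffic forecasting; "
               "graph neural network")
    input_part = f"Input: {abstract}"
    output_part = "Output: a semicolon-separated list of keywords."
    return "\n".join([instruct, example, input_part, output_part])

print(build_prompt("We study paper novelty measurement with LLMs."))
```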
    </sec>
    <sec id="sec-2">
      <title>2. Paper novelty score calculation</title>
      <p>After the earlier data acquisition, preprocessing, and fine-tuning of the LLaMA3 large language model, this study generates the keywords of each sample paper, takes the publication year of the sample paper as the reference point, and uses the fine-tuned LLaMA3 model to calculate semantic similarity. Specifically, we compare the occurrence frequency of these keywords in other research literature of the same field published earlier than the current sample paper; the fine-tuned LLaMA3 model is used throughout this comparison to ensure its accuracy and validity. We record the frequency of each keyword as reference data, denoted n(Q<sub>k</sub>). Substituting it into the calculation formula yields the semantic novelty score of the sample paper; the paper novelty measure is shown below.</p>
      <p>Nov<sub>n</sub> = (1/|Q|) ∑<sub>k=1</sub><sup>|Q|</sup> 1 / (ln[n(Q<sub>k</sub>) + 1] + 1)  ( 3 )</p>
      <p>where Nov<sub>n</sub> denotes the novelty score of the paper to be tested, |Q| is the number of its keywords, and n(Q<sub>k</sub>) denotes the frequency with which keyword Q<sub>k</sub> of the paper to be tested occurs in papers published before it.</p>
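Formula (3) can be implemented directly. The sketch below assumes the keyword frequencies n(Q<sub>k</sub>) have already been counted from earlier literature; a keyword never seen before contributes the maximum term value of 1.

```python
import math

def novelty_score(keyword_freqs):
    """Nov_n = (1/|Q|) * sum over k of 1 / (ln[n(Q_k) + 1] + 1).

    keyword_freqs: list of n(Q_k), the frequency of each keyword of the
    paper under test in earlier literature of the same field.
    """
    if not keyword_freqs:
        raise ValueError("paper must have at least one keyword")
    return sum(1.0 / (math.log(n + 1) + 1.0) for n in keyword_freqs) / len(keyword_freqs)

# A paper whose keywords never appeared before scores the maximum of 1.0:
print(novelty_score([0, 0, 0]))  # -> 1.0
```

Note that higher prior frequencies shrink each term toward 0, so well-worn keywords drag the novelty score down.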
    </sec>
    <sec id="sec-3">
      <title>3. Data collection and parameter setting</title>
      <p>This paper uses the Web of Science Core Collection as its data source, selecting scientific and technological papers from top Computer Science disciplines published in 2018-2019. The search criteria retrieved 15,348 papers. To evaluate semantic novelty, a complete database of the field was obtained, including paper titles, abstracts, citation frequencies, and JCR partitions. The study divided the sample dataset into a training set (6,100 papers from 2018, with subsets of 500, 1,500, and 3,000 samples) and a test set (9,300 papers from 2019) to build and evaluate the fine-tuned model. The GPU used in the experimental environment is an NVIDIA A800 SXM4, with Python 3.10 and PyTorch 2.0.1. The parameter settings for large model training are shown in Table 1.</p>
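The year-based split described above can be sketched as follows. The paper record format and the random seed are assumptions for illustration; only the split sizes (6,100 training papers from 2018 with 500/1,500/3,000 subsets, 9,300 test papers from 2019) come from the text.

```python
import random

# Stub paper records as (id, year) tuples: 6,100 from 2018 and 9,300 from 2019.
random.seed(42)  # seed chosen arbitrarily for reproducibility
papers = [(i, 2018) for i in range(6100)] + [(i, 2019) for i in range(9300)]

# Papers published in 2018 form the training pool; 2019 papers are held out.
train_pool = [p for p in papers if p[1] == 2018]
test_set = [p for p in papers if p[1] == 2019]

# Draw the three nested training subsets used to study sample-size effects.
subsets = {k: random.sample(train_pool, k) for k in (500, 1500, 3000)}

print(len(train_pool), len(test_set), sorted(subsets))
```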
    </sec>
    <sec id="sec-4">
      <title>4. Empirical analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Comparison with general large language models</title>
        <p>The representative general large language models Gemma2, Phi3, GPT-4, and LLaMA3 are selected for comparison with the fine-tuned model in this paper; the results are shown in Table 2.</p>
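The recall and precision figures behind such comparisons can be computed by matching generated keywords against reference keywords. The sketch below uses exact string matching after normalization; the paper's actual matching criterion is not specified, so this is an assumption.

```python
def keyword_prf(generated, reference):
    """Return (precision, recall) of generated keywords against a
    reference list, matching on lowercased, stripped strings."""
    gen = {k.strip().lower() for k in generated}
    ref = {k.strip().lower() for k in reference}
    hits = len(gen & ref)
    precision = hits / len(gen) if gen else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

p, r = keyword_prf(["Graph Neural Network", "traffic"],
                   ["graph neural network", "forecasting"])
print(p, r)  # -> 0.5 0.5
```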
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Ablation experiment</title>
        <p>Conclusion: LoRA fine-tuning and prompt-word fine-tuning each significantly improve the generation performance of the model in this paper.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Analysis of validity of paper novelty results</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgements</title>
      <p>The above empirical analysis shows that the fine-tuned LLaMA3 can effectively improve keyword generation for papers and thereby further optimize the measurement of paper semantic novelty.</p>
      <p>This research is supported by grants from the National Social Science Foundation of China
(22BTQ097).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used LLaMA3 to process the text data and extract abstract keywords. After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Chuanjun Suo, Miao Yu, Yanxin Pai and Juntao Rong. 2024. Theoretical framework of data-driven academic evaluation. Libr. Inf. Serv. 68, 1 (Jan, 2024), 5-12. DOI: https://doi.org/10.13266/j.issn.0252-3116.2024.01.001.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Zara Nasar, Syed Waqar Jaffry and Muhammad Kamran Malik. 2018. Information extraction from scientific articles: A survey. Scientometrics 117, 3 (Sept, 2018), 1931-1990. DOI: https://doi.org/10.1007/s11192-018-2921-5.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Jiajia Qian, Zhuoran Luo and Wei Lu. 2021. Novelty measurement and innovation type identification of scientific literature based on question-method combination. Libr. Inf. Serv. 65, 14 (Jul, 2021), 82-89. DOI: https://doi.org/10.13266/j.issn.0252-3116.2021.14.010.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Hong Huang, Chong Chen and Jingying Zhang. 2022. Review on identifying the semantics of scientific literature content. J. China Soc. Sci. Tech. Inf. 41, 9 (Jan, 2022), 991-1002.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Peter D. Turney. 2000. Learning algorithms for keyphrase extraction. Inf. Retr. 2, 4 (May, 2000), 303-336. DOI: https://doi.org/10.1023/A:1009976227802.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>