<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Dataset for the Fine-tuning of LLM for the NER Task in the Cyber Security Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Silvestri</string-name>
          <email>stefano.silvestri@icar.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Felice Russo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Tricomi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mario Ciampi</string-name>
          <email>mario.ciampi@icar.cnr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Cyber Threat Intelligence, Named Entity Recognition, Cybersecurity, Large Language Model, NLP</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <addr-line>Naples, 80131</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for High Performance Computing and Networking, National Research Council of Italy (ICAR-CNR)</institution>
          ,
          <addr-line>Via Pietro Castellino 111</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>The increasing complexity of cyber threats necessitates robust cyber security measures. Effective threat detection and mitigation depend on Cyber Threat Intelligence, which includes structured and unstructured data critical for proactive defense strategies. While databases like the NVD and ExploitDB offer structured security information, a significant amount of vital intelligence initially appears in unstructured formats, such as blogs, mailing lists, and news sites. Extracting meaningful information from these sources is particularly challenging in cyber security, requiring specialized Named Entity Recognition (NER) tools to identify domain-specific entities. This paper presents a NER dataset obtained by merging two cyber security domain datasets, CyNER and APTNER, creating a unified resource that enhances NER model training. Experimental results with advanced NER models show significant performance gains, underscoring the value of the proposed dataset in advancing cyber security practices and highlighting the need for such resources.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In today’s interconnected digital landscape, cyber security remains a critical concern, due to the
proliferation of sophisticated cyber threats and vulnerabilities across global networks. The timely
identification and mitigation of these threats rely heavily on Cyber Threat Intelligence (CTI), which
encompasses the structured and unstructured information essential for preemptive defense strategies.
While structured databases like NVD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or ExploitDB [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provide valuable and well-defined security
information, a significant amount of critical intelligence emerges initially in unstructured formats, often
in natural language, such as blogs, mailing lists, and news sites [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These sources contain valuable
and constantly updated information about cyber threats, vulnerabilities, risks, and mitigation strategies,
but their unstructured nature, due to the intrinsic complexity of natural language, often delays the classification
and integration of new information into structured databases. The challenge of extracting actionable
intelligence from unstructured sources is exacerbated in cyber security, where domain-specific
entities require specialized Named Entity Recognition (NER) tools, often integrated with ontologies
and Knowledge Bases [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], because general-purpose NER tools trained on broad corpora often fail to
capture the specialized terminology and entity types in cyber security reports [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. As a result, there is
a need for domain-specific datasets that facilitate the training and evaluation of NLP models capable
of extracting cyber threat indicators with high accuracy and relevance [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. In recent years, efforts
such as the development of the APTNER dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have aimed to address this need, providing a
substantial corpus for training and evaluating NER models in the CTI domain. However, existing
datasets often lack the scale and diversity necessary to comprehensively cover the breadth of cyber
threat scenarios encountered in practice. The advent of Large Language Models (LLMs) has introduced
new perspectives in the NLP domain, significantly improving, among other things, the understanding and
extraction of domain-specific entities from complex unstructured texts. These models offer the potential
to bridge the gap between structured and unstructured CTI sources, enabling more timely and accurate
threat detection and response and allowing the development of effective NLP-based cyber security
tools and methods [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ]. On the other hand, LLMs need to be fine-tuned on the specific task
and domain, but, as mentioned above, the cyber security domain lacks effective annotated
NER resources. Therefore, this paper proposes to merge two prominent cyber security NER datasets:
CyNER [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and APTNER [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. CyNER aggregates a wide array of openCTI from diverse sources, and
complements APTNER’s focus on structured NER tasks. By merging these datasets, we aim to establish a
more comprehensive resource that enriches the availability of NER datasets for the research community,
and also supports enhanced threat detection, incident response, and vulnerability mitigation strategies.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset</title>
      <p>
        APTNER [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and CyNER [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], two datasets for NER in the cyber security domain, were combined to
create a new merged dataset. Each source has different definitions and classifications for entities,
so a mapping scheme between their labels is required. To balance efficiency and simplicity
while improving coverage, we adopted the CyNER scheme and extended it with
one entity type from APTNER (i.e., Secteam). The APTNER annotation scheme comprises a larger
label set that could be remapped into the final annotation scheme, and standardizing entity types and
annotations makes it easier to create strong NER models for improved cyber security analysis and
response. Moreover, the labels used in APTNER include several subtypes of CyNER labels, so a natural
mapping was defined using the scheme shown in Table 1, aggregating
some subtypes into a single type. The annotation scheme of the combined cyber security NER dataset
comprises seven labels, with the following meanings:
        <list list-type="order">
          <list-item><p>Indicator represents information useful to identify the resource compromised or the technology
affected by the attack.</p></list-item>
          <list-item><p>Malware represents all possible threat elements extracted from the corpora, such as actions, actors,
software, techniques, and so on.</p></list-item>
          <list-item><p>Secteam represents the group that announced the identified vulnerability.</p></list-item>
          <list-item><p>System represents operating systems, software, and hardware.</p></list-item>
          <list-item><p>Vulnerability represents both CVE IDs and mentions of exploits.</p></list-item>
          <list-item><p>Organization represents companies, organizations, institutions, brands, and others.</p></list-item>
          <list-item><p>Other includes the additional and generic entity types that are not annotated in one of the
considered datasets and cannot be mapped to a specific category of the other one.</p></list-item>
        </list>
      </p>
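      <p>The label remapping described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the actual correspondence is the one given in Table 1, so the source label names and the pairs in APTNER_TO_MERGED are illustrative assumptions, as are the BIO tag format and the helper names remap_tag and remap_sentence. Labels missing from the map fall back to Other.</p>
      <preformat><![CDATA[
```python
# Hypothetical subtype-to-type map; the real mapping is the paper's Table 1.
APTNER_TO_MERGED = {
    "MAL": "Malware",
    "IP": "Indicator",
    "DOM": "Indicator",
    "URL": "Indicator",
    "MD5": "Indicator",
    "OS": "System",
    "VULID": "Vulnerability",
    "VULNAME": "Vulnerability",
    "SECTEAM": "Secteam",
}


def remap_tag(tag, mapping, fallback="Other"):
    """Map one BIO tag (e.g. 'B-MAL') into the merged scheme.

    The 'O' tag is left untouched; unmapped labels become the fallback type.
    """
    if tag == "O":
        return "O"
    prefix, _, label = tag.partition("-")
    return f"{prefix}-{mapping.get(label, fallback)}"


def remap_sentence(tags, mapping):
    """Remap every tag of a sentence, keeping the B-/I- prefixes."""
    return [remap_tag(tag, mapping) for tag in tags]


# Toy sentence with assumed source tags.
example = ["B-MAL", "O", "B-OS", "I-OS", "B-TIME"]
remapped = remap_sentence(example, APTNER_TO_MERGED)
print(remapped)
```
]]></preformat>
      <p>Keeping the mapping in a plain dictionary makes it easy to audit against the published table and to swap in a different scheme.</p>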
      <p>As a result, we obtained a merged dataset, further split into a training set and a test set (approximately
70% and 30%, respectively), whose statistics and entity-type distributions are summarized in Table 2.</p>
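      <p>A split of this kind could be produced along the following lines. The text does not state how the split was drawn, so the random sentence-level shuffle, the fixed seed, and the function name split_sentences are assumptions made only for illustration.</p>
      <preformat><![CDATA[
```python
import random


def split_sentences(sentences, train_fraction=0.7, seed=13):
    """Shuffle a copy of the sentences and cut it at train_fraction.

    Assumption: a random sentence-level split; the original procedure
    (e.g. document-level or stratified) is not specified.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for annotated sentences.
toy = [f"sentence-{i}" for i in range(10)]
train_set, test_set = split_sentences(toy)
print(len(train_set), len(test_set))
```
]]></preformat>
      <p>With strongly unbalanced classes, a stratified split that preserves the entity-type proportions in both parts may be preferable to a purely random one.</p>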
    </sec>
    <sec id="sec-3">
      <title>3. Experimental Evaluation</title>
      <p>
        We assess the proposed merged dataset by using
it to fine-tune LLMs for NER and comparing the results with those obtained
using the two original datasets. We fine-tuned two models tailored to the cyber security domain,
namely SecBERT<sup>1</sup> and SecureBERT [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], both available on the Hugging Face LLM repository. These
models, respectively built upon BERT [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and RoBERTa [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] architectures, have been pre-trained on
large corpora in the cyber security domain and have been shown to provide improved
results when fine-tuned for NLP tasks in the same domain. In addition to these models, we also trained
two baseline models, BERT-base-cased and RoBERTa-base, as benchmarks for comparison
against the specialized models. The evaluation is based on standard NER metrics (Precision, Recall, and F1) calculated
at the token level. The results, summarized in Table 3, provide insights into each model’s
effectiveness and generalization capability on the cyber security NER task. Across all datasets,
the merged dataset significantly improves the NER
model results, showing that its richer and more diverse
entity coverage enhances performance even when used with general-domain LLMs. On the other hand, the datasets (including
the original ones) have some very unbalanced classes, which can limit model performance and
generalization capability and, in some experiments, led to lower performance than
expected.
      </p>
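      <p>Token-level Precision, Recall, and F1 can be computed as in the sketch below. The averaging mode is not stated in the text, so this version assumes micro-averaging over all tokens whose tag is not O and compares tags verbatim, including the B-/I- prefix. The function name token_level_prf and the toy sequences are hypothetical.</p>
      <preformat><![CDATA[
```python
def token_level_prf(gold, pred, outside="O"):
    """Micro-averaged token-level precision, recall, and F1.

    A token is a true positive when pred and gold carry the same
    non-outside tag. A wrong non-outside prediction is a false positive,
    and a missed or mislabeled gold entity token is a false negative.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            tp += 1
        else:
            if p != outside:
                fp += 1
            if g != outside:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy sequences: one missed token and one spurious prediction.
gold = ["B-Malware", "O", "B-System", "I-System", "O"]
pred = ["B-Malware", "O", "B-System", "O", "B-Indicator"]
precision, recall, f1 = token_level_prf(gold, pred)
print(precision, recall, f1)
```
]]></preformat>
      <p>Because the classes are very unbalanced, per-label scores computed the same way would show which entity types drive the aggregate figures.</p>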
      <p>The merged dataset, including the documentation on the annotation scheme, and the fine-tuned NER
models are publicly available on the SoBigData research infrastructure<sup>2</sup>.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions and Future Works</title>
      <p>This work presented a cyber security NER dataset, obtained by combining the APTNER and CyNER datasets,
with the aim of addressing the scarcity of open cyber security NER corpora and improving on the
performance obtained with the original ones. The merged dataset standardizes and harmonizes entity types
across different sources, providing a comprehensive and diverse set of annotations. Our experiments
demonstrated that the proposed dataset significantly enhances model performance, highlighting its
ability to improve the recognition capabilities of NER models in the cyber security domain.</p>
      <p>
        Future work could focus on further expanding the dataset by integrating additional cyber security
corpora to cover a wider range of entities and scenarios, as well as on reducing the class imbalance of the dataset,
for example by leveraging data augmentation techniques or semi-supervised annotation approaches [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Additionally,
exploring transfer learning techniques to apply the knowledge gained from this dataset to other related
tasks in cyber security, such as threat detection and incident response, could be highly beneficial. Finally,
it would be valuable to investigate the impact of different annotation schemes and entity definitions on
model performance, to further refine and optimize the dataset.
      </p>
      <sec id="sec-4-1">
        <p><sup>1</sup> https://github.com/jackaduma/SecBERT</p>
        <p><sup>2</sup> https://data.d4science.org/ctlg/ResourceCatalogue/cybersecurity_ner_securebert_model,
https://data.d4science.org/ctlg/ResourceCatalogue/cybersecurity_ner_roberta-base_model,
https://data.d4science.org/ctlg/ResourceCatalogue/cybersecurity_ner_bert-base-cased_model,
https://data.d4science.org/ctlg/ResourceCatalogue/cybersecurity_ner_dataset</p>
        <p>This work is supported by the European Union - NextGenerationEU - National Recovery and Resilience
Plan (Piano Nazionale di Ripresa e Resilienza, PNRR) - Project: “SoBigData.it - Strengthening the Italian
RI for Social Mining and Big Data Analytics” - Prot. IR0000013 - Avviso n. 3264 del 28/12/2021. We
thank Simona Sada and Giuseppe Trerotola for the administrative and technical support provided.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The author(s) have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Robert Byers</surname>
          </string-name>
          , Chris Turner,
          <article-title>National vulnerability database, national institute of standards and technology</article-title>
          , https://nvd.nist.gov (accessed on 1/9/2024),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Offsec</surname>
          </string-name>
          ,
          <article-title>Exploit data base</article-title>
          , https://www.exploit-db.com (accessed on 1/9/2024),
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Cybersecurity named entity recognition using multimodal ensemble learning</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>63214</fpage>
          -
          <lpage>63224</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ACCESS.2020.2985625</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.-M.</given-names>
            <surname>Georgescu</surname>
          </string-name>
          ,
          <article-title>Natural language processing model for automatic analysis of cybersecurity-related documents</article-title>
          ,
          <source>Symmetry</source>
          <volume>12</volume>
          (
          <year>2020</year>
          ). doi:
          <pub-id pub-id-type="doi">10.3390/sym12030354</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Satyapanich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ferraro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          ,
          <article-title>Casie: Extracting cybersecurity event information from text</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>8749</fpage>
          -
          <lpage>8757</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>I.</given-names>
            <surname>Deliu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Franke</surname>
          </string-name>
          ,
          <article-title>Extracting cyber threat intelligence from hacker forums: Support vector machines versus convolutional neural networks</article-title>
          ,
          <source>in: 2017 IEEE International Conference on Big Data (Big Data)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>3648</fpage>
          -
          <lpage>3656</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/BigData.2017.8258359</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bhusal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rastogi</surname>
          </string-name>
          ,
          <article-title>CyNER: A Python library for cybersecurity named entity recognition</article-title>
          ,
          <year>2022</year>
          . arXiv:2204.05754.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>APTNER: A specific dataset for NER missions in cyber threat intelligence field</article-title>
          ,
          <source>in: Proceedings of the 2022 IEEE 25th International Conference on Computer Supported Cooperative Work in Design (CSCWD)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1233</fpage>
          -
          <lpage>1238</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amelin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Weiler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papastergiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciampi</surname>
          </string-name>
          ,
          <article-title>Cyber threat assessment and management for securing healthcare ecosystems using natural language processing</article-title>
          ,
          <source>International Journal of Information Security</source>
          <volume>23</volume>
          (
          <year>2024</year>
          )
          <fpage>31</fpage>
          -
          <lpage>50</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s10207-023-00769-w</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Papastergiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tzagkarakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciampi</surname>
          </string-name>
          ,
          <article-title>A machine learning approach for the NLP-based analysis of cyber threats and vulnerabilities of the healthcare ecosystem</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          ). URL: https://www.mdpi.com/1424-8220/23/2/651. doi:
          <pub-id pub-id-type="doi">10.3390/s23020651</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Capodieci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sanchez-Adames</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Harris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Tatar</surname>
          </string-name>
          ,
          <article-title>The impact of generative AI and LLMs on the cybersecurity profession</article-title>
          ,
          <source>in: 2024 Systems and Information Engineering Design Symposium (SIEDS)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>448</fpage>
          -
          <lpage>453</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/SIEDS61124.2024.10534674</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>E.</given-names>
            <surname>Aghaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Niu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Shadid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Al-Shaer</surname>
          </string-name>
          ,
          <article-title>SecureBERT: A domain-specific language model for cybersecurity</article-title>
          ,
          <source>in: International Conference Security and Privacy in Communication Networks (SecureComm)</source>
          , Springer, Cham,
          <year>2023</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, NAACL-HLT 2019</source>
          , ACL, Minneapolis, MN, USA,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.18653/V1/N19-1423</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , CoRR abs/1907.11692 (
          <year>2019</year>
          ). arXiv:1907.11692.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Aracri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Folino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Integrated use of KOS and deep learning for data set annotation in tourism domain</article-title>
          ,
          <source>Journal of Documentation</source>
          <volume>79</volume>
          (
          <year>2023</year>
          )
          <fpage>1440</fpage>
          -
          <lpage>1458</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1108/JD-02-2023-0019</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>