<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Taxonomy for Patent Classification: A Step Towards Intelligent Patent Analytics</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elham Motamedi</string-name>
          <email>elham.motamedi@ijs.si</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inna Novalija</string-name>
          <email>inna.koval@ijs.si</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Rei</string-name>
          <email>luis.rei@ijs.si</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Jožef Stefan Institute</institution>
          ,
          <addr-line>Ljubljana</addr-line>
          ,
          <country country="SI">Slovenia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>128</volume>
      <issue>417</issue>
      <fpage>26</fpage>
      <lpage>28</lpage>
      <abstract>
        <p>In this study, we proposed a knowledge taxonomy for patents, called KnowMap, which aligns with the CPC schema and reduces the number of classes to 83 at the lowest hierarchical level. We classified patents into these fine-grained classes within a multi-label setting, fine-tuning a distilled version of the RoBERTa model for this purpose. We employed two sampling techniques, (i) random sampling and (ii) conditional random sampling, and found that conditional random sampling led to less pronounced class imbalance, resulting in more generalisable outcomes. Additionally, our results showed higher F1-Macro scores for minority classes, which will be further explored in future work.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Taxonomy</kwd>
        <kwd>Knowledge Tracking</kwd>
        <kwd>Patent Classification</kwd>
        <kwd>Hierarchical Classification</kwd>
        <kwd>Multi-label Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Exploring and leveraging patent-related data is a key task in both scientific and industrial domains.
Patent analytics offers a comprehensive view of emerging innovative technologies across various fields.
Consequently, business and research initiatives, including European projects, depend on analysing and
enhancing patent datasets with specialised innovation-related taxonomies.</p>
      <p>
        One such initiative, the enRichMyData project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], provides an open software toolbox with practical,
robust, and scalable components. This toolbox supports organisations in enriching their data with
reference information they may not fully understand and aids data providers in making their data
reusable and accessible for data enrichment processes.
      </p>
      <p>
        In this paper, we propose a novel hierarchical knowledge taxonomy that aligns with the widely
used Cooperative Patent Classification (CPC) schema. The CPC classification system organises patents
into hierarchical taxonomies, which helps streamline internal processes and enhances the efficiency of
search queries. In the first level of the CPC hierarchy, there are nine sections, which are divided into
classes, subclasses, groups, and subgroups. Each level of this hierarchy can have several codes ending
in approximately 250,000 classification labels [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our taxonomy merges several class entities within
the CPC schema based on the scope of the knowledge field and the number of patents associated with
each class. This approach addresses the challenge of reducing the large number of class entities in the
CPC schema in a way that differs from previous works and provides a benchmark taxonomy for future
research. In this study, we also classified patents into the fine-grained classes defined by our proposed
taxonomy in a multi-label setting.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Patent documents contain various types of information, including text [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The textual content of a
patent is divided into several sections, such as the title, abstract, claim, and description [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The title
and abstract are shorter than the description but still provide relevant information for classification. Li
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] evaluated various lengths of the abstract and title, finding that using the first 100 words of title
and abstract resulted in the best classification performance in their study.
      </p>
      <p>
        Various classification systems exist for organising patents [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this work, we focus on the CPC
schema. Kamateri et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] discussed several potential challenges that artificial intelligence technologies
face in patent classification. One such challenge is the extensive number of class labels. As an example,
the CPC has around 250,000 labels.
      </p>
      <p>
        Patent classification is a multi-label classification problem since every patent can belong to several
knowledge fields [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Given the large number of classes at the lowest level of the taxonomy tree,
the performance of automatic models in predicting such fine-grained categories is limited [
        <xref ref-type="bibr" rid="ref4 ref8 ref9">4, 8, 9</xref>
        ].
Several previous studies have focused on higher levels of the hierarchy, limiting classification to broader
categories such as sections, classes, or subclasses within the taxonomy [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Bekamiri et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
fine-tuned the SBERT model to predict labels at the subclass level (i.e., 663 class labels) using a multi-label
formulation. Aroyehun et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] similarly truncated the IPC hierarchy at the subclass level and
predicted these labels by transferring knowledge from two higher levels (section and class) to the lower
level (subclass). While it remains valuable to use an automatic model that can narrow down applications
to higher levels of the taxonomy tree, this approach has limitations. One such limitation is that the
choice of target class labels does not depend on the scope of the knowledge area. More established and
expansive areas may benefit from directing experts to detailed groups, while less developed areas may
be adequately served by broader classifications.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods and Materials</title>
      <p>In this work, we developed a knowledge field taxonomy using CPC schema labels. We also classified
patents into KnowMap’s fine-grained classes by fine-tuning some pre-trained models.</p>
      <sec id="sec-3-1">
        <title>3.1. Data Acquisition and Pre-processing</title>
        <p>
          We used the Google Patents Public Datasets on BigQuery (https://github.com/google/patents-public-data) and applied preprocessing and sampling
techniques. The dataset contains various information, with the abstract offering a brief overview of the
patent’s novelty and the description providing more detail. For classification, we concatenated the title,
abstract, and description, filtering out documents with fewer than 100 words, as prior studies suggest
this improves classifier performance [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
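The concatenation-and-filter step described above can be sketched as follows (a minimal illustration; the field names are hypothetical and do not reflect the actual BigQuery schema):

```python
def build_text(patent, min_words=100):
    """Concatenate title, abstract and description; drop documents with
    fewer than 100 words, following the filtering step described above.
    (Field names are illustrative, not the BigQuery schema.)"""
    parts = (patent.get("title"), patent.get("abstract"), patent.get("description"))
    text = " ".join(p for p in parts if p)
    return text if len(text.split()) >= min_words else None

short_doc = {"title": "Widget", "abstract": "A widget.", "description": "It spins."}
print(build_text(short_doc))  # → None: fewer than 100 words, filtered out
```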
        <p>
          In developing the taxonomy, we considered both the shared knowledge across fields and the
distribution of documents within each defined class. To have sufficiently abstract classes, we set a threshold for
the minimum number of patents in each detailed group at the lowest level of the hierarchy. Prior to
counting the documents in each class, we applied a deduplication step as part of the preprocessing to
remove duplicate and near-duplicate texts, which may refer to the same patent [
          <xref ref-type="bibr" rid="ref12 ref13 ref14">12, 13, 14</xref>
          ].
        </p>
        <p>
          Deduplication was performed using Locality Sensitive Hashing (LSH) [
          <xref ref-type="bibr" rid="ref15 ref16 ref17">15, 16, 17</xref>
          ]. In particular,
we used MinHash to approximate the similarities between the documents. Each document was first
transformed into a set of n-grams (i.e., in our case 1-grams, 2-grams, and 3-grams). LSH then grouped
documents with similar signatures into the same buckets, ensuring that only documents within the
same bucket were compared in detail. A Jaccard similarity threshold of 0.9 was set, meaning documents
with a similarity score greater than 0.9 were considered duplicates. After deduplication, we generated
a dataset sample using two techniques: (i) random sampling and (ii) conditional random sampling,
which included documents in the sample only if their class had fewer than 20,000 documents. Random
sampling resulted in 1,092,991 samples, and conditional random sampling resulted in 1,244,469 samples.
        </p>
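The MinHash/LSH deduplication step can be sketched in pure Python as follows. This is a minimal stand-in, not the paper's implementation: the number of permutations (64) and bands (16) are assumptions, as are word-level shingles; only the 1-/2-/3-gram choice and the 0.9 Jaccard threshold come from the text.

```python
import hashlib
from itertools import combinations

def ngrams(text, ns=(1, 2, 3)):
    """Shingle a document into word 1-, 2- and 3-grams."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for n in ns for i in range(len(words) - n + 1)}

def minhash(shingles, num_perm=64):
    """MinHash signature: for each 'permutation' (salted hash), keep the minimum."""
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles)
            for seed in range(num_perm)]

def lsh_candidates(signatures, bands=16):
    """Bucket documents by signature bands; only same-bucket docs become pairs."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc)
    pairs = set()
    for docs in buckets.values():
        pairs.update(combinations(sorted(docs), 2))
    return pairs

def jaccard(a, b):
    return len(a & b) / len(a | b)

docs = {
    "p1": "a method for encoding video frames using block based motion estimation",
    "p2": "a method for encoding video frames using block based motion estimation",
    "p3": "a chemical composition for treating polymer surfaces before adhesive bonding",
}
shingled = {d: ngrams(t) for d, t in docs.items()}
sigs = {d: minhash(s) for d, s in shingled.items()}
# Compare in detail only the LSH candidates; keep pairs above the 0.9 threshold
dups = {p for p in lsh_candidates(sigs) if jaccard(shingled[p[0]], shingled[p[1]]) > 0.9}
print(dups)  # → {('p1', 'p2')}
```

The identical documents p1 and p2 collide in every band and pass the exact Jaccard check, while p3 shares only a couple of 1-grams with the others and is filtered out.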
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Knowmap Taxonomy Generation</title>
        <p>We developed the KnowMap taxonomy by refining the CPC hierarchy and its class entities to create a
more abstract representation of patents. Starting from the highest level, we manually merged groups
at each level based on shared knowledge and document counts. While all major CPC sections were
retained at the first level, groups with fewer than 40,000, 20,000, and 9,000 documents were merged at
levels 2, 3, and 4, respectively.</p>
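The count-based half of this merge rule can be sketched as below. This is only an illustration: the actual merging was done manually and also weighed shared knowledge between fields, and the CPC codes and document counts here are hypothetical.

```python
# Minimum documents per group at levels 2-4, from the thresholds above
THRESHOLDS = {2: 40_000, 3: 20_000, 4: 9_000}

def merge_small_groups(siblings, level):
    """siblings: dict of group label -> document count at one hierarchy level.
    Groups below the level's threshold are pooled into a single merged class."""
    threshold = THRESHOLDS[level]
    kept = {g: n for g, n in siblings.items() if n >= threshold}
    small = {g: n for g, n in siblings.items() if n < threshold}
    if small:
        kept["+".join(sorted(small))] = sum(small.values())
    return kept

# Illustrative level-3 sibling groups (counts are made up, not real CPC data)
level3 = {"B60K": 55_000, "B60L": 12_000, "B60R": 8_000, "B60T": 31_000}
print(merge_small_groups(level3, level=3))
# → {'B60K': 55000, 'B60T': 31000, 'B60L+B60R': 20000}
```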
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Patent Classification Method and Experimental Setup</title>
        <p>
          We formulated the classification problem as a multi-label problem, in which each document is assigned
to one or multiple knowledge fields. In this study, we aimed to classify the patents into the fine-grained
classes in the lowest level of the proposed taxonomy (i.e., 83 classes). We used the pre-trained language
model distilroberta-base, a distilled version of RoBERTa [
          <xref ref-type="bibr" rid="ref18 ref19">18, 19</xref>
          ]. To adapt this model for our classification
task, we fine-tuned it by adding a classification head using the AutoModelForSequenceClassification
class from the Hugging Face library (https://huggingface.co/). This classification head processes the hidden state of the first
token through a fully connected dense layer. Given that our task is multi-label classification, we applied
a sigmoid function to the output logits for each class to obtain probabilities. The implementation of
the classification method is available online (https://github.com/elmotamedi/KnowMap-Taxonomy).
        </p>
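The multi-label decision step can be sketched in plain Python. Note the 0.5 cutoff is our assumption for illustration; the paper does not state the threshold it used:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(logits, threshold=0.5):
    """Multi-label decision: each class gets an independent sigmoid
    probability (unlike softmax), so one patent can receive several
    of the 83 labels at once."""
    probs = [sigmoid(z) for z in logits]
    return [i for i, p in enumerate(probs) if p >= threshold]

# Hypothetical logits for 5 of the 83 classes for a single patent
logits = [2.1, -3.0, 0.4, -0.2, 1.5]
print(predict_labels(logits))  # → [0, 2, 4]
```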
        <p>
          For model training, we used a learning rate of 4e-5 with a linear scheduler, a weight decay of 0.1, and
trained for up to 5 epochs with early stopping. The best checkpoint was selected to prevent overfitting,
based on validation accuracy. The sampled datasets were split into training, validation, and test sets
with ratios of 0.8, 0.1, and 0.1, respectively. To maintain the class distribution across these sets, we used
stratified splitting (https://github.com/trent-b/iterative-stratification?tab=readme-ov-file#multilabelstratifiedkfold) proposed by Sechidis et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
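The goal of the stratified split can be illustrated with a toy multi-label dataset (illustrative data, not the patent corpus): a stratified splitter keeps each label's rate roughly equal across train, validation, and test, which a plain shuffle only approximates.

```python
import random
from collections import Counter

def label_distribution(split):
    """Fraction of documents carrying each label in a multi-label split."""
    counts = Counter(label for labels in split for label in labels)
    return {l: c / len(split) for l, c in counts.items()}

# Toy data: every 10th patent also carries the rare label "B"
random.seed(0)
data = [{"A", "B"} if i % 10 == 0 else {"A"} for i in range(1000)]
random.shuffle(data)
train, val, test = data[:800], data[800:900], data[900:]  # 0.8 / 0.1 / 0.1

# A stratified splitter would keep "B" at ~10% in every split; a plain
# shuffle only approximates this, hence the use of iterative stratification.
for name, split in [("train", train), ("val", val), ("test", test)]:
    print(name, round(label_distribution(split).get("B", 0.0), 3))
```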
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Analysis</title>
      <p>In this section, we first present the KnowMap taxonomy and then evaluate the performance of classifiers
in categorising patents into fine-grained classes of the taxonomy.</p>
      <sec id="sec-4-1">
        <title>4.1. KnowMap Taxonomy</title>
        <p>Following the methodology described in Sec. 3.2, we established a hierarchy with the root node as
level 0 and level 4 as the lowest level. There are nine classes at level 1 and 83 classes at the lowest level
of the hierarchy.</p>
        <p>Our hierarchical labels, including merged classes and document counts, are available online (https://github.com/elmotamedi/KnowMap-Taxonomy). The
taxonomy retains the nine CPC sections at the first level, while subsequent levels include merged CPC
groups, all detailed in the shared online source.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Classification Results</title>
        <p>In this study, we classified patents into the fine-grained classes at the lowest level of the hierarchical
taxonomy, which includes 83 labels. To gain further insights into the datasets generated by the two
sampling techniques, we analysed the number of samples per class in each dataset. Fig. 1 illustrates this
information through box plots for both sampling techniques. The plots highlight the first quartile (Q1),
median, third quartile (Q3), minimum, and maximum values for each sampling method.</p>
        <p>[Figure 1: Box plots of the number of samples per class. Random sampling: Min 59, Q1 4,862, Median 8,507, Q3 20,028. Conditional random sampling: Min 1,086, Q1 10,726, Median 15,006, Q3 23,928, Max 81,241.]</p>
        <p>Based on Fig. 1, random sampling resulted in a broader range of document counts per class, with a
minimum of 59 samples and a maximum of 128,417 samples per class. The conditional random sampling
technique produced a narrower range, with a minimum of 1,086 samples and a maximum of 81,241
samples per class. For our analysis, we categorise classes into three groups: small classes (those in
the first quartile), medium classes (those in the second and third quartiles), and large classes (those
above the third quartile). With this categorisation, conditional random sampling appears to offer more
balanced class distributions compared to random sampling, potentially enhancing the generalisability of
classification models trained on this dataset. We present the classification results on both the validation
and test sets, applied to the datasets generated by the two sampling techniques in Tab. 1.</p>
        <p>As observed from the results, the Macro-F1 score is higher than the Micro-F1 score, which may
indicate that the model performs better on minority classes than on majority classes. To gain more
insight into these results, we plotted the F1 scores against the number of documents in each class
(see Fig. 2).</p>
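To see how Macro-F1 can exceed Micro-F1 on an unbalanced dataset, consider hypothetical per-class confusion counts (illustrative numbers, not the paper's results): micro-averaging pools counts and is dominated by the large class, while macro-averaging weights the well-predicted small class equally.

```python
def f1(tp, fp, fn):
    """Per-class F1 from true-positive, false-positive, false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical counts: one large, heterogeneous class predicted poorly,
# one small, homogeneous class predicted well
classes = {
    "large": {"tp": 5_000, "fp": 3_000, "fn": 4_000},
    "small": {"tp": 90, "fp": 10, "fn": 10},
}

macro = sum(f1(**c) for c in classes.values()) / len(classes)
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
fn = sum(c["fn"] for c in classes.values())
micro = f1(tp, fp, fn)
print(round(macro, 3), round(micro, 3))  # → 0.744 0.592
```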
        <p>The plot shows that the Macro-F1 score is higher for minority classes compared to majority classes
for both sampling techniques. The gap between the line plots for random sampling and conditional
random sampling (Fig. 2b) highlights the presence of larger classes in the dataset created by conditional
random sampling. To provide further insights into the F1-macro scores for small, medium, and large
classes across each sample, we have created box plots summarising these scores for each class group
and sampling method. The minimum, median, and maximum F1-macro scores for each class group are
presented in Fig. 3. Although the increasing trend in F1-macro scores for smaller classes is still visible
under conditional random sampling, it is less pronounced than under random sampling.
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion and Conclusions</title>
      <p>In this work, we proposed a knowledge taxonomy that aligns with the CPC schema, reducing the
number of classes to 83 at the lowest level while ensuring a minimum number of documents for each
class in the studied dataset.</p>
      <p>We created two datasets from the preprocessed original data using two sampling techniques: (i)
random sampling and (ii) conditional random sampling. The conditional random sampling technique
resulted in class entities with a minimum of 1,086 samples, substantially more than the minimum
sample size achieved through random sampling. This suggests that the results from conditional random
sampling may be more generalisable compared to those from random sampling.</p>
      <p>In terms of performance, classifiers showed comparable results with both sampling techniques. Both
datasets were unbalanced, with the imbalance being less pronounced in the dataset created through
conditional random sampling. The classification results exhibited higher F1-Macro scores compared
to F1-Micro scores, likely due to the unbalanced nature of the datasets. We conjecture that the lower
F1-Macro scores for larger classes may result from the varied nature of documents within those classes,
possibly due to imprecise patent assignments in the CPC system or the broader scope of these knowledge
fields. Our future research will focus on analysing the classes that the classifier struggles with.</p>
      <p>To improve classification performance, we plan to address the dataset imbalance using techniques
specifically designed for multi-label classification with long-tailed distributions. Additionally, we aim to
explore the use of a larger or alternative pre-trained model to potentially enhance classification results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Slovenian Research and Innovation Agency under grant agreements
CRP V2-2272, V5-2264, CRP V2-2146 and the European Union through enRichMyData EU HORIZON-IA
project under grant agreement No 101070284.</p>
    </sec>
    <sec id="sec-7">
      <title>A. Online Resources</title>
      <p>The sources referenced in this paper, including the proposed taxonomy and the classification
implementation, are available at:
• KnowMap taxonomy and classification implementation: https://github.com/elmotamedi/KnowMap-Taxonomy
• Multi-label stratified K-fold implementation: https://github.com/trent-b/iterative-stratification?tab=readme-ov-file#multilabelstratifiedkfold
• Google Patents Public Datasets on BigQuery: https://github.com/google/patents-public-data
• Hugging Face library: https://huggingface.co/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] enRichMyData consortium,
          <article-title>enRichMyData project</article-title>
          , https://enrichmydata.eu.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kamateri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salampasis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez-Molina</surname>
          </string-name>
          ,
          <article-title>Will AI solve the patent classification problem?</article-title>
          ,
          <source>World Patent Information</source>
          <volume>78</volume>
          (
          <year>2024</year>
          )
          102294. doi:10.1016/j.wpi.2024.102294.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Melas-Kyriazi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Kominers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Shieber</surname>
          </string-name>
          ,
          <article-title>The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications</article-title>
          ,
          <source>in: 37th Conference on Neural Information Processing Systems (NeurIPS</source>
          <year>2023</year>
          )
          <article-title>Track on Datasets and Benchmarks</article-title>
          , NeurIPS,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>39</lpage>
          . arXiv:2207.04043.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cui</surname>
          </string-name>
          , J. Hu,
          <article-title>DeepPatent: patent classification with convolutional neural networks and word embedding</article-title>
          ,
          <source>Scientometrics</source>
          <volume>117</volume>
          (
          <year>2018</year>
          )
          <fpage>721</fpage>
          -
          <lpage>744</lpage>
          . doi:10.1007/s11192-018-2905-5.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <source>A survey of automated hierarchical classification of patents, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          <volume>8830</volume>
          (
          <year>2014</year>
          )
          <fpage>215</fpage>
          -
          <lpage>249</lpage>
          . doi:10.1007/978-3-319-12511-4_11.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. H.</given-names>
            <surname>Roudsari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Afshar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lee</surname>
          </string-name>
          <article-title>, Multi-label patent classification using attention-aware deep learning model</article-title>
          ,
          <source>in: Proceedings - 2020 IEEE International Conference on Big Data and Smart Computing</source>
          ,
          <source>BigComp</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>558</fpage>
          -
          <lpage>559</lpage>
          . doi:10.1109/BigComp48618.2020.000-2. arXiv:1910.01108.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>G.</given-names>
            <surname>Jung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Impact of preprocessing and word embedding on extreme multi-label patent classification tasks</article-title>
          ,
          <source>Applied Intelligence</source>
          <volume>53</volume>
          (
          <year>2023</year>
          )
          <fpage>4047</fpage>
          -
          <lpage>4062</lpage>
          . doi:10.1007/s10489-022-03655-5.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Fall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Törcsvári</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Benzineb</surname>
          </string-name>
          , G. Karetka,
          <article-title>Automated categorization in the international patent classification</article-title>
          ,
          <source>ACM SIGIR Forum</source>
          <volume>37</volume>
          (
          <year>2003</year>
          )
          <fpage>10</fpage>
          -
          <lpage>25</lpage>
          . doi:10.1145/945546.945547.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Haghighian Roudsari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Afshar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>PatentNet: multi-label classification of patent documents using deep learning based language understanding</article-title>
          ,
          <source>Scientometrics</source>
          <volume>127</volume>
          (
          <year>2022</year>
          )
          <fpage>207</fpage>
          -
          <lpage>231</lpage>
          . doi:10.1007/s11192-021-04179-4.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bekamiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Hain</surname>
          </string-name>
          , R. Jurowetzki,
          <article-title>PatentSBERTa: A deep NLP based hybrid model for patent distance and classification using augmented SBERT</article-title>
          ,
          <source>Technological Forecasting and Social Change</source>
          <volume>206</volume>
          (
          <year>2024</year>
          )
          <article-title>123536</article-title>
          . doi:10.1016/j.techfore.2024.123536.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Aroyehun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Angel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gelbukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hussain</surname>
          </string-name>
          ,
          <article-title>Leveraging label hierarchy using transfer and multi-task learning: A case study on patent classification</article-title>
          ,
          <source>Neurocomputing</source>
          <volume>464</volume>
          (
          <year>2021</year>
          )
          <fpage>421</fpage>
          -
          <lpage>431</lpage>
          . doi:10.1016/j.neucom.2021.07.057.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>G.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuzzocrea</surname>
          </string-name>
          , G. Manco,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ortale</surname>
          </string-name>
          ,
          <article-title>Data Deduplication: A Review</article-title>
          , Learning Structure and Schemas from Documents (
          <year>2011</year>
          ). doi:10.1007/978-3-642-22913-8.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kandpal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wallace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Deduplicating Training Data Mitigates Privacy Risks in Language Models</article-title>
          , in: International Conference on Machine Learning, Baltimore, volume
          <volume>162</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>10697</fpage>
          -
          <lpage>10707</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ippolito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nystrom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Carlini</surname>
          </string-name>
          ,
          <article-title>Deduplicating Training Data Makes Language Models Better</article-title>
          ,
          <source>Proceedings of the Annual Meeting of the Association for Computational Linguistics</source>
          <volume>1</volume>
          (
          <year>2022</year>
          )
          <fpage>8424</fpage>
          -
          <lpage>8445</lpage>
          . doi:10.18653/v1/2022.acl-long.577. arXiv:2107.06499.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>B.</given-names>
            <surname>Gyawali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Anastasiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          ,
          <article-title>Deduplication of scholarly documents using locality sensitive hashing and word embeddings</article-title>
          , in:
          <source>Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)</source>
          , European Language Resources Association,
          <year>2020</year>
          , pp.
          <fpage>894</fpage>
          -
          <lpage>903</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>O.</given-names>
            <surname>Jafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Maurya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nagarkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. M.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Crushev</surname>
          </string-name>
          ,
          <article-title>Survey on Locality Sensitive Hashing Algorithms and their Applications</article-title>
          ,
          <source>ACM Computing Surveys</source>
          (
          <year>2021</year>
          ). arXiv:2102.08942.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Aydar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ayvaz</surname>
          </string-name>
          ,
          <article-title>An improved method of locality-sensitive hashing for scalable instance matching</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>58</volume>
          (
          <year>2019</year>
          )
          <fpage>275</fpage>
          -
          <lpage>294</lpage>
          . doi:10.1007/s10115-018-1199-5.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          , arXiv preprint arXiv:1907.11692 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          , arXiv preprint arXiv:1910.01108 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Sechidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tsoumakas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Vlahavas</surname>
          </string-name>
          ,
          <article-title>On the stratification of multi-label data</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Gunopulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malerba</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vazirgiannis</surname>
          </string-name>
          (Eds.),
          <source>Machine Learning and Knowledge Discovery in Databases</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2011</year>
          , pp.
          <fpage>145</fpage>
          -
          <lpage>158</lpage>
          . doi:10.1007/978-3-642-23808-6_10.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>