<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Twin (Green and Digital) Patents Identification: an Automated Patent Landscaping Method</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesca Ghinami</string-name>
          <email>francesca.ghinami@unica.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Cagliari, Department of Economics and Business</institution>
          ,
          <addr-line>Via Sant'Ignazio 17, 09123</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Identifying green, digital, and twin-transition patents is essential for tracking innovation and assessing policy impact, yet existing code-based and machine-learning approaches often yield non-overlapping results, undermining comparability and reproducibility. This study introduces a scalable framework that combines configurable keyword and technology rules for candidate identification, a rule-guided seed and antiseed definition, and bidirectional citation expansion. Patent texts are encoded with a domain-specific transformer, and final selection is achieved through topic-guided pruning based on a contrastive cosine rule applied to topic-level representations. Validation against proxy labels on a held-out split indicates high precision under a conservative threshold and balanced performance under a data-driven threshold. The workflow is automated, largely unsupervised, and tractable at the scale of millions of patent families, with results robust to sensible hyperparameter choices and threshold selection, thereby improving transparency and comparability for patent landscaping in the green and digital domains.</p>
      </abstract>
      <kwd-group>
        <kwd>Patent Landscaping</kwd>
        <kwd>Rule-based</kwd>
        <kwd>Topic-guided pruning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Identifying patents at the intersection of environmental and digital technologies (“twin patents”) is
pivotal for tracking innovation dynamics and evaluating policies that foster sustainability and
digitalization. Yet current approaches—ranging from examiner- or expert-selected technological codes and
curated keywords, to citation-based heuristics and machine-learning pipelines—often select markedly
different sets of documents [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], hampering comparability across studies and policy evaluations. Overlaps
between sets built with different methods are very low—with observed Jaccard indices below random
expectation, as reported in Table 3—underscoring the need for transparent, reproducible pipelines.
      </p>
      <p>To address fragmentation while preserving transparency and scalability, this work introduces an
automated workflow that minimizes manual intervention and combines rule-guided seed construction,
two-level bidirectional citation expansion of the candidates, and topic-guided semantic pruning with
patent-specific transformer embeddings.</p>
      <p>First, candidate “twin” patents are identified by systematically crossing green selection strategies from
the literature—namely the Cooperative Patent Classification (CPC) code Y02 and targeted sustainability
keywords—with digital strategies (CPC Y04 codes, selected technological groups from the International
Patent Classification (IPC), and digital/AI keywords). This candidate set is then used to automatically
derive three working sets: (i) the seed set, comprising high-precision exemplars of twin patents; a family
enters the Seed if it is flagged by more than two independent modules (i.e., at least three; a strict voting
rule); (ii) the expansion set, obtained by expanding the Candidates via forward and backward family-level
citations in two waves, to collect plausibly related families; and (iii) the antiseed set, a size-matched
control sampled outside both the Seed and the Expansion, designed to include mostly random non-twin
patents and a 10% share of hard negatives. These hard-to-classify negatives are sampled from patents
tagged as green or digital through Y02 and Y04 CPC codes, but not classified as jointly green-and-digital
by any method (see Section 3.2).</p>
      <p>
        A patent-tailored encoder (PaECTER, [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]), built on Bidirectional Encoder
Representations from Transformers (BERT), is used to obtain text embeddings, and a BERT-based topic
model (BERTopic, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) projects documents into topic space. The expansion set is then pruned via a
topic-guided criterion based on maximum cosine similarity to seed versus antiseed topics. Unsupervised
diagnostics indicate clearer topic separation and a rightward shift in cosine-similarity densities toward
seed topics after pruning. Finally, a pseudo-labeled evaluation set based on CPC Y02 ∩ Y04 codes—used
as an expert-curated proxy for twin patents—is constructed for validation and for selecting a robust
operating point by maximizing the Matthews correlation coefficient (MCC) on a stratified held-out
split. On the same pseudo-labeled set, supervised-style metrics (precision, recall, F1, MCC) remain high
under both a conservative threshold (τ = 0) and the MCC-maximizing threshold. A hyperparameter
sensitivity study across the topic-discovery pipeline—including the text representation, low-dimensional
projection, and density-based clustering stages—indicates that selection performance is insensitive to
reasonable variation in these settings. At scale, the pipeline runs on PATSTAT Autumn 2024, leveraging
structured metadata (technological codes, citations, abstracts and titles) from approximately 47M
patent families with an English abstract. Simple patent families, grouping patent applications and
publications sharing the same priority, are used as the unit of analysis.
      </p>
      <p>This article (i) proposes a reproducible, weakly-unsupervised seed definition procedure that integrates
different sets of conditions derived from the literature, with a strict voting rule to balance breadth and
precision; (ii) develops a citation-network aware expansion and matched random antiseed construction
to enable unsupervised, contrastive pruning at scale; (iii) introduces a topic-guided semantic pruning
strategy that couples domain-specific patent embeddings (PaECTER) with BERTopic topic distributions
and a seed-vs-antiseed cosine decision rule, yielding cleaner, more coherent landscapes; and (iv) delivers
a scalable, transparent pipeline with practical diagnostics and an accompanying Python implementation
for researchers and policy analysts.</p>
      <p>Section 2 reviews related methods and the overlap problem; Section 3 details seed rules, expansion,
and topic-guided pruning; Section 4 describes data; Section 5 reports diagnostics; Section 5.5 presents
the pseudo-label evaluation and robustness; Section 6 discusses limitations and future directions; Section
7 concludes.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Literature Review: Existing Methods, Potential and Limitations</title>
      <p>
        The classification and identification of green and digital technologies have become central to
understanding innovation dynamics in the context of the “twin transition”, which couples sustainability
objectives with digital transformation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Patents are widely used as proxies for innovation because
they contain detailed technical descriptions and structured classification codes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, accurately
identifying relevant patents—especially those that simultaneously address environmental and digital
domains—remains methodologically challenging.
      </p>
      <p>
        One group of approaches relies primarily on classification codes such as the International Patent
Classification (IPC) or the Cooperative Patent Classification (CPC). These codes are assigned by examiners
and enable systematic, replicable searches [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. While effective in principle, code-based methods often
misalign with industrial categories and fragment technologies across classes, limiting their precision
[
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ].
      </p>
      <p>
        Keyword-based searches offer a more flexible alternative, capable of capturing emerging or
cross-cutting technologies [
        <xref ref-type="bibr" rid="ref1 ref9">9, 1</xref>
        ]. However, this flexibility introduces challenges, including linguistic variability,
ambiguity, and potential biases in terminology that evolve over time or differ across jurisdictions [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Keyword methods also depend heavily on expert input to construct comprehensive queries, and can
suffer from endogeneity when used in combination with machine learning [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Citation-based techniques leverage references among patents to identify related inventions and trace
knowledge flows [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. While citations can reveal important technological linkages, they are also shaped
by examiner practices and strategic applicant behavior, leading to noise and incomplete coverage [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        Recent work has increasingly combined these strategies. Integrated approaches, such as those
underpinning the IPC Green Inventory or ENV-TECH classification systems, attempt to blend the
strengths of codes, keywords, and expert rules to improve recall and precision [
        <xref ref-type="bibr" rid="ref12 ref4">12, 4</xref>
        ]. However,
evidence suggests that even these comprehensive frameworks often yield low overlap across methods,
which limits their comparability and robustness [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Finally, advances in machine learning have introduced new possibilities for automating patent
classification. Supervised and semi-supervised models trained on expert-labeled seed sets can extend
coverage and reduce manual effort [
        <xref ref-type="bibr" rid="ref10 ref8">10, 8</xref>
        ]. Yet, the performance of these models depends critically on the
quality and representativeness of the training data [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Further, computational resource requirements
remain significant, particularly for models based on large transformer architectures such as BERT
(Bidirectional Encoder Representations from Transformers) [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and its domain-adapted variants [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ,
        <xref ref-type="bibr" rid="ref15">15</xref>
        ,
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
      <p>No single approach fully resolves coverage, accuracy, and scalability. This fragmentation, together
with low overlap across methods, motivates integrated, open, and reproducible pipelines such as the
one proposed here.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Methodology: towards an integrated approach</title>
      <p>
        This study builds on prior semi-supervised patent landscaping methods [
        <xref ref-type="bibr" rid="ref10 ref13 ref2 ref8">10, 8, 13, 2</xref>
        ] and introduces
several adaptations designed to improve replicability while minimizing human intervention. The
proposed framework substitutes manual seed and antiseed selection with rule-based criteria, integrates
bidirectional citation expansion, and applies transformer-based embeddings in combination with topic
modeling and a pruning strategy based on the cosine similarity [
        <xref ref-type="bibr" rid="ref17 ref18">17, 18</xref>
        ] of topic-probability distributions.
This design seeks to balance coverage, scalability, and semantic coherence in patent identification.
      </p>
      <p>
        Earlier approaches typically relied on human-curated seeds and antiseeds [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], followed by
expansions targeting overrepresented technological classes and citation networks. Subsequent classification
models, trained on these curated examples, distinguished relevant patents based on semantic features.
While effective, this process remained labor-intensive and sensitive to subjective decisions. To address
these limitations, a generalizable rule-based approach for both seed and antiseed selection is adopted.
A further improvement concerns the text classification model. Prior work often used static embeddings
(e.g., Word2Vec), which struggled with ambiguity and polysemy. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] tested alternative
architectures—including MLP, CNN, and BERT—and identified BERT-based Transformers as the most consistent performers.
Building on [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the proposed method relies on PaECTER, a transformer optimized for patent texts and
citations.
      </p>
      <p>
        Finally, given the heterogeneity of twin patents combining digital and sustainability-related
technologies, novel unsupervised methods are introduced to test the adherence of the expanded set to the seed.
Specifically, BERTopic modeling [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is applied to the seed, antiseed and expansion set texts (abstract and
title), and expansion candidates are pruned based on cosine similarity to topic distribution vectors. The
following sections describe each step in detail, while an overview of the method is shown in Figure 1. The
replication code to run this pipeline is available at https://github.com/GhinamiF/TwinPatentLandscape,
release v1.0.
      </p>
      <sec id="sec-4-1">
        <title>3.1. Rule-based seed selection</title>
        <p>
          The seed set forms the foundation of the patent landscape, making its composition critical to ensure
both accuracy and representativeness. To reduce reliance on manual curation and enhance replicability,
a rule-based approach inspired by the strategy proposed by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is applied.
        </p>
        <p>In their method, patents are identified as “twin” if they combine digital and green characteristics, based
on CPC codes and keyword presence. Specifically, they apply six selection rules that combine Y02 (green
technologies) and Y04 (digital technologies) CPC codes with keywords. Additionally, IPC codes are used
to capture relevant groups and subclasses. Given the comprehensive nature of this framework (details of
the keyword and code lists are reported in Table 1), their method is adopted and extended to allow broader
generalization to other technological domains. Unlike the original formulation, which applied a fixed set
of combinations, here an adaptation is suggested to systematically combine any green identification
strategy with any digital strategy in all possible pairings. This means that every sustainability-related
rule (e.g., Y02 codes or green keywords) was crossed with every digital or AI-related rule (e.g., Y04
codes, digital IPC codes, or digital keywords), generating an expanded and more granular set of inclusion
criteria. This ensures a balanced and comprehensive coverage of patents that may reflect diverse
configurations of digital and green technologies. A patent was thus included in the initial candidate pool
if it matched any of these combined rules. To improve precision and avoid reliance on any single
identification method, a stringent inclusion criterion is applied: only patents identified as twin by more
than two of the resulting combinations were retained in the seed set. This procedure yielded a final set
of 9,847 unique patent families. These patents were considered sufficiently diverse and representative of
the target technological intersection for subsequent expansion and pruning phases.</p>
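        <p>As an illustration, the strict voting rule can be sketched in a few lines of Python. The module outputs and family IDs below are hypothetical toy values; the released pipeline implements the full rule set from Table 1.</p>
        <preformat>
```python
from itertools import product

# Hypothetical module outputs: sets of patent family IDs flagged by each rule.
green_hits = {"Y02": {1, 2, 3, 5}, "green_kw": {2, 3, 4, 5}}
digital_hits = {"Y04": {2, 3, 5, 7}, "ipc_digital": {3, 5, 8}, "digital_kw": {1, 3, 5, 9}}

# Cross every green strategy with every digital strategy (all pairings).
modules = {
    (g, d): green_hits[g].intersection(digital_hits[d])
    for g, d in product(green_hits, digital_hits)
}

# Candidate pool: families flagged by at least one combined rule.
candidates = set().union(*modules.values())

# Strict voting rule: a family enters the seed only if flagged by
# more than two (i.e., at least three) independent modules.
votes = {}
for flagged in modules.values():
    for fam in flagged:
        votes[fam] = votes.get(fam, 0) + 1
seed = {fam for fam, v in votes.items() if v > 2}
```
        </preformat>
        <p>In this toy example only the families flagged by at least three of the six crossed modules survive into the seed, mirroring how the voting rule trades breadth for precision.</p>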
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Expansion</title>
        <p>The expansion methodology implemented in this study largely follows the approach outlined by Abood
and Feltenberger (2018), which involves a two-tiered expansion process. However, the first level of
their expansion, based on the most relevant CPC codes, is excluded here to mitigate the risk of over-relying
on this information, as CPC codes are already incorporated as a rule in the definition of seed patents.</p>
        <p>The first level of expansion (Level 1) involves identifying patents related to seed patents through
backward and forward citations. This level includes all patents that cite the seed patents or are cited by
them. Moreover, it is augmented by all the patents identified by any of the candidate selection methods,
but not included in the seed. The second level of expansion (Level 2) further extends this network by
including patents that are related to the patents identified in Level 1 through their own backward and
forward citations.</p>
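        <p>A minimal sketch of the two-wave, family-level bidirectional expansion follows; the citation pairs and family IDs are hypothetical, and the augmentation of Level 1 with non-seed candidates is omitted for brevity.</p>
        <preformat>
```python
# Hypothetical family-level citation pairs (citing_family, cited_family).
citations = [(1, 10), (10, 11), (2, 12), (13, 2), (12, 14), (14, 15)]

def one_wave(families, edges):
    """All families citing, or cited by, any family in the given set."""
    out = set()
    for citing, cited in edges:
        if citing in families:
            out.add(cited)
        if cited in families:
            out.add(citing)
    return out - families

seed = {1, 2}
level1 = one_wave(seed, citations)            # backward and forward citations of the seed
level2 = one_wave(level1, citations) - seed   # second wave, from Level 1 families
expansion = level1 | level2
```
        </preformat>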
        <p>The antiseed set, serving as negative examples for later semantic comparison, was generated by
randomly sampling an equal number of patent families not included in either the seed or expansion
sets.</p>
      </sec>
      <sec id="sec-4-3">
        <title>3.3. Transformer-based embeddings</title>
        <p>
          In patent classification and pruning, models can be trained from scratch or adapted from domain-specific
encoders. Training from scratch offers flexibility but typically requires substantial compute. A practical
alternative is a pre-trained model tailored to patents. This study uses the PaECTER encoder [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], as
BERT-based encoders have shown strong and consistent performance in patent analytics [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and
PaECTER, in particular, is reported as a top-performing patent-specific transformer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. As described in
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], PaECTER is trained on a patent-focused vocabulary and fine-tuned using a citation graph over a
large corpus of English-language patent families (PATSTAT 2023 Spring).
        </p>
        <p>
          Documents in the Seed, Antiseed, and Expansion sets—formed by concatenating title and
abstract, with all numerals removed and text lowercased—are
encoded jointly with PaECTER to obtain a shared embedding space that enables comparable topic
modeling and pruning. They are then organized with BERTopic [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], which combines UMAP [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]
and HDBSCAN [
          <xref ref-type="bibr" rid="ref20 ref21">20, 21</xref>
          ] to produce a topic-probability vector p_d ∈ [0, 1] for each document. As a
starting point, commonly used defaults are adopted—UMAP n_neighbors=15, min_dist=0.1, n_components=5;
HDBSCAN min_cluster_size=10, min_samples=None; BERTopic’s default vectorizer—and robustness
is assessed in sensitivity checks.
        </p>
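        <p>The light preprocessing applied before encoding (title–abstract concatenation, numeral removal, lowercasing) can be expressed as a small helper; the example document is invented for illustration.</p>
        <preformat>
```python
import re

def preprocess(title, abstract):
    """Concatenate title and abstract, drop numerals, and lowercase,
    as done before PaECTER encoding."""
    text = f"{title} {abstract}"
    text = re.sub(r"\d+", "", text)   # remove all numerals
    text = re.sub(r"\s+", " ", text)  # collapse leftover whitespace
    return text.strip().lower()

doc = preprocess("Smart Grid Control", "A 5G-enabled method for CO2 monitoring.")
```
        </preformat>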
        <p>The BERTopic mapping serves two roles: (i) it enables unsupervised diagnostics via topic-level
metrics (Section 3.5); and (ii) it provides the representation on which the topic-guided pruning operates
(Section 3.4 below).</p>
      </sec>
      <sec id="sec-4-4">
        <title>3.4. Topic-guided pruning via a multi-prototype contrast</title>
        <p>
          Prior work [
          <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
          ] commonly prunes expansion sets with a global cosine-similarity rule (e.g., keep items
sufficiently close to a seed centroid). For twin-transition patents—topically heterogeneous and lexically
overlapping with near domains—such global rules can drop legitimate variants and retain marginal
cases; empirically, seed and antiseed embedding distributions are only weakly separated.
        </p>
        <p>
          To select relevant documents from a large, noisy expansion set in an unsupervised manner, we
apply a topic-guided pruning strategy inspired by weak supervision and semi-supervised learning
[
          <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
          ]). Two reference groups are considered: seeds S (twin exemplars) and antiseeds A. To
capture sub-themes, multiple prototypes are learned by clustering seed and antiseed topic vectors into
K_S and K_A groups (MiniBatch k-means [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ]), yielding L2-normalized prototype centers Θ^(S) = {θ_1^(S), …, θ_{K_S}^(S)} and
Θ^(A) = {θ_1^(A), …, θ_{K_A}^(A)}.
        </p>
        <p>Each document d is represented by its topic-probability vector p_d; let p̃_d = p_d / ‖p_d‖_2
denote its L2-normalized version, so that cosine similarity reduces to a dot product. The pruning score
is a best-seed versus best-antiseed cosine margin:
Δ(d) = max_{k ≤ K_S} ⟨p̃_d, θ_k^(S)⟩ − max_{ℓ ≤ K_A} ⟨p̃_d, θ_ℓ^(A)⟩,
where the first maximum gives the similarity to the closest seed prototype and the second to the closest
antiseed prototype. A document is retained when Δ(d) ≥ τ. The primary operating point on the
unlabeled corpus is the zero-contrast rule (τ = 0), which keeps a document when it is at least as similar
to seed prototypes as to antiseed prototypes. This choice is parameter-light, interpretable, and does not
require labels.</p>
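        <p>The decision rule can be sketched in pure Python; the toy example below uses a single prototype per group and hypothetical 3-topic vectors.</p>
        <preformat>
```python
from math import sqrt

def l2_normalize(v):
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else v

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def margin(p, seed_prototypes, antiseed_prototypes):
    """Best-seed vs. best-antiseed cosine margin; prototypes are assumed
    already L2-normalized, so cosine reduces to a dot product."""
    p_t = l2_normalize(p)
    return max(dot(p_t, t) for t in seed_prototypes) - \
           max(dot(p_t, t) for t in antiseed_prototypes)

# Toy 3-topic example (hypothetical values).
seed_protos = [l2_normalize([0.9, 0.1, 0.0])]
anti_protos = [l2_normalize([0.0, 0.1, 0.9])]
doc = [0.8, 0.2, 0.0]

keep = margin(doc, seed_protos, anti_protos) >= 0.0  # zero-contrast rule (tau = 0)
```
        </preformat>
        <p>A document concentrated on seed-like topics yields a positive margin and is retained under the zero-contrast rule.</p>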
      </sec>
      <sec id="sec-4-5">
        <title>3.5. Evaluation: semantic coherence, separation, and dispersion</title>
        <p>
          We evaluate pruning within the BERTopic framework using established topic-model diagnostics. First,
an intertopic distance map projects topic representations into two dimensions to visualize distinctiveness
and dispersion, assessing whether pruning improves semantic coherence and separation [
          <xref ref-type="bibr" rid="ref27 ref3">27, 3</xref>
          ]. Second,
to assess document-level alignment, we plot kernel density estimates of the cosine similarity between
expansion documents and seed versus antiseed prototypes in topic space, examining whether
pruning increases alignment with seeds and reduces overlap with antiseeds [
          <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
          ]. Together, these
analyses provide global (topic structure) and local (document–prototype alignment) perspectives on
pruning quality in high-dimensional, embedding-based topic models.
        </p>
        <p>
          This approach combines the high recall of unsupervised expansion with the precision gains of
topic-guided selection, and builds on guided topic discovery [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], weakly supervised labeling [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and
zero-/few-shot cross-domain classification [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-6">
        <title>3.6. Validation and robustness</title>
        <p>In the absence of costly and hard-to-replicate human annotation, we validate our topic-guided pruning
against a proxy gold standard derived from CPC Y02/Y04 codes. These codes are curated jointly by
two patent authorities, the European Patent Office (EPO) and the United States Patent and Trademark
Office (USPTO), and assigned by trained examiners, providing a practical proxy for human labels.</p>
        <p>The evaluation pool comprises 29,032 patent families tagged with both Y02 and Y04 (treated as
positives) and 9,847 antiseed families (treated as negatives, by construction a mix of random non-twins
and hard negatives). We report standard retrieval metrics: precision (the fraction of selected patents
that are true twins), recall (the fraction of true twins that are selected), and F1 (the harmonic mean of
precision and recall, summarizing the precision–recall trade-off).</p>
        <p>The evaluation pool also allows us to fine-tune and test different thresholds for pruning.</p>
        <sec id="sec-4-6-1">
          <title>Threshold selection on a pseudo-labeled evaluation set.</title>
          <p>For quantitative assessment, a
pseudo-labeled evaluation pool is built from Y02 ∩ Y04 CPC families (positives) together with antiseeds (negatives).
A single 75/25 stratified train–test split is drawn from the pseudo-labeled evaluation pool. Thresholds
are selected by maximizing MCC on the train split; all metrics are reported on the held-out test split.</p>
          <p>On the training split, an MCC-optimal operating point is selected by
τ_MCC^Δ ∈ arg max_τ MCC(y, 1{Δ ≥ τ}),
and all supervised-style metrics (precision, recall, F1, MCC, accuracy, Jaccard and the confusion matrix)
are reported for both τ = 0 and τ_MCC^Δ.</p>
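          <p>The threshold search can be sketched as follows; the margin scores and pseudo-labels below are hypothetical toy values.</p>
          <preformat>
```python
from math import sqrt

def mcc(y_true, y_pred):
    """Matthews correlation coefficient from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def best_threshold(scores, labels):
    """Threshold maximizing MCC on the training split, scanning observed margins."""
    return max(set(scores), key=lambda t: mcc(labels, [s >= t for s in scores]))

scores = [0.9, 0.6, 0.2, -0.1, -0.4, -0.7]  # hypothetical margin values
labels = [1, 1, 1, 0, 0, 0]                 # 1 = Y02-and-Y04 pseudo-positive
tau_mcc = best_threshold(scores, labels)
```
          </preformat>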
        </sec>
        <sec id="sec-4-6-2">
          <title>Baseline for comparison (TF–IDF margin).</title>
          <p>
            As a robustness check, a sparse lexical baseline
is included that operates at the individual-document level in TF–IDF space [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ]. Let x_d be the L2-normalized TF–IDF vector of document d (built on seeds ∪ antiseeds).
The TF–IDF margin scorer is
s(d) = max_{j ∈ S} cos(x_d, x_j) − max_{j ∈ A} cos(x_d, x_j),
where the first maximum runs over the closest seed document and the second over the closest antiseed
document; a document is retained if s(d) ≥ τ.
          </p>
          <p>On the same training split, τ_MCC^TF–IDF is obtained by maximizing MCC, and test metrics are
reported at τ = 0 and τ_MCC^TF–IDF. Because TF–IDF compares documents directly at the lexical level, it
commonly achieves slightly higher supervised metrics on the evaluation split; however, Δ remains the
preferred primary scorer for the full unlabeled corpus, as it enables unsupervised diagnostics, topic-level
interpretability, and a principled τ = 0 operating point independent of labels, with both τ_MCC^Δ and the
TF–IDF baseline serving as robustness checks.</p>
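          <p>A compact sketch of the TF–IDF margin baseline on a toy corpus follows; the idf weighting is a simple illustrative variant, and the documents are invented.</p>
          <preformat>
```python
from collections import Counter
from math import log, sqrt

def tfidf_vectors(docs):
    """L2-normalized TF-IDF vectors (simple illustrative idf weighting)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    idf = {w: log(n / df[w]) + 1.0 for w in df}
    vecs = []
    for d in docs:
        tf = Counter(d.split())
        v = {w: c * idf[w] for w, c in tf.items()}
        norm = sqrt(sum(x * x for x in v.values()))
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def cos(u, v):
    return sum(x * v.get(w, 0.0) for w, x in u.items())

# Hypothetical mini-corpus: two seed, two antiseed, one expansion document.
docs = ["smart grid energy", "solar forecasting neural",
        "gearbox lubricant", "tyre rubber", "smart grid forecasting"]
vecs = tfidf_vectors(docs)
seeds, antiseeds, x = vecs[:2], vecs[2:4], vecs[4]

s = max(cos(x, v) for v in seeds) - max(cos(x, v) for v in antiseeds)
keep = s >= 0.0  # retain if the lexical margin is non-negative
```
          </preformat>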
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Data</title>
      <p>
        The empirical analysis is conducted using the Patstat Autumn 2024 database [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ], a comprehensive
dataset of global patent records maintained by the European Patent Office (EPO). This database includes
detailed bibliographic information, classification codes, citation linkages, and legal status for millions
of patent families worldwide. The database comprises 85,195,446 patent applications grouped into
66,798,016 DOCDB simple families, 47,068,344 of which include an English abstract. We restrict our
analysis to these families to enable text-based semantic analysis. In Patstat 2024 [32], every patent
application is assigned to a simple family, also known as the DOCDB family, which links applications
that share exactly the same priority [32]. This differs from the extended family (INPADOC family),
which links applications sharing a priority either directly or indirectly through a third application. The
choice of the simple family as the unit of analysis offers several advantages. It avoids the issue of
over-representation of seed inventions that are published under multiple IDs in different jurisdictions,
ensuring a more accurate representation of the related patents in the expansion set. Additionally, it
allows for the identification of all citation linkages, ensuring that the citation network is fully captured.
According to the EPO’s Data Catalog [32], simple family citations encompass both citations to patent
publications and applications.
      </p>
      <p>An example of the data collected is shown in Table 2.</p>
    </sec>
    <sec id="sec-6">
      <title>5. Results</title>
      <sec id="sec-6-1">
        <title>5.1. Candidates selection methods</title>
        <p>
          As an initial analysis, the identification strategies proposed by [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] are replicated. Using their modules,
a total of 238,717 patent families can be classified as ‘twin’ by at least one module. However, there is
minimal to no overlap among the patent sets identified by their different methods.
This highlights a challenge regarding the representativeness of the resulting seed set. To address this,
the strategies detailed in Section 3 are proposed, which combine the same set of information, reported
in Table 1, using alternative configurations. As reported in Table 3, this procedure yields 223,575 unique
patent families classified as twin by at least one method. The low overlap across methods remains, as
shown by the average Jaccard metric computed across all methods (0.064), which is lower than that of
random assignment (obtained with the same set sizes and 3,000 repetitions).
        </p>
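        <p>The overlap diagnostic can be reproduced as follows; the method outputs below are toy sets over a universe of 1,000 family IDs, whereas the paper uses the real method sets and 3,000 repetitions.</p>
        <preformat>
```python
import random

def jaccard(a, b):
    return len(a.intersection(b)) / len(a.union(b))

def avg_pairwise_jaccard(sets):
    pairs = [(i, j) for i in range(len(sets)) for j in range(i + 1, len(sets))]
    return sum(jaccard(sets[i], sets[j]) for i, j in pairs) / len(pairs)

def random_baseline(sets, universe, reps=3000, seed=0):
    """Expected average Jaccard under random assignment with the same set sizes."""
    rng = random.Random(seed)
    universe = list(universe)
    total = 0.0
    for _ in range(reps):
        rand_sets = [set(rng.sample(universe, len(s))) for s in sets]
        total += avg_pairwise_jaccard(rand_sets)
    return total / reps

# Hypothetical method outputs over a universe of 1,000 family IDs.
methods = [set(range(0, 60)), set(range(50, 120)), set(range(500, 570))]
observed = avg_pairwise_jaccard(methods)
expected = random_baseline(methods, range(1000), reps=200)
```
        </preformat>
        <p>When the observed average falls below the size-matched random baseline, as in this toy configuration, the methods overlap less than chance would predict.</p>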
      </sec>
      <sec id="sec-6-2">
        <title>5.2. Seed, antiseed and expansion set analysis</title>
        <p>Applying the most stringent criterion—requiring identification by more than two methods for inclusion
in the seed—results in a final set of 9,847 unique seed patent families, which can be considered the
most representative and thus suitable for a robust seed definition. By expanding the candidates set
twice through bidirectional citations, an expansion set comprising 1,918,509 unique patent families is
derived.</p>
      </sec>
      <sec id="sec-6-3">
        <title>5.3. Topic modeling results</title>
        <p>After preprocessing (lowercasing and removing numerals), the seed, antiseed, and expansion documents
are encoded and modeled with BERTopic. The model yields 86 topics for the overall set (Figure 2).
The intertopic distance visualization (Figure 2, panel (a)) provides an overview of the topics generated
by the BERTopic model for the overall set of seed, expansion and antiseed documents. Each point on
the visualization represents a topic, with its position determined by the model’s learned embeddings.
Topics that are close to one another on the map suggest similar themes, while those that are farther
apart indicate distinct conceptual areas. This visualization revealed a well-structured topic space, with
several clusters of topics grouped around core themes relevant to the twin patent corpus.</p>
        <p>Overall, the BERTopic model’s initial output provides a coherent representation of the twin patent
data’s thematic landscape. The visualization demonstrates a structured topic space with clear thematic
areas and reveals areas for further refinement in the subsequent pruning phase, where the expansion
set is filtered based on cosine similarity to seed topics.</p>
      </sec>
      <sec id="sec-6-4">
        <title>5.4. Pruning results</title>
        <p>By applying the pruning strategy described in Section 3.3 to the expansion set, 575,441 patent families
are retained.</p>
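<p>As a sketch of the pruning rule of Section 3.3 (illustrative only; function and variable names are hypothetical, and the pipeline's exact implementation may differ): an expansion document is retained when its best cosine alignment with seed topic vectors exceeds its best alignment with antiseed topic vectors by more than a threshold:</p>

```python
import numpy as np

def contrast_prune(doc_vecs, seed_topics, antiseed_topics, tau=0.0):
    """Keep documents whose max cosine similarity to seed topic vectors
    exceeds their max similarity to antiseed topic vectors by more than tau."""
    def unit(m):
        # L2-normalize rows so dot products become cosine similarities
        return m / np.linalg.norm(m, axis=1, keepdims=True)
    d, s, a = unit(doc_vecs), unit(seed_topics), unit(antiseed_topics)
    sim_seed = (d @ s.T).max(axis=1)   # best alignment with any seed topic
    sim_anti = (d @ a.T).max(axis=1)   # best alignment with any antiseed topic
    delta = sim_seed - sim_anti        # contrastive cosine score
    return np.where(delta > tau)[0], delta
```

<p>Applied to the expansion set embeddings, indices with a positive contrast score are kept.</p>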
        <p>
To assess the impact of the pruning strategy in an unsupervised setting, that is, in the absence of labels
(which precludes precision and recall analysis), a number of tests can be performed. First, the
topic analysis on the whole and pruned expansion set can be repeated, in order to compare the
Intertopic Distance Map [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] before and after pruning. Pre-pruning, topics appeared more crowded and
overlapped substantially, suggesting semantic redundancy and poor topic separation. After pruning,
the 2D projection revealed clearer topic dispersion, with reduced overlap and tighter clustering. This
suggests increased semantic distinctiveness and topical coherence, both indicators of a better-defined
topic space [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Highlighting this visually helps validate the pruning process not just by number
reduction but by structural improvement in the latent topic space. After pruning, topics appear tighter
and more coherent and, while more numerous than in the overall set (Figure 2), they are fewer than
in the expansion set (147 vs 156), indicating that pruning removes broader, disjointed topics while
preserving fine-grained themes.
        </p>
        <p>Finally, comparing panels (b) and (c) in Figure 2 shows a greater average inter-topic distance in the
expansion set than in the pruned one, confirming that broader and disjointed topics were effectively
dropped.</p>
        <p>
          Subsequently, to evaluate the semantic alignment of the expansion set with seed and antiseed topics,
the Kernel Density Estimation (KDE) plots of the maximum cosine similarity between each expansion
document and (i) seed topics and (ii) antiseed topics are produced. These plots provide a smooth
estimation of similarity density across the corpus [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. As shown in Figure 3, after pruning, the KDE
curve shifted toward higher similarity with seed topics and away from antiseeds, indicating that retained
documents are more aligned with the intended thematic focus and less with undesired content. This is
consistent with effective filtering in vector space, aligning with known techniques in bias detection and
semantic drift analysis [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ].
        </p>
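<p>The KDE diagnostic can be reproduced with a minimal fixed-bandwidth Gaussian estimator over the per-document maximum cosine similarities (an illustrative sketch; the bandwidth rule used for the paper's plots is an assumption here):</p>

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth=0.05):
    """Evaluate a fixed-bandwidth Gaussian KDE of `samples` at `grid` points."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # sum of Gaussian kernels centered at each sample, normalized to a density
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * bandwidth * np.sqrt(2 * np.pi))
```

<p>Evaluating this on the max seed-similarities before and after pruning makes the rightward shift of the retained documents visible.</p>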
        <p>These diagnostics confirm that pruning was both selective and semantically discriminative, reducing
noise and enhancing alignment with the seed topics.</p>
      </sec>
      <sec id="sec-6-5">
        <title>5.5. Validation against CPC pseudo-labels</title>
        <p>In the first two rows of Table 4 we report validation of the baseline method (the multi-prototype
topic contrast Δ) at the conservative threshold τ = 0 and at a data-driven threshold τ_MCC selected by
maximizing the Matthews correlation coefficient on a separate development split of the labeled pool.
At the conservative operating point (τ = 0), Δ attains Precision = 0.973, F1 = 0.928, and Jaccard = 0.866.
Using the data-driven threshold, Δ improves to F1 = 0.956 with higher recall (Recall = 0.981) and slightly
lower precision (Precision = 0.932). This operating point typically increases recall (and overall F1) while
allowing a controlled rise in false positives. Although the tuned threshold is favored in terms of recall,
F1, and overlap, the conservative rule performs well and remains attractive when labels are unavailable.
Practitioners can choose the operating point that matches their objective: the label-free τ = 0 rule when
high precision and simplicity are paramount, or the MCC-selected threshold when broader coverage
(higher recall) is preferred and labels are available.</p>
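<p>The data-driven threshold selection can be sketched as follows (illustrative; names are hypothetical): candidate thresholds are swept on the development split and the MCC-maximizing one is kept.</p>

```python
import numpy as np

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels/predictions."""
    tp = np.sum((y_true == 1) & (y_pred == 1)); tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1)); fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def select_tau(delta_dev, y_dev, n_grid=101):
    """Return the threshold on the contrast score that maximizes MCC."""
    grid = np.linspace(delta_dev.min(), delta_dev.max(), n_grid)
    scores = [mcc(y_dev, (delta_dev > t).astype(int)) for t in grid]
    return grid[int(np.argmax(scores))]
```
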
        <p>(a) Precision, (b) Recall, and (c) F1: Δ vs. TF–IDF.</p>
        <p>The same table also reports test performance for a TF–IDF max-margin variant evaluated at =0 and
at its own TF–IDF–specific MCC-selected threshold. Δ operates in topic space to provide interpretable,
prototype-aligned pruning, whereas TF–IDF ofers a document-space robustness baseline. At =0 , Δ
achieves high precision with slightly higher recall than TF–IDF (F1 = 0.928 vs. 0.935). With a data-driven
threshold, Δ improves to F1 = 0.956 with a strong recall gain, while TF–IDF reaches a higher balanced
score overall (F1 = 0.978).</p>
        <p>Despite TF–IDF’s gain on this proxy-labeled test, we retain Δ as the primary pruning rule because it
(i) operates in the same topic space that underpins our diagnostics, yielding interpretable prototype-level
decisions; (ii) reduces lexical leakage from seed phrasing and is less sensitive to vectorizer settings
and vocabulary drift; and (iii) provides a transparent, label-free τ = 0 policy that performs well on the
unlabeled corpus. We therefore report both scorers, using Δ for the main selection and TF–IDF as a
robustness check, and include threshold-sensitivity curves in Figure 4 to make the trade-offs explicit.
Both scorers display the expected trade-off: as τ increases, precision rises while recall falls, and F1 peaks
on a broad plateau. The TF–IDF margin reaches the highest peak F1 and attains near-perfect precision
at more conservative τ, dominating the upper-right region of the curves. The multi-prototype Δ
increases precision more gradually and preserves relatively higher recall near τ = 0, providing a smooth,
interpretable operating range aligned with the topic-space diagnostics. In practice, TF–IDF at its
data-driven τ_MCC is preferable for balanced performance when labels (or proxy labels) are available,
whereas Δ near τ = 0 is a well-behaved default for label-free, topic-aligned pruning. Both methods are
stable over a wide plateau of τ values, so small threshold shifts do not materially affect results.</p>
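<p>The threshold-sensitivity curves can be traced with a simple sweep over the same binary-label setup (a sketch; names are illustrative):</p>

```python
import numpy as np

def sweep_threshold(delta, y_true, taus):
    """Trace (tau, precision, recall, F1) as the pruning threshold increases."""
    curves = []
    for t in taus:
        pred = delta > t
        tp = np.sum(pred & (y_true == 1))   # correctly retained
        fp = np.sum(pred & (y_true == 0))   # wrongly retained
        fn = np.sum(~pred & (y_true == 1))  # wrongly pruned
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        curves.append((float(t), float(p), float(r), float(f1)))
    return curves
```

<p>Raising the threshold trades recall for precision, matching the monotone behavior described above.</p>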
      </sec>
      <sec id="sec-6-6">
        <title>5.6. Sensitivity analysis across BERTopic hyperparameters</title>
        <p>To assess how the pruning threshold τ in the topic-space contrast score Δ = cos(p, s̄) − cos(p, ā)
affects retrieval quality, and how this behavior varies with modeling choices, the topic model is
refit over a grid spanning the three stages of the pipeline: (i) text vectorization (bag-of-words with
n-grams), controlling vocabulary granularity via min_df ∈ {2, 5}; (ii) low-dimensional projection (UMAP),
controlling local neighborhood size and layout via n_neighbors ∈ {15, 30, 50} and min_dist ∈ {0.0, 0.1};
(iii) density-based clustering (HDBSCAN), controlling cluster granularity and treatment of noise via
min_cluster_size ∈ {10, 30} and min_samples ∈ {None, 5}. For each configuration, topic probabilities
are obtained, the contrast score Δ is recomputed, and τ is swept to trace precision/recall/F1 curves.
Across configurations, curves cluster tightly around τ = 0 and around a fixed τ_MCC (chosen once on
a default configuration by maximizing Matthews correlation on a held-out split), with only minor
dispersion. Peak F1 consistently occurs near τ ≈ 0–0.1; performance degrades only for large positive τ,
where recall collapses (a regime not used operationally). Overall, selection performance is insensitive
to reasonable variation in vectorizer, projection, and clustering hyperparameters, supporting τ = 0 as a
default operating point and τ_MCC as a robustness check.</p>
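<p>The sensitivity grid enumerated above amounts to 48 configurations (2 × 3 × 2 × 2 × 2). A minimal sketch of its enumeration follows; the actual refitting requires the BERTopic/UMAP/HDBSCAN libraries and is omitted here:</p>

```python
from itertools import product

# Hyperparameter grid mirroring the three pipeline stages described above.
GRID = {
    "min_df": [2, 5],                 # vectorizer vocabulary granularity
    "n_neighbors": [15, 30, 50],      # UMAP local neighborhood size
    "min_dist": [0.0, 0.1],           # UMAP layout compactness
    "min_cluster_size": [10, 30],     # HDBSCAN cluster granularity
    "min_samples": [None, 5],         # HDBSCAN noise handling
}

def configurations(grid):
    """Yield every hyperparameter combination as a dict."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```
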
        <p>(a) Precision, (b) Recall, and (c) F1 vs. threshold τ.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>6. Limitations and future direction of work</title>
      <p>While the proposed pipeline demonstrates strong performance in identifying thematically coherent
and contextually relevant patents, several limitations open avenues for further enhancement. In the
expansion and pruning phases, citation networks were leveraged to build a semantically rich candidate
set. To avoid redundancy and overfitting, these same features were excluded from the classification
phase, opting instead for a single-input BERT architecture that processes concatenated abstracts and
titles. This design ensured that distinct sets of information were exploited in diferent phases, reducing
the risk of data leakage or circular logic.</p>
      <p>
        However, treating all textual fields as a single input may dilute domain-specific signals, such as
those embedded in CPC codes or reference patterns. A multi-input BERT architecture, where each
modality (e.g., abstract, CPC, references) is processed independently before feature fusion, could better
preserve the structural and semantic nuances of each information type. For instance, recent work
by [
        <xref ref-type="bibr" rid="ref13">13</xref>
          ] demonstrates the effectiveness of such architectures in patent retrieval and classification tasks.
Future work could therefore explore the development of an unsupervised, multi-input BERT framework
tailored to the patent domain to test and capture domain-specific relevance more explicitly.
      </p>
      <p>In the absence of human annotations, a pseudo-labeled set based on overlap with CPC codes is used.
While defensible as a proxy for human curation, these labels are not independent of the construction
rules; reported metrics should therefore be interpreted as upper-bound estimates. Future work will
validate beyond CPC proxies and explore multi-input models that fuse text, classification codes, and
citation structure.</p>
    </sec>
    <sec id="sec-8">
      <title>7. Conclusions</title>
      <p>This work introduces an automated, scalable framework for identifying twin (green ∩ digital) patents
that integrate green and digital technologies. By combining rule-based seed selection, bidirectional
citation expansion, transformer-based embeddings, and topic-guided pruning, the methodology
addresses persistent limitations of earlier approaches, including limited coverage, poor overlap across
methods, and lack of reproducibility.</p>
      <p>Empirical results indicate that the proposed framework yields patent sets with greater topical
relevance and improved semantic coherence, as shown by clearer intertopic separation and a KDE shift
toward seed topics after pruning. Specifically, the combination of PaECTER embeddings, BERTopic
modeling, and topic-guided pruning is shown to enable the effective filtering of heterogeneous patent
corpora while minimizing human intervention. Sensitivity analyses indicate small variation across
reasonable UMAP/HDBSCAN settings and cosine-similarity thresholds, suggesting a robust pruning
score.</p>
      <p>Overall, this approach provides a scalable, transparent basis for monitoring innovation dynamics and
informing policy in the twin transition.</p>
    </sec>
    <sec id="sec-9">
      <title>8. Acknowledgments</title>
      <p>This study was funded by the European Union - NextGenerationEU, Mission 4, Component 2, in the
framework of the GRINS -Growing Resilient, INclusive and Sustainable project (GRINS PE00000018 –
CUP F53C22000760007). The views and opinions expressed are solely those of the authors and do not
necessarily reflect those of the European Union, nor can the European Union be held responsible for
them.</p>
    </sec>
    <sec id="sec-10">
      <title>9. Declaration of Generative AI</title>
      <p>During the preparation of this work, the author used ChatGPT4.5 to perform grammar and spelling
checks and to paraphrase and reword text. After using this tool/service, the author reviewed and edited
the content as needed and takes full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Favot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vesnic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Priore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bincoletto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Morea</surname>
          </string-name>
          ,
          <article-title>Green patents and green codes: How different methodologies lead to different results</article-title>
          , Resources,
          <source>Conservation &amp; Recycling Advances</source>
          <volume>18</volume>
          (
          <year>2023</year>
          )
          <article-title>200132</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.rcradv.
          <year>2023</year>
          .
          <volume>200132</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Erhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. E.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Buunk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harhoff</surname>
          </string-name>
          , PaECTER:
          <article-title>Patent-level representation learning using citation-informed transformers</article-title>
          ,
          <year>2024</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.2402.19411.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <article-title>BERTopic: Neural topic modeling with a class-based tf-idf procedure</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2203.05794. arXiv:
          <volume>2203</volume>
          .
          <fpage>05794</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jindra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leusin</surname>
          </string-name>
          ,
          <article-title>The development of digital sustainability technologies by top R&amp;D investors</article-title>
          ,
          <source>Technical Report JRC130480</source>
          , Joint Research Centre (JRC),
          <source>European Commission</source>
          ,
          <year>2022</year>
          . URL: https://publications.jrc.ec.europa.eu/repository/handle/JRC130480. doi:
          <volume>10</volume>
          .2760/150239.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Griliches</surname>
          </string-name>
          ,
          <article-title>Patent statistics as economic indicators: A survey</article-title>
          ,
          <source>Journal of Economic Literature</source>
          <volume>28</volume>
          (
          <year>1990</year>
          )
          <fpage>1661</fpage>
          -
          <lpage>1707</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>European</given-names>
            <surname>Patent</surname>
          </string-name>
          <string-name>
            <surname>Office</surname>
          </string-name>
          ,
          <article-title>Guide to the Cooperative Patent Classification (CPC</article-title>
          ), https://www.cooperativepatentclassification.org/,
          <source>2016. Accessed</source>
          <year>2025</year>
          -
          <volume>08</volume>
          -25.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Flostrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pitt</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Bridson,</surname>
          </string-name>
          <article-title>The delphi technique in forecasting - a 42-year bibliographic analysis (</article-title>
          <year>1975</year>
          -2017),
          <source>Technological Forecasting and Social Change</source>
          <volume>150</volume>
          (
          <year>2020</year>
          )
          <article-title>119773</article-title>
          . doi:
          <volume>10</volume>
          . 1016/j.techfore.
          <year>2019</year>
          .
          <volume>119773</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bergeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Verluise</surname>
          </string-name>
          ,
          <article-title>Identifying technology clusters based on automated patent landscaping</article-title>
          ,
          <source>PLOS ONE 18</source>
          (
          <year>2023</year>
          )
          <article-title>e0295587</article-title>
          . doi:
          <volume>10</volume>
          .1371/journal.pone.
          <volume>0295587</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Capello</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          <article-title>Lenzi, 4.0 technologies and the rise of new islands of innovation in european regions</article-title>
          ,
          <source>Regional Studies</source>
          <volume>55</volume>
          (
          <year>2021</year>
          )
          <fpage>1724</fpage>
          -
          <lpage>1737</lpage>
          . doi:
          <volume>10</volume>
          .1080/00343404.
          <year>2021</year>
          .
          <volume>1964698</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Abood</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Feltenberger</surname>
          </string-name>
          ,
          <source>Automated patent landscaping, Artificial Intelligence and Law</source>
          <volume>26</volume>
          (
          <year>2018</year>
          )
          <fpage>103</fpage>
          -
          <lpage>125</lpage>
          . doi:
          <volume>10</volume>
          .1007/s10506-017-9217-1.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Lampe</surname>
          </string-name>
          , Strategic citation,
          <source>The Review of Economics and Statistics</source>
          <volume>94</volume>
          (
          <year>2012</year>
          )
          <fpage>320</fpage>
          -
          <lpage>333</lpage>
          . doi:
          <volume>10</volume>
          . 1162/REST_a_
          <fpage>00146</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>I.</given-names>
            <surname>Haščič</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Migotto</surname>
          </string-name>
          ,
          <article-title>Measuring environmental innovation using patent data</article-title>
          ,
          <source>OECD Environment Working Papers 89</source>
          ,
          <string-name>
            <given-names>OECD</given-names>
            <surname>Publishing</surname>
          </string-name>
          ,
          <year>2015</year>
          . URL: https://www.oecd.org/en/publications/measuring
          <article-title>-environmental-innovation-using-patent-data_5js009kf48xw-en</article-title>
          .
          <source>html. doi:10</source>
          .1787/ 5js009kf48xw-en.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>Bekamiri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jurowetzki</surname>
          </string-name>
          ,
          <article-title>Patentsberta: A deep nlp based hybrid model for patent distance and classification using augmented sbert</article-title>
          ,
          <source>Technological Forecasting and Social Change</source>
          <volume>206</volume>
          (
          <year>2024</year>
          )
          <article-title>123536</article-title>
          . doi:
          <volume>10</volume>
          .1016/j.techfore.
          <year>2024</year>
          .
          <volume>123536</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of NAACL-HLT</source>
          <year>2019</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . URL: https://aclanthology.org/N19-1423/.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rietzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stabinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Opitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Engl</surname>
          </string-name>
          ,
          <article-title>Adapt or get left behind: Domain adaptation through BERT language model finetuning for aspect-target sentiment classification</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Béchet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Blache</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Choukri</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Cieri</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Declerck</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Goggi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Isahara</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Maegaard</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mariani</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Mazo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Moreno</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Odijk</surname>
          </string-name>
          , S. Piperidis (Eds.),
          <source>Proceedings of the Twelfth Language Resources and Evaluation Conference</source>
          , European Language Resources Association, Marseille, France,
          <year>2020</year>
          , pp.
          <fpage>4933</fpage>
          -
          <lpage>4941</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          .lrec-
          <volume>1</volume>
          .607/.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.-S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsiang</surname>
          </string-name>
          ,
          <article-title>Patent classification by fine-tuning bert language model</article-title>
          ,
          <source>World Patent Information</source>
          <volume>61</volume>
          (
          <year>2020</year>
          )
          <article-title>101965</article-title>
          . URL: https://www.sciencedirect.com/science/article/pii/S0172219019300742. doi:https://doi.org/10.1016/j.wpi.
          <year>2020</year>
          .
          <volume>101965</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>C. D. Manning</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schütze</surname>
          </string-name>
          , Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Steck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ekanadham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kallus</surname>
          </string-name>
          ,
          <article-title>Is cosine-similarity of embeddings really about similarity</article-title>
          ?,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.05440. doi:
          <volume>10</volume>
          .1145/3589335.3651526. arXiv:
          <volume>2403</volume>
          .
          <fpage>05440</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Saul</surname>
          </string-name>
          , L. Großberger, Umap:
          <article-title>Uniform manifold approximation and projection</article-title>
          ,
          <source>Journal of Open Source Software</source>
          <volume>3</volume>
          (
          <year>2018</year>
          )
          <article-title>861</article-title>
          . URL: https://doi.org/10.21105/joss.00861. doi:
          <volume>10</volume>
          .21105/joss.00861.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>R. J. G. B. Campello</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Moulavi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Sander</surname>
          </string-name>
          ,
          <article-title>Density-based clustering based on hierarchical density estimates</article-title>
          , in: J.
          <string-name>
            <surname>Pei</surname>
            ,
            <given-names>V. S.</given-names>
          </string-name>
          <string-name>
            <surname>Tseng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Motoda</surname>
          </string-name>
          , G. Xu (Eds.),
          <source>Advances in Knowledge Discovery and Data Mining</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2013</year>
          , pp.
          <fpage>160</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R. J. G. B.</given-names>
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moulavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          ,
          <article-title>Hierarchical density estimates for data clustering, visualization, and outlier detection</article-title>
          ,
          <source>ACM Trans. Knowl. Discov. Data</source>
          <volume>10</volume>
          (
          <year>2015</year>
          ). URL: https://doi.org/10.1145/2733381. doi:10.1145/2733381.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Zero-shot text classification via reinforced self-training</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          , Association for Computational Linguistics,
          <year>2020</year>
          , pp.
          <fpage>3014</fpage>
          -
          <lpage>3024</lpage>
          . URL: https://aclanthology.org/2020.acl-main.272/. doi:10.18653/v1/2020.acl-main.272.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Multi-label dataless text classification with topic modeling</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>61</volume>
          (
          <year>2019</year>
          )
          <fpage>137</fpage>
          -
          <lpage>160</lpage>
          . doi:10.1007/s10115-018-1280-0.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ratner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Bach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Varma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ré</surname>
          </string-name>
          , et al.,
          <article-title>Snorkel: rapid training data creation with weak supervision</article-title>
          ,
          <source>The VLDB Journal</source>
          <volume>29</volume>
          (
          <year>2020</year>
          )
          <fpage>709</fpage>
          -
          <lpage>730</lpage>
          . doi:10.1007/s00778-019-00552-1.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Sculley</surname>
          </string-name>
          ,
          <article-title>Web-scale k-means clustering</article-title>
          ,
          <source>in: Proceedings of the 19th International Conference on World Wide Web, WWW '10</source>
          , Association for Computing Machinery, New York, NY, USA,
          <year>2010</year>
          , pp.
          <fpage>1177</fpage>
          -
          <lpage>1178</lpage>
          . URL: https://doi.org/10.1145/1772690.1772862. doi:10.1145/1772690.1772862.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Lloyd</surname>
          </string-name>
          ,
          <article-title>Least squares quantization in PCM</article-title>
          ,
          <source>IEEE Trans. Inf. Theory</source>
          <volume>28</volume>
          (
          <year>1982</year>
          )
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>C.</given-names>
            <surname>Sievert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shirley</surname>
          </string-name>
          ,
          <article-title>LDAvis: A method for visualizing and interpreting topics</article-title>
          ,
          <source>in: Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces</source>
          , Association for Computational Linguistics, Baltimore, Maryland,
          <year>2014</year>
          , pp.
          <fpage>63</fpage>
          -
          <lpage>70</lpage>
          . doi:10.3115/v1/W14-3110.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing</source>
          , Association for Computational Linguistics, Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>3982</fpage>
          -
          <lpage>3992</lpage>
          . URL: https://aclanthology.org/D19-1410/.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>H.</given-names>
            <surname>Gonen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Goldberg</surname>
          </string-name>
          ,
          <article-title>Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics,
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>609</fpage>
          -
          <lpage>614</lpage>
          . URL: https://aclanthology.org/N19-1061/.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A vector space model for automatic indexing</article-title>
          ,
          <source>Commun. ACM</source>
          <volume>18</volume>
          (
          <year>1975</year>
          )
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          . URL: https://doi.org/10.1145/361219.361220. doi:10.1145/361219.361220.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          European Patent Office,
          <article-title>PATSTAT Autumn 2024: Worldwide patent statistical database</article-title>
          , https://www.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>