<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lightweight Out-of-Distribution Detection for Patent Classification in Non-Stationary Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ekaterina Kotliarova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Björkqvist</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>IPRally Technologies Oy</institution>
          ,
          <addr-line>Helsinki</addr-line>
          ,
          <country country="FI">Finland</country>
        </aff>
      </contrib-group>
      <fpage>60</fpage>
      <lpage>66</lpage>
      <abstract>
<p>Categorizing patents into different classes is an essential step in processes such as monitoring competitors, managing patent portfolios, and landscaping existing inventions. In practical applications, classifiers are often trained on limited data and then applied to out-of-distribution documents, i.e., samples that are quite different from what the classifier was trained on. This may result in incorrect and nonsensical classification results. In this work, we explore lightweight methods for detecting such out-of-distribution (OOD) samples before classification. We show that a simple nearest neighbor-based approach is highly reliable for OOD sample detection in general, with the downside of having to store the embeddings of the training set to perform inference. We also introduce a method based on probability density functions (PDF) and show that when combined with a custom thresholding strategy, it effectively retains in-distribution samples and filters out anomalies, while requiring the storage of only the mean and covariance matrix of the training data.</p>
      </abstract>
      <kwd-group>
        <kwd>classification</kwd>
        <kwd>distribution shift</kwd>
        <kwd>anomaly detection</kwd>
        <kwd>patents</kwd>
        <kwd>document embeddings</kwd>
        <kwd>patent search</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Classifying patent documents plays a central role in various industrial and legal processes. In practical
deployments, classifiers often operate on limited and evolving data, and may be applied to domains
different from those they were trained on. These conditions lead to distribution shifts between training
and test data, and therefore to the appearance of out-of-distribution (OOD) inputs, i.e., documents that
differ significantly from those seen by the model during training.</p>
      <p>
        Our previous work addressed several of these challenges by utilizing search-based embeddings and
semi-supervised learning to improve classification with limited data [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. As a continuation, this paper
focuses on detecting OOD-inputs before classification to prevent unreliable predictions in mismatched
domains.
      </p>
      <p>To this end, we evaluate several unsupervised OOD-detection methods operating in the embedding
space. In particular, we propose a lightweight approach that: (i) has high in-distribution (ID)
retention: the method retains almost all relevant documents; (ii) is easy to use: the method requires
minimal computation, i.e., only the empirical mean and covariance matrix of the training data are
computed, while test samples are scored via a multivariate Gaussian probability density function (PDF);
(iii) is OOD-agnostic: the detection threshold is calibrated using only the user-provided ID
training data.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Literature survey</title>
      <p>
        Modern machine learning models are often developed under the assumption that training and test data
are drawn from the same underlying distribution [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], but this rarely holds in practice. In open-world
settings, models frequently encounter out-of-distribution (OOD) inputs, i.e., samples that differ significantly
from the training data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>6th Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech) 2025</p>
      <p>© 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        As discussed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], distributional shifts in tasks based on textual data representations can generally
be categorized into two types: (i) semantic shift, where the OOD samples belong to entirely new
categories and should not be mapped to any existing class; (ii) non-semantic shift, where the OOD
samples differ in domain or style but share the same class semantics as the ID samples.
      </p>
      <p>
        Our task falls under the semantic shift scenario, where the OOD documents may come from previously
unseen patent categories and must not be forced into known classes. As mentioned in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we can
thus utilize the following taxonomy of OOD-detection methods for detecting semantic shift: (i) OOD
samples are available for training; (ii) OOD samples are unavailable, but ID labels are available;
(iii) both OOD data and ID labels are unavailable.
      </p>
      <p>
        In our setting, while the training set is labeled for the downstream classification task, the labels do not contribute
to the OOD-detection step, which is inherently unsupervised. We therefore conclude that the
OOD-detection task in our case falls under the third option, which is a well-known classic anomaly detection
problem [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>Although various strategies can be applied to perform OOD-detection, in this work, we focus on
evaluating lightweight, unsupervised OOD-detection methods in the embedding space. Specifically,
we compare a PDF-based likelihood approach with common baselines such as k-NN, Local Outlier
Factor, and Isolation Forest (see the discussion in Section 3.1). As thresholding heavily impacts final
performance, we also examine different threshold selection strategies (Section 3.2).</p>
      <sec id="sec-3-1">
        <title>3.1. Baseline and Proposed Methods</title>
        <p>As discussed in Section 2, our task falls under unsupervised OOD-detection, which requires no OOD
samples or class labels during training. We prioritize lightweight methods with continuous scores,
allowing us to control the strictness of OOD-detection by adjusting the decision threshold, while
keeping training costs low. The goal of OOD-detection is to suppress unreliable classification outputs
for inputs that deviate significantly from the training data. Thus, detected OOD-samples can be withheld
from further classification or flagged for manual review. We selected the following methods for our
evaluations:
1. Distance-based methods, such as k-nearest neighbors (k-NN), which compute the average distance
of a test point to its k closest training embeddings [7]. The idea behind nearest-neighbor methods
is that ID (in-distribution) data are more likely to be close to their neighbors than OOD data. After
computing the scores, a threshold is applied (threshold selection is covered in Section 3.2).
2. Density-based methods, such as Local Outlier Factor (LOF) and Isolation Forest (IF), which
estimate how isolated a test sample is compared to the ID data [8, 9]. Both methods compute
continuous anomaly scores and require a threshold.
3. Likelihood-based models, which estimate the probability of a sample under a distribution fitted
to the training data. In our case, we adopt a custom Probability Density Function (PDF)-based
approach. By computing the mean and covariance of the in-distribution (ID) training set, we
evaluate the likelihood of each test sample under this distribution. The method is described in
detail in Algorithm 1.</p>
        <p>We use the implementations provided by the scikit-learn library for the k-Nearest Neighbors, Local
Outlier Factor, and Isolation Forest algorithms [10]. Meanwhile, our PDF-based approach works as
presented in Algorithm 1.</p>
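        <p>As a rough illustration, the three scikit-learn baselines can be scored on synthetic embeddings as follows; the toy 8-dimensional Gaussian data and variable names are our own stand-ins, not the actual patent embeddings.</p>

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
train_emb = rng.normal(0.0, 1.0, size=(500, 8))  # stand-in ID training embeddings
id_test = rng.normal(0.0, 1.0, size=(50, 8))     # ID-like test samples
ood_test = rng.normal(6.0, 1.0, size=(50, 8))    # clearly out-of-distribution samples
test_emb = np.vstack([id_test, ood_test])

# 1. k-NN: average distance of a test point to its k closest training embeddings.
k = 10
knn = NearestNeighbors(n_neighbors=k).fit(train_emb)
dists, _ = knn.kneighbors(test_emb)
knn_scores = dists.mean(axis=1)              # higher score = more OOD-like

# 2. Local Outlier Factor in novelty mode; score_samples is higher for inliers,
# so we negate it to match the "higher = more OOD-like" convention.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(train_emb)
lof_scores = -lof.score_samples(test_emb)

# 3. Isolation Forest; score_samples is likewise higher for inliers.
iso = IsolationForest(random_state=0).fit(train_emb)
iso_scores = -iso.score_samples(test_emb)

# On this toy data all three methods separate ID from OOD samples.
for scores in (knn_scores, lof_scores, iso_scores):
    assert scores[:50].mean() < scores[50:].mean()
```

        <p>A threshold on these continuous scores then yields the binary ID/OOD decision, as discussed in Section 3.2.</p>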
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Threshold computation</title>
        <p>Threshold selection plays a crucial role in OOD-detection, as it directly influences the final result;
therefore, it should be considered an important part of the overall approach. A common practice is to
set the threshold to achieve a high true positive rate (TPR) on in-distribution data and then report the</p>
        <sec id="sec-3-2-1">
          <title>Algorithm 1 OOD-Detection via PDF</title>
          <p>Input: Training dataset D_in, test sample x*, threshold τ</p>
          <p>Training Stage:
1. Compute the mean μ and covariance matrix Σ of the training dataset D_in.</p>
          <p>Inference Stage:
1. Compute the OOD-score of the test sample x* as its probability density p(x*)
under the multivariate Gaussian distribution defined by the training data’s mean μ and covariance
matrix Σ.
2. Compare the computed OOD-score to the threshold τ; if p(x*) ≥ τ the sample is considered
in-distribution (ID), otherwise it is flagged as out-of-distribution (OOD).</p>
          <p>
            Output: Binary decision whether x* comes from the same distribution as the training dataset D_in (ID) or
not (OOD).
corresponding false positive rate (FPR) on OOD data. While many works report FPR@95%TPR, meaning that
the threshold is set so that a 95% TPR is achieved on the validation set [
            <xref ref-type="bibr" rid="ref5">5, 7</xref>
            ], in our case, retaining the
relevant documents is the priority, so we instead use a 99% TPR threshold.
          </p>
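          <p>For intuition, the training and inference stages of Algorithm 1, together with a simple 99% TPR threshold (the 1st percentile of ID validation likelihoods), can be sketched as follows; the synthetic 5-dimensional data is a hypothetical stand-in for real document embeddings.</p>

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
train_emb = rng.normal(0.0, 1.0, size=(1000, 5))  # stand-in ID training embeddings

# Training stage: only the mean and covariance matrix are stored.
mu = train_emb.mean(axis=0)
sigma = np.cov(train_emb, rowvar=False)

# Inference stage: score a sample by its density under N(mu, sigma);
# allow_singular guards against a rank-deficient covariance estimate.
mvn = multivariate_normal(mean=mu, cov=sigma, allow_singular=True)

# A 99% TPR threshold: keep 99% of ID validation samples, i.e. use the
# 1st percentile of their likelihoods as the cutoff.
val_emb = rng.normal(0.0, 1.0, size=(500, 5))
tau = np.percentile(mvn.pdf(val_emb), 1)

id_sample = np.zeros(5)       # near the training mean
ood_sample = np.full(5, 8.0)  # far from the training distribution
assert mvn.pdf(id_sample) >= tau   # retained as in-distribution
assert mvn.pdf(ood_sample) < tau   # flagged as out-of-distribution
```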
          <p>Additionally, for the PDF-based method (Algorithm 1), we utilize a custom threshold selection
algorithm. The method aims to avoid using unstable low-probability outliers as the threshold, while
also not enforcing arbitrary strictness, such as discarding a fixed percentage of the data. To achieve this,
we smooth the normalized likelihoods and identify approximate inflection points. The lowest such
point serves as a cutoff, and all values below this point are considered to be outliers. The fact that the
cutoff is not the global minimum is crucial, since inflection point detection is approximate due to the
smoothing used, and the global minimum is often unusable due to zero-likelihood artifacts. See Fig. 1
for the intuition behind this approach. The final threshold is set as the minimum of the outlier-cleaned
set. The full procedure is detailed in Algorithm 2.</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Algorithm 2 Custom Threshold Selection</title>
          <p>Input: Validation set D_val, mean μ, and covariance matrix Σ of the training dataset D_in
Threshold Selection:</p>
          <p>1. Compute the likelihoods of D_val using the mean μ and covariance matrix Σ as presented in the Inference
Stage of Algorithm 1, resulting in a set L_val.</p>
          <p>2. Normalize the values in L_val to the range [0, 1].</p>
          <p>3. Apply Gaussian smoothing with bandwidth σ to the normalized likelihood set L_val. The parameter
σ controls how strongly the curve is smoothed. We use σ = 2.0, selected empirically.
4. Compute the second derivative of the smoothed scores.</p>
          <p>5. Identify inflection points: locations where the second derivative changes sign. Note that the
inflection points are approximate due to normalization and computational artifacts.</p>
          <p>6. Identify the minimum among the inflection points. Use this value as a cutoff: remove all
likelihood values in L_val that are equal to or lower than this point, yielding a cleaned set L_val-clean.</p>
          <p>7. Set the threshold τ = min{L_val-clean}.</p>
          <p>Output: Threshold τ.</p>
        </sec>
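        <p>A minimal sketch of Algorithm 2 is given below. Treating the sorted, normalized likelihoods as the curve to be smoothed is our reading of the method, and the helper name select_threshold is our own; the exact published procedure may differ in detail.</p>

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def select_threshold(likelihoods, bandwidth=2.0):
    # Steps 2-3: normalize to [0, 1] and smooth the sorted likelihood curve.
    vals = np.sort(np.asarray(likelihoods, dtype=float))
    norm = (vals - vals.min()) / (vals.max() - vals.min())
    smooth = gaussian_filter1d(norm, sigma=bandwidth)
    # Step 4: second derivative of the smoothed curve.
    d2 = np.gradient(np.gradient(smooth))
    # Step 5: approximate inflection points, where the second derivative flips sign.
    flips = np.where(np.diff(np.sign(d2)) != 0)[0]
    if flips.size == 0:
        return vals.min()  # degenerate case: keep every sample
    # Step 6: the lowest inflection value is the cutoff; drop values at or below it.
    cutoff = norm[flips].min()
    cleaned = vals[norm > cutoff]
    if cleaned.size == 0:
        return vals.min()
    # Step 7: the threshold is the minimum of the cleaned set.
    return cleaned.min()

rng = np.random.default_rng(2)
likelihoods = np.concatenate([rng.uniform(0.4, 1.0, 200),  # bulk of ID likelihoods
                              rng.uniform(0.0, 0.01, 5)])  # near-zero artifacts
tau = select_threshold(likelihoods)
assert likelihoods.min() <= tau <= likelihoods.max()
```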
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Graph-based embeddings trained for patent search</title>
        <p>
          We use embeddings generated as described in [13, 14], where each patent document is first converted
into a graph that represents the key features of the invention and their relationships. The resulting
graphs are significantly smaller than the original documents, which enables efficient processing of large
documents while still preserving relevant information required for prior art searches. The graph is then
embedded into a vector space using a graph neural network (GNN) trained to perform prior art searches
using patent examiner citation data. Using citation data enables the model to recognize semantically
similar inventions despite differences in terminology, placing them close together in the embedding
space. The resulting embeddings may be used as input to a lightweight classification model, as shown in
[
          <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Datasets</title>
        <p>Four datasets were chosen for this study: two public and two proprietary. The public datasets are the
Qubit [11] and the Cannabinoid patent datasets [12]. The proprietary ones originate from distinct
domains: one from the mechanical engineering patent domain and the other from the chemistry field.
Only one document per family is kept in each data set to avoid over-representation of large patent
families (refer to Table 1 for the dataset sizes).</p>
        <p>For the purpose of evaluation, we simulate an OOD-detection setup by selecting one dataset (e.g.,
Qubit) to serve as the in-distribution (ID) set and treating samples from the remaining datasets as
out-of-distribution (OOD). All four datasets originate from diferent domains, which reflects realistic
domain shift scenarios in patent classification.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Experiment setup and evaluation</title>
        <p>Each model is trained using a training set extracted from the complete dataset. The models take
document embeddings as input and generate OOD-scores for each test sample as output. Validation
sets are fixed for every dataset. For the experiments on subsets of the data, we partition the training set
by randomly sampling p percent of the data points, with p ranging from 5 to 100.</p>
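        <p>The evaluation protocol can be sketched as follows: with OOD-scores where higher means more ID-like (as in Algorithm 1), the FNR is the fraction of ID test samples falling below the threshold and the FPR is the fraction of OOD samples at or above it. The Gaussian score distributions here are hypothetical placeholders, not our measured scores.</p>

```python
import numpy as np

def evaluate(id_scores, ood_scores, tau):
    """FNR: ID samples wrongly flagged as OOD; FPR: OOD samples wrongly kept.
    Scores follow the convention of Algorithm 1: higher = more ID-like."""
    fnr = float(np.mean(id_scores < tau))
    fpr = float(np.mean(ood_scores >= tau))
    return fnr, fpr

rng = np.random.default_rng(3)
id_val = rng.normal(5.0, 1.0, 1000)    # stand-in ID validation scores
id_test = rng.normal(5.0, 1.0, 1000)   # stand-in ID test scores
ood_test = rng.normal(0.0, 1.0, 1000)  # stand-in OOD test scores

tau = np.percentile(id_val, 1)         # threshold chosen for 99% TPR
fnr, fpr = evaluate(id_test, ood_test, tau)
assert fnr < 0.05   # by construction, roughly 1% of ID samples are lost
assert fpr < 0.05   # these OOD scores sit far below the threshold
```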
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and discussions</title>
      <p>The results of all methods are shown in Fig. 2, with the Qubit dataset chosen as the in-distribution data
set. The false negative rate (FNR) hovers around 1% for all methods, which is to be expected since we
selected the threshold to achieve 99% TPR. The k-NN algorithm has a low false positive rate (FPR) on
all training data set sizes, while our PDF-based method combined with the custom threshold selection
achieves a similar or lower FPR than k-NN when at least 50% of the training data set is used. The other
algorithms have significantly higher FPR.</p>
      <p>Table 2 presents how the k-NN and PDF algorithms perform on the other OOD data sets. The k-NN
algorithm is the most stable, performing well even with small amounts of training data, while the PDF
method combined with the custom threshold selection performs well when at least 50% of the training
data is used. The results also demonstrate the usefulness of the custom threshold selection algorithm: if
the threshold for the PDF method is instead set to achieve 99% TPR, the FPR is significantly higher with
large data sets.</p>
      <p>Future work could explore various thresholding strategies for the k-NN method and explore
modifications that reduce the need to store the entire training set for calculating the distances between test samples
and their k nearest neighbors. Perhaps the method could be adapted to operate on the mean vector or a
set of representative cluster centers instead.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>In this work we analyzed algorithms for detecting OOD-samples in classification. We demonstrated
that using nearest neighbors achieves the best trade-off between detecting OOD-samples and retaining
ID-samples, especially with small training sets. We also introduced a PDF-based method and showed
that, when combined with a custom threshold selection algorithm, it works well with large training
sets while avoiding the need to store the entire training set to perform inference.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration of Generative AI</title>
      <p>During the preparation of this work, the authors used Grammarly for grammar and spelling checking, and ChatGPT-4o for paraphrasing, rewording, and improving the writing style. After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[7] Y. Sun, Y. Ming, X. Zhu, Y. Li, Out-of-distribution detection with deep nearest neighbors, in:
K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, S. Sabato (Eds.), Proceedings of the 39th
International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning
Research, PMLR, 2022, pp. 20827–20840. URL: https://proceedings.mlr.press/v162/sun22d.html.</p>
      <p>[8] M. M. Breunig, H.-P. Kriegel, R. T. Ng, J. Sander, LOF: identifying density-based local outliers,
in: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data,
SIGMOD ’00, Association for Computing Machinery, New York, NY, USA, 2000, pp. 93–104.
doi:10.1145/342009.335388.</p>
      <p>[9] F. T. Liu, K. M. Ting, Z.-H. Zhou, Isolation forest, in: 2008 Eighth IEEE International Conference
on Data Mining, 2008, pp. 413–422. doi:10.1109/ICDM.2008.17.</p>
      <p>[10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,
R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay,
Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.</p>
      <p>[11] S. Harris, A. Trippe, D. Challis, N. Swycher, Construction and evaluation of gold standards for
patent classification—a case study on quantum computing, World Patent Information 61 (2020) 101961.</p>
      <p>[12] S. Harris, Gold standard for the evaluation of machine classification of patent data, 2019. URL:
https://github.com/swh/classification-gold-standard/tree/master.</p>
      <p>[13] S. Björkqvist, J. Kallio, Building a graph-based patent search engine, in: Proceedings of the 46th
International ACM SIGIR Conference on Research and Development in Information Retrieval,
SIGIR ’23, Association for Computing Machinery, New York, NY, USA, 2023, pp. 3300–3304.
doi:10.1145/3539618.3591842.</p>
      <p>[14] K. Daniell, I. Buzhinsky, S. Björkqvist, Efficient patent searching using graph transformers, in:
Proceedings of the 6th Workshop on Patent Text Mining and Semantic Technologies,
PatentSemTech’25, 2025. To appear.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lagus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kotliarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Björkqvist</surname>
          </string-name>
          ,
          <article-title>Patent classification on search-optimized graph-based representations</article-title>
          ,
          <source>in: Proceedings of the 4th Workshop on Patent Text Mining and Semantic Technologies</source>
          ,
          <source>PatentSemTech'23</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>38</lpage>
          . URL: https://ceur-ws.org/Vol-3604/paper2.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Kotliarova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Björkqvist</surname>
          </string-name>
          ,
          <article-title>Semi-supervised learning methods for patent classification using search-optimized graph-based representations</article-title>
          ,
          <source>in: Proceedings of the 5th Workshop on Patent Text Mining and Semantic Technologies</source>
          ,
          <source>PatentSemTech'24</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>24</lpage>
          . URL: https://ceur-ws.org/Vol-3775/paper4.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Vapnik</surname>
          </string-name>
          ,
          <article-title>Principles of risk minimization for learning theory</article-title>
          ,
          <source>in: Proceedings of the 5th International Conference on Neural Information Processing Systems</source>
          , NIPS'91, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>1991</year>
          , p.
          <fpage>831</fpage>
          -
          <lpage>838</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bendale</surname>
          </string-name>
          , T. Boult, Towards open world recognition,
          <source>in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1893</fpage>
          -
          <lpage>1902</lpage>
          . doi:10.1109/CVPR.2015.7298799.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Lang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>A survey on out-of-distribution detection in NLP</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2023</year>
          ). URL: https://par.nsf.gov/servlets/purl/10526541.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ming</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kuen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Morariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Barmpalios</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nenkova</surname>
          </string-name>
          ,
          <article-title>A critical analysis of document out-of-distribution detection</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2023</year>
          ,
          Association for Computational Linguistics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>4973</fpage>
          -
          <lpage>4999</lpage>
          . URL: https://aclanthology.org/2023.findings-emnlp.332/. doi:10.18653/v1/2023.findings-emnlp.332.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>