Ensemble Method for Classification in Imbalanced Patent Data

Eleni Kamateri¹ and Michail Salampasis¹

¹Department of Information and Electronic Engineering, International Hellenic University (IHU), Alexander Campus, Sindos 57400, Thessaloniki, Greece

Abstract
This study presents an ensemble method for patent classification that addresses the imbalanced patent data problem. To achieve this, the dataset is divided into two partitions based on the codes' representation magnitude. These partitions are used to train two identical classifiers separately, and their results are combined using a stacking meta-classifier. Experiments are conducted on two benchmark patent datasets. The first results show that the proposed combination of classifiers mitigates the imbalanced patent data problem and outperforms the baseline classifiers, other combinations of classifiers, and recent state-of-the-art techniques for patent classification.

Keywords
Patent, Classification, Ensemble, Imbalanced data, Single-label, Sub-classes, Ensemble method, Deep learning, Word embeddings

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27th, 2023, Taipei, Taiwan.
ekamater@hotmail.com (E. Kamateri); msa@ihu.gr (M. Salampasis)
ORCID: 0000-0003-0490-2194 (E. Kamateri); 0000-0003-4087-125X (M. Salampasis)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Patent classification is an important task of the patent examination process, dealing with the assignment of one or more classification codes from a classification scheme. The most widely used classification scheme is the International Patent Classification (IPC), which contains approximately 70,000 different IPC codes. The correct assignment of classification codes is important because it ensures that patents with similar technical characteristics are clustered together under the same classification codes, which is crucial for many subsequent tasks, such as patent management and search, technology characterization and landscaping [1, 2]. However, the large number of classification codes, along with their complex and heterogeneous definitions, makes patent classification a challenging task.

Manual patent classification, performed by patent officers when a patent application arrives, involves finding the relevant classification codes through the hierarchical descriptions of the classification scheme. It can be very time consuming, tedious, and strongly dependent on the patent officer's ability and experience [3]. This is why automatic tools for selecting the relevant classification codes are needed.

Research efforts in automated patent classification [4-7] utilize Natural Language Processing (NLP) techniques and Machine Learning (ML)/Deep Learning (DL) models for effective patent modelling and representation, and automatic classification. Most of these efforts adopt various simplifications, e.g., working mostly with well-represented codes that have many training samples, or targeting the higher levels of the classification hierarchy; even so, they do not attain acceptable performance, i.e., performance close to that of human experts.

The accuracy of a classification model mainly depends on the quality of the dataset and the classification algorithm. The data-related factors that can reduce the accuracy of a patent classification model are many: the complex/broad concepts expressed by classification codes, the ambiguous vocabulary or new terminology used, the overlapping concepts among classification codes (which increase as we go down the hierarchy), and, last but not least, the imbalanced patent dataset problem. Some classification codes have a large number of patent samples and thus a high representation magnitude in the dataset; these are called major codes/classes. Other classification codes have very few patent samples and thus a low representation magnitude; these are called minor codes/classes. Classification models trained on imbalanced datasets usually have very poor prediction ability on minor codes.

Many research efforts have been carried out to solve the imbalanced dataset issue. Improvements follow two main directions, the dataset level and the algorithm level [8]. On the dataset level, the main strategy is resampling: over-sampling and under-sampling methods have been introduced to rebalance the data [9-11]. On the algorithm level, the main idea is to adjust the algorithms themselves to improve model accuracy, for example by introducing an ensemble method [12, 13].

In this study, we adapt the ensemble architecture for patent classification presented in [14] to address the imbalanced patent data problem. More specifically, we divide the dataset into two partitions according to the codes' representation magnitude, i.e., a partition with the major codes and a partition with the minor codes, and train two classifiers of the same type on the patents of each partition separately. We then combine the outcomes of the two classifiers using a meta-classifier. The experiments showed that the proposed combination of classifiers mitigates the imbalanced patent data problem and outperforms the baseline classifiers, the previous combinations of classifiers (presented in [14]), and recent state-of-the-art techniques for patent classification.

2. Motivation

The classification scheme contains numerous codes, of which a varying number is assigned to each patent [15, 16]. The distribution of patents across classification codes is quite unbalanced, following a Pareto-like distribution [16]: about 80% of all patent documents are classified in about 20% of the classification codes, meaning that some classification codes have quite low and others quite high patent frequency.

Similar to this real-life distribution, the distribution of patents across codes in test collections is quite unbalanced. For example, in the CLEFIP-0.54M dataset², which originates from CLEF-IP 2011 (see Section 4 for more information), each code has a mean frequency of 740 patents, with a standard deviation of 1,930 patents and a median frequency of 169; the median is a more informative statistic than the mean for imbalanced datasets with many frequency outliers. Similarly, in the USPTO dataset, each code has a mean frequency of 3,177 patents, with a standard deviation of 12,710 patents and a median frequency of 578. Moreover, 392 codes (53.63% of all 731 codes) in the CLEFIP-0.54M dataset and 212 codes (33.76% of all 628 codes) in the USPTO dataset have a low patent frequency, between 1 and 200 patents (Figures 1a and 1b).

To explore whether a code's patent frequency affects the performance of patent classification models, Figure 2a displays the accuracy of a state-of-the-art DL model, the Bi-LSTM [17], over a range of patent frequencies at the subclass category of the IPC 5+ level hierarchy, using the first 60 words of the abstract section of the CLEFIP-0.54M dataset. As observed, higher accuracies are attained as the patent frequency of codes increases, meaning that the number of patent samples representing a specific classification code plays a significant role in the code's distinguishability and, ultimately, in its performance. Considering that the accuracy of the classification model across all codes is 63.76%, we assume that an adequate accuracy (see the red "threshold" line in Figure 2a) is achieved for codes represented by more than 500 patent samples. For classification codes with low representation magnitude, in particular, the accuracy achieved is quite low, significantly affecting the total accuracy of the classifier; e.g., the accuracy for codes with patent frequency between 0 and 50 patents is only 19.09%. Therefore, the idea behind this study is that a classifier focusing only on these low-represented codes would achieve better performance. This is also validated in Figure 2b, which presents the accuracy achieved by a similar classifier trained only on low-represented codes.

Figure 1a, b: The unbalanced distribution of patent frequency across the 731 and 628 main classification codes of the CLEFIP-0.54M and USPTO datasets, respectively.

Figure 2a: The accuracy of a state-of-the-art patent classification model, the Bi-LSTM, as a function of codes' patent frequency organized into groups of subsequent codes. Figure 2b: The accuracy of the same model trained only on patents with low-represented codes.

² https://drive.google.com/drive/folders/1tfBsUkQwIpwwgDyw28EOZctaiiJqZr1Q
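The imbalance statistics and the 500-patent threshold discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's codebase: the helper names and the toy label list are assumptions made here for clarity.

```python
from collections import Counter
from statistics import mean, median, pstdev

def frequency_stats(labels):
    """Count how many patents carry each classification code and
    summarize the per-code frequencies (mean, std, median)."""
    freq = Counter(labels)
    counts = list(freq.values())
    return freq, {
        "mean": mean(counts),
        "std": pstdev(counts),
        "median": median(counts),
    }

def partition_codes(freq, threshold=500):
    """Split codes into major (more than `threshold` samples)
    and minor (at most `threshold` samples) partitions."""
    major = {c for c, n in freq.items() if n > threshold}
    minor = set(freq) - major
    return major, minor
```

On a real collection, `labels` would hold the main IPC subclass code of every patent; a median far below the mean, as reported above for both benchmark datasets, is the signature of the Pareto-like imbalance.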
3. Ensemble method for classification of imbalanced patent data

An ensemble architecture for automated patent classification was introduced by Kamateri and Salampasis in [14]. The architecture consists of individual classifiers that can be of any number and any type, and that can be trained on the same or different parts of the patent document. Each classifier produces a list of probabilities for all labels based on its whole or partial knowledge of the patent. The probabilities for a specific label derived from all individual classifiers are then combined, and a total probability is calculated for this label. The label with the maximum probability constitutes the predicted label for the patent. The probabilities of the individual classifiers can be aggregated using simple/weighted averaging, voting, stacking, or other combination techniques.

In this study, we apply this ensemble architecture to address the imbalanced patent data issue, equipping it with two baseline classifiers and a meta-classifier (Figure 3). The first classifier is trained on high-represented classification codes, while the second classifier is trained on low-represented codes. Thus, each classifier specializes in a portion of the codes, those with high and low patent frequency, respectively. This means that if a patent application characterized by a classification code of low frequency is submitted to the first classifier, which specializes in high-represented codes, the classifier will not be able to classify it correctly, since it was (probably) not trained on similar patents. Conversely, if this patent application is submitted to the second classifier, which is more sensitive to codes with low patent frequency, there are better chances of it being properly classified under the correct classification code corresponding to the described invention. In such cases, an appropriate combination of two baseline classifiers can better approximate such a boundary by dividing the data space into smaller and easier-to-learn partitions. A meta-classifier is then trained on the features that are the outputs of the baseline classifiers to learn how best to combine their predictions (stacking). More specifically, the meta-classifier distinguishes whether the described invention of a patent application belongs to a high- or a low-represented classification code and, accordingly, coordinates the operation (sigmoid stacking classifier) or selects the more appropriate (softmax stacking classifier) of the two baseline classifiers to classify an incoming patent application.

Figure 3: Ensemble architecture for automated patent classification focusing on imbalanced patent data.

4. Data collection

To evaluate the real-world performance of the proposed ensemble method for imbalanced patent data, two patent benchmark datasets have been used: the USPTO-2M and the CLEFIP-0.54M.

4.1. USPTO-2M

The USPTO-2M is a large-scale dataset prepared for patent classification [6]. The raw patent data were obtained from the online website of the United States Patent and Trademark Office (USPTO) and cover the years 2006 to 2015. The dataset contains 2,000,147 patents with the title and abstract sections, classified in 637 categories at the subclass level.

4.2. CLEFIP-0.54M

The CLEFIP-0.54M contains English patents of CLEF-IP 2011 with the main classification code and all the following six patent sections: Title, Abstract, Description, Claims, Applicants and Inventors. In total, the dataset contains 541,131 patents classified in 731 subclass codes, of which 276,794 come from the European Patent Office (EPO) and 264,337 from the World Intellectual Property Organization (WIPO)³.

5. Experimental setup

The ensemble architecture presented in [14] is instantiated in this study as a single-label classification task at the subclass (3rd) level of the IPC 5+ level hierarchy. More specifically, the aim is to identify the main classification code. In the CLEFIP-0.54M dataset, this information is provided by the dataset. In the USPTO dataset, when many codes are assigned to a patent, we assume that the first code is the main classification code.

An ensemble of bidirectional LSTM classifiers was employed, since this ML method was shown in [14] to attain better results than other DL methods. Each classifier was trained on codes of different patent frequency: low-represented codes with patent frequency between 0 and 500 patents, and high-represented codes with patent frequency over 500 patents, respectively. The outcome probabilities of the individual classifiers were used as input to a meta-classifier using the stacking technique. The meta-classifier is a neural network with two dense layers. The second dense layer is activated with a softmax or a sigmoid activation in order to obtain a probability distribution over all targeted labels/codes.

With respect to the patent representation, the first 60 words of the patent part of interest (e.g., title, abstract, etc.) were used, after a sequence of preprocessing steps (cleaning punctuation, symbols and numbers, and stop word removal). The feature words were then mapped to embeddings using a domain-specific pre-trained language model created on a patent dataset, proposed by Risch and Krestel [7].

The dataset was split into training, validation and testing sets (80:10:10). The batch size was set to 128, and the number of epochs to 15 for the baseline classifiers and 20 for the meta-classifier.

6. Results

In each experiment, two baseline classifiers were trained on two different data partitions. The first classifier was trained on patents belonging to high-represented codes, with patent frequency over 500 patents, while the second classifier was trained on patents of low-represented codes, with patent frequency between 1 and 500 patents. Table 1 presents the accuracy attained by each classifier i) when tested on the same data partition on which it was trained ("Testing on the same data partition"), and ii) when tested on the entire dataset, containing both data partitions with known and unknown data ("Testing on the entire dataset"). It also presents the accuracy of the meta-classifier combining the outcomes of the two baseline classifiers using a stacking technique. Last, it presents the accuracy of the ensemble of classifiers combining sigmoid predictions from different patent sections.

Table 1: Accuracy at subclass level. For classifiers 1 and 2, the two values give the accuracy when testing on the same data partition / on the entire dataset.

Dataset | Section | Classifier 1 (trained on high-represented codes) | Classifier 2 (trained on low-represented codes) | Meta-classifier (Softmax) | Meta-classifier (Sigmoid) | Baseline classifier trained on the entire dataset
USPTO-2M | Title | 55.34% / 54.28% | 65.43% / 1.50% | 54.65% | 55.39% | 53.44%
USPTO-2M | Abstract | 59.85% / 59.86% | 71.79% / 1.65% | 59.86% | 60.64% | 58.61%
CLEFIP-0.54M | Abstract | 68.02% / 63.91% | 65.72% / 9.37% | 67.69% | 68.14% | 63.76%
CLEFIP-0.54M | Description | 70.59% / 66.43% | 71.23% / 10.16% | 69.47% | 71.10% | 66.46%
CLEFIP-0.54M | Claims | 68.64% / 64.59% | 64.42% / 9.52% | 68.23% | 68.88% | 64.56%

Combining predictions from all patent sections (weighted average), the reported accuracies are 61.98% / 62.11% / 59.92% for the USPTO-2M and 75.36% / 75.40% / 70.39% for the CLEFIP-0.54M, for the softmax ensemble, the sigmoid ensemble, and the weighted-average ensemble of [14], respectively.

In both datasets, the accuracy is much improved when a stacking technique is applied, combining the predicted probabilities of the individual classifiers specialized in high- and low-represented codes, respectively. Moreover, the stacking classifier using the sigmoid activation seems to slightly outperform the stacking classifier using the softmax activation.

³ CLEFIP-0.54M 2022 (accessed 18/12/2022), https://github.com/ekamater/CLEFIP2011_XML2MySQL
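As a concrete illustration of the stacking step evaluated above, the sketch below passes the concatenated probability vectors of the two baseline classifiers through a two-dense-layer meta-network whose output layer is a softmax, matching the softmax variant. The layer sizes, the ReLU hidden activation and the random (untrained) weights are assumptions made for illustration; the paper does not publish its implementation, and training of the meta-classifier is omitted here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def stacked_prediction(p_high, p_low, w1, b1, w2, b2):
    """Meta-classifier forward pass: concatenate the probability vectors
    of the high- and low-represented-code classifiers, apply two dense
    layers, and return a single distribution over all codes."""
    x = np.concatenate([p_high, p_low])      # stacked baseline outputs
    h = np.maximum(w1 @ x + b1, 0.0)         # first dense layer (ReLU)
    return softmax(w2 @ h + b2)              # second dense layer (softmax)

# Toy dimensions: 5 subclass codes, 8 hidden units, untrained random weights.
rng = np.random.default_rng(0)
n_codes, hidden = 5, 8
w1 = rng.normal(size=(hidden, 2 * n_codes)); b1 = np.zeros(hidden)
w2 = rng.normal(size=(n_codes, hidden));     b2 = np.zeros(n_codes)
p_high = softmax(rng.normal(size=n_codes))   # output of baseline classifier 1
p_low = softmax(rng.normal(size=n_codes))    # output of baseline classifier 2
combined = stacked_prediction(p_high, p_low, w1, b1, w2, b2)
```

In the sigmoid variant, the output layer would use per-code sigmoids instead of a softmax, so the scores need not sum to one; in either case the predicted main code is simply `combined.argmax()`.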
It is also clear that the proposed method provides better results than those obtained by recent state-of-the-art techniques [14, 18, 19].

7. Conclusions

In this study, a novel ensemble method for patent classification is presented, addressing the imbalanced patent data problem, which is one of the most significant factors reducing accuracy in automated patent classification. The results showed that a proper combination of classifiers can attain significantly improved accuracy compared to baseline classifiers and existing classification techniques. Moreover, combining the knowledge gained from multiple classifiers could address the problem of low patent sample representation for codes, a phenomenon that is relatively common in the patent domain as the IPC/CPC taxonomy evolves, with new codes introduced, codes partitioned into sub-categories, etc.

Acknowledgements

The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 10695).

References

[1] M. Salampasis, G. Paltoglou, A. Giahanou, Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents. Conference and Labs of the Evaluation Forum, 2012.
[2] E. Perez-Molina, F. Loizides, Novel data structure and visualization tool for studying technology evolution based on patent information: The DTFootprint and the TechSpectrogram. World Patent Information 64 (2021) 102009. doi: https://doi.org/10.1016/j.wpi.2020.102009.
[3] T. Montecchi, D. Russo, Y. Liu, Searching in Cooperative Patent Classification: Comparison between keyword and concept-based search. Advanced Engineering Informatics 27(3) (2013) 335-345. doi: https://doi.org/10.1016/j.aei.2013.02.002.
[4] M. F. Grawe, C. A. Martins, A. G. Bonfante, Automated patent classification using word embedding. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 408-411. doi: https://doi.org/10.1109/ICMLA.2017.0-127.
[5] L. Xiao, G. Wang, Y. Zuo, Research on patent text classification based on word2vec and LSTM. In 2018 11th International Symposium on Computational Intelligence and Design (ISCID), 2018, pp. 71-74. doi: https://doi.org/10.1109/ISCID.2018.00023.
[6] S. Li, J. Hu, Y. Cui, J. Hu, DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics 117(2) (2018) 721-744. doi: https://doi.org/10.1007/s11192-018-2905-5.
[7] J. Risch, R. Krestel, Domain-specific word embeddings for patent classification. Data Technologies and Applications 53 (2019) 108-122. doi: https://doi.org/10.1108/DTA-01-2019-0002.
[8] H. Feng, W. Qin, H. Wang, Y. Li, G. Hu, A combination of resampling and ensemble method for text classification on imbalanced data. In: Wei, J., Zhang, LJ. (eds), Big Data – BigData 2021. Lecture Notes in Computer Science, volume 12988. Springer, Cham. doi: https://doi.org/10.1007/978-3-030-96282-1_1.
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357. doi: https://doi.org/10.1613/jair.953.
[10] G. E. Batista, A. L. Bazzan, M. C. Monard, Balancing training data for automated annotation of keywords: a case study. In WOB, 2003, pp. 10-18.
[11] B. Krawczyk, M. Koziarski, M. Woźniak, Radial-based oversampling for multiclass imbalanced data classification. IEEE Transactions on Neural Networks and Learning Systems 31(8) (2019) 2818-2831. doi: 10.1109/TNNLS.2019.2913673.
[12] Y. Zhao, A. K. Shrivastava, K. L. Tsui, Imbalanced classification by learning hidden data structure. IIE Transactions 48(7) (2016) 614-628. doi: https://doi.org/10.1080/0740817X.2015.1110269.
[13] C. Cao, Z. Wang, IMCStacking: cost-sensitive stacking learning with feature inverse mapping for imbalanced problems. Knowledge-Based Systems 150 (2018) 27-37. doi: https://doi.org/10.1016/j.knosys.2018.02.031.
[14] E. Kamateri, M. Salampasis, An Ensemble Architecture of Classifiers for Patent Classification. In Proceedings of the 3rd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech), 2022, pp. 6-7. doi: https://doi.org/10.34726/3550.
[15] M. R. Gouvea Meireles, G. Ferraro, S. Geva, Classification and information management for patent collections: a literature review and some research questions. Information Research 21(1) (2016) 7051-29.
[16] K. Benzineb, J. Guyot, Automated patent classification. In M. Lupu, K. Mayer, J. Tait, A. J. Trippe (Eds.), Current Challenges in Patent Information Retrieval, Springer, London, 2011, pp. 239-262. doi: https://doi.org/10.1007/978-3-642-19231-9_12.
[17] E. Kamateri, V. Stamatis, K. Diamantaras, M. Salampasis, Automated Single-Label Patent Classification using Ensemble Classifiers. In 2022 14th International Conference on Machine Learning and Computing (ICMLC), 2022, pp. 324-330. doi: https://doi.org/10.1145/3529836.3529849.
[18] M. Sofean, Deep learning based pipeline with multichannel inputs for patent classification. World Patent Information 66 (2021) 102060. doi: https://doi.org/10.1016/J.WPI.2021.102060.
[19] D. Tikk, G. Biró, A. Törcsvári, A hierarchical online classifier for patent categorization. In Emerging Technologies of Text Mining: Techniques and Applications, 2008, pp. 244-267. doi: https://doi.org/10.4018/978-1-59904-373-9.CH012.