Ensemble Method for Classification in Imbalanced Patent Data

Eleni Kamateri¹ and Michail Salampasis¹

¹Department of Information and Electronic Engineering, International Hellenic University (IHU), Alexander Campus, Sindos 57400, Thessaloniki, Greece

Abstract
This study presents an ensemble method for patent classification that addresses the imbalanced patent data problem. To achieve this, the dataset is divided into two partitions based on the codes' representation magnitude. These partitions are used to train two identical classifiers separately, and their results are combined using a stacking meta-classifier. Experiments are conducted on two benchmark patent datasets. The first results show that the proposed combination of classifiers mitigates the imbalanced patent data problem and outperforms the baseline classifiers, other combinations of classifiers, and recent state-of-the-art techniques for patent classification.

Keywords
Patent, Classification, Ensemble, Imbalanced data, Single-label, Sub-classes, Ensemble method, Deep learning, Word embeddings

PatentSemTech'23: 4th Workshop on Patent Text Mining and Semantic Technologies, July 27th, 2023, Taipei, Taiwan.
ekamater@hotmail.com (E. Kamateri); msa@ihu.gr (M. Salampasis)
ORCID: 0000-0003-0490-2194 (E. Kamateri); 0000-0003-4087-125X (M. Salampasis)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Patent classification is an important task of the patent examination process, dealing with the assignment of one or more classification codes from a classification scheme. The most widely used classification scheme is the International Patent Classification (IPC), which contains approximately 70,000 different IPC codes. The correct assignment of classification codes is important because it ensures that patents with similar technical characteristics are clustered together under the same classification codes, which is crucial for many subsequent tasks, such as patent management and search, technology characterization and landscaping [1, 2]. However, the large number of classification codes, along with their complex and heterogeneous definitions, makes patent classification a challenging task.

Manual patent classification, performed by patent officers when a patent application arrives, involves finding the relevant classification codes through the hierarchical descriptions of the classification scheme. It can be very time consuming, tedious, and strongly dependent on the patent officer's ability and experience [3]. This is why automatic tools for selecting the relevant classification codes are needed.

Research efforts in automated patent classification [4-7] utilize Natural Language Processing (NLP) techniques and Machine Learning (ML)/Deep Learning (DL) models for effective patent modelling and representation, and automatic classification. Most of these efforts adopt various simplifications, e.g., working mostly with well-represented codes that have many training samples, or targeting the higher levels of the classification hierarchy; even so, they do not attain acceptable performance, i.e., performance close to that of human experts.

The accuracy of a classification model mainly depends on the quality of the dataset and the classification algorithm. The data-related factors that can reduce the accuracy of a patent classification model are many: the complex/broad concepts expressed by classification codes, the ambiguous vocabulary or new terminology used, the overlapping concepts among classification codes (which increase as we go down the hierarchy), and, last but not least, the imbalanced patent dataset problem. Some classification codes have a large number of patent samples and thus a high representation magnitude in the dataset; these are called major codes/classes. Other classification codes have very few patent samples and thus a low representation magnitude; these are called minor codes/classes. Classification models trained on imbalanced datasets usually have very poor prediction ability on minor codes.

Many research efforts have been carried out to solve the imbalanced dataset issue. Improvements follow two main directions, the dataset level and the algorithm level [8]. On the dataset level, the main strategy is resampling: over-sampling and under-sampling methods have been introduced to rebalance the data [9-11]. On the algorithm level, the main idea is to adjust the algorithms themselves to improve model accuracy, for example by introducing an ensemble method [12, 13].

In this study, we adapt the ensemble architecture for patent classification presented in [14] to address the imbalanced patent data problem. More specifically, we divide the dataset into two partitions according to the codes' representation magnitude, i.e., a partition with the major codes and a partition with the minor codes, and train two classifiers of the same type on the patents of each partition separately. We then combine the outcomes of the two classifiers using a meta-classifier. The experiments showed that the proposed combination of classifiers mitigates the imbalanced patent data problem and outperforms the baseline classifiers, the previous combinations of classifiers (presented in [14]), and recent state-of-the-art techniques for patent classification.

2. Motivation

The classification scheme contains numerous codes, of which a varying number is assigned to each patent [15, 16]. The distribution of patents across classification codes is quite unbalanced, following a Pareto-like distribution [16]: about 80% of all patent documents are classified in about 20% of the classification codes, meaning that some classification codes have quite low and others quite high patent frequency.

Similar to this real-life distribution, the distribution of patents across codes in test collections is quite unbalanced. For example, in the CLEFIP-0.54M dataset², which originates from CLEF-IP 2011 (see Section 4 for more information), each code has a mean frequency of 740 patents, with a standard deviation of 1,930 patents and a median frequency of 169; the median is a more informative statistic than the mean for imbalanced datasets with many frequency outliers. Similarly, in the USPTO dataset, each code has a mean frequency of 3,177 patents, with a standard deviation of 12,710 patents and a median frequency of 578. Moreover, 392 codes (53.63% of all 731 codes) in the CLEFIP-0.54M dataset and 212 codes (33.76% of all 628 codes) in the USPTO dataset have a low patent frequency, between 1 and 200 patents (Figures 1a and 1b).

To explore whether a code's patent frequency affects the performance of patent classification models, Figure 2a displays the accuracy of a state-of-the-art DL model, the Bi-LSTM [17], over a range of patent frequencies at the subclass category of the IPC 5+ level hierarchy, using the first 60 words of the abstract section of the CLEFIP-0.54M dataset. As observed, higher accuracies are attained as the patent frequency of codes increases, meaning that the number of patent samples representing a specific classification code plays a significant role in the code's distinguishability and, ultimately, in its performance. Considering that the accuracy of the classification model across all codes is 63.76%, we assume that an adequate accuracy (see the red "threshold" line in Figure 2a) is achieved for codes represented by more than 500 patent samples. For classification codes with low representation magnitude, in particular, the accuracy achieved is quite low, significantly affecting the total accuracy of the classifier; e.g., the accuracy for codes with patent frequency between 0 and 50 patents is only 19.09%. Therefore, the idea behind this study is that a classifier focusing only on these low-represented codes would achieve better performance. This is also validated in Figure 2b, which presents the accuracy achieved by a similar classifier trained only on low-represented codes.

Figure 1a, b: The unbalanced distribution of patent frequency across the 731 and 628 main classification codes of the CLEFIP-0.54M and USPTO datasets, respectively.

Figure 2a: The accuracy of a state-of-the-art patent classification model, the Bi-LSTM, as a function of codes' patent frequency organized into groups of subsequent codes. Figure 2b: The accuracy of the same model trained only on patents with low-represented codes.

² https://drive.google.com/drive/folders/1tfBsUkQwIpwwgDyw28EOZctaiiJqZr1Q
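The imbalance statistics and the 500-patent threshold discussed above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's codebase: the helper names and the toy label list are assumptions made here for clarity.

```python
from collections import Counter
from statistics import mean, median, pstdev

def frequency_stats(labels):
    """Count how many patents carry each classification code and
    summarize the per-code frequencies (mean, std, median)."""
    freq = Counter(labels)
    counts = list(freq.values())
    return freq, {
        "mean": mean(counts),
        "std": pstdev(counts),
        "median": median(counts),
    }

def partition_codes(freq, threshold=500):
    """Split codes into major (more than `threshold` samples)
    and minor (at most `threshold` samples) partitions."""
    major = {c for c, n in freq.items() if n > threshold}
    minor = set(freq) - major
    return major, minor
```

On a real collection, `labels` would hold the main IPC subclass code of every patent; a median far below the mean, as reported above for both benchmark datasets, is the signature of the Pareto-like imbalance.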
3. Ensemble method for classification of imbalanced patent data

An ensemble architecture for automated patent classification was introduced by Kamateri and Salampasis in [14]. The architecture consists of individual classifiers that can be of any number and any type, and that can be trained on the same or different parts of the patent document. Each classifier produces a list of probabilities for all labels based on its whole or partial knowledge of the patent. The probabilities for a specific label derived from all individual classifiers are then combined, and a total probability is calculated for this label. The label with the maximum probability constitutes the predicted label for the patent. The probabilities of the individual classifiers can be aggregated using simple/weighted averaging, voting, stacking, or other combination techniques.

In this study, we apply this ensemble architecture to address the imbalanced patent data issue, equipping it with two baseline classifiers and a meta-classifier (Figure 3). The first classifier is trained on high-represented classification codes, while the second classifier is trained on low-represented codes. Thus, each classifier specializes in a portion of the codes, those with high and low patent frequency, respectively. This means that if a patent application characterized by a classification code of low frequency is submitted to the first classifier, which specializes in high-represented codes, the classifier will not be able to classify it correctly, since it was (probably) not trained on similar patents. Conversely, if this patent application is submitted to the second classifier, which is more sensitive to codes with low patent frequency, there are better chances of it being properly classified under the correct classification code corresponding to the described invention. In such cases, an appropriate combination of two baseline classifiers can better approximate such a boundary by dividing the data space into smaller and easier-to-learn partitions. A meta-classifier is then trained on the features that are the outputs of the baseline classifiers to learn how best to combine their predictions (stacking). More specifically, the meta-classifier distinguishes whether the described invention of a patent application belongs to a high- or a low-represented classification code and, accordingly, coordinates the operation (sigmoid stacking classifier) or selects the more appropriate (softmax stacking classifier) of the two baseline classifiers to classify an incoming patent application.

Figure 3: Ensemble architecture for automated patent classification focusing on imbalanced patent data.

4. Data collection

To evaluate the real-world performance of the proposed ensemble method for imbalanced patent data, two patent benchmark datasets have been used: the USPTO-2M and the CLEFIP-0.54M.

4.1. USPTO-2M

The USPTO-2M is a large-scale dataset prepared for patent classification [6]. The raw patent data were obtained from the online website of the United States Patent and Trademark Office (USPTO) and cover the years 2006 to 2015. The dataset contains 2,000,147 patents with the title and abstract sections, classified in 637 categories at the subclass level.

4.2. CLEFIP-0.54M

The CLEFIP-0.54M contains English patents of CLEF-IP 2011 with the main classification code and all the following six patent sections: Title, Abstract, Description, Claims, Applicants and Inventors. In total, the dataset contains 541,131 patents classified in 731 subclass codes, of which 276,794 come from the European Patent Office (EPO) and 264,337 from the World Intellectual Property Organization (WIPO)³.

5. Experimental setup

The ensemble architecture presented in [14] is instantiated in this study as a single-label classification task at the subclass (3rd) level of the IPC 5+ level hierarchy. More specifically, the aim is to identify the main classification code. In the CLEFIP-0.54M dataset, this information is provided by the dataset. In the USPTO dataset, when many codes are assigned to a patent, we assume that the first code is the main classification code.

An ensemble of bidirectional LSTM classifiers was employed, since this ML method was shown in [14] to attain better results than other DL methods. Each classifier was trained on codes of different patent frequency: low-represented codes with patent frequency between 0 and 500 patents, and high-represented codes with patent frequency over 500 patents, respectively. The outcome probabilities of the individual classifiers were used as input to a meta-classifier using the stacking technique. The meta-classifier is a neural network with two dense layers. The second dense layer is activated with a softmax or a sigmoid activation in order to obtain a probability distribution over all targeted labels/codes.

With respect to the patent representation, the first 60 words of the patent part of interest (e.g., title, abstract, etc.) were used, after a sequence of preprocessing steps (cleaning punctuation, symbols and numbers, and stop word removal). The feature words were then mapped to embeddings using a domain-specific pre-trained language model created on a patent dataset, proposed by Risch and Krestel [7].

The dataset was split into training, validation and testing sets (80:10:10). The batch size was set to 128, and the number of epochs to 15 for the baseline classifiers and 20 for the meta-classifier.

6. Results

In each experiment, two baseline classifiers were trained on two different data partitions. The first classifier was trained on patents belonging to high-represented codes, with patent frequency over 500 patents, while the second classifier was trained on patents of low-represented codes, with patent frequency between 1 and 500 patents. Table 1 presents the accuracy attained by each classifier i) when tested on the same data partition on which it was trained ("Testing on the same data partition"), and ii) when tested on the entire dataset, containing both data partitions with known and unknown data ("Testing on the entire dataset"). It also presents the accuracy of the meta-classifier combining the outcomes of the two baseline classifiers using a stacking technique. Last, it presents the accuracy of the ensemble of classifiers combining sigmoid predictions from different patent sections.

Table 1: Accuracy at subclass level. For classifiers 1 and 2, the two values give the accuracy when testing on the same data partition / on the entire dataset.

Dataset | Section | Classifier 1 (trained on high-represented codes) | Classifier 2 (trained on low-represented codes) | Meta-classifier (Softmax) | Meta-classifier (Sigmoid) | Baseline classifier trained on the entire dataset
USPTO-2M | Title | 55.34% / 54.28% | 65.43% / 1.50% | 54.65% | 55.39% | 53.44%
USPTO-2M | Abstract | 59.85% / 59.86% | 71.79% / 1.65% | 59.86% | 60.64% | 58.61%
CLEFIP-0.54M | Abstract | 68.02% / 63.91% | 65.72% / 9.37% | 67.69% | 68.14% | 63.76%
CLEFIP-0.54M | Description | 70.59% / 66.43% | 71.23% / 10.16% | 69.47% | 71.10% | 66.46%
CLEFIP-0.54M | Claims | 68.64% / 64.59% | 64.42% / 9.52% | 68.23% | 68.88% | 64.56%

Combining predictions from all patent sections (weighted average), the reported accuracies are 61.98% / 62.11% / 59.92% for the USPTO-2M and 75.36% / 75.40% / 70.39% for the CLEFIP-0.54M, for the softmax ensemble, the sigmoid ensemble, and the weighted-average ensemble of [14], respectively.

In both datasets, the accuracy is much improved when a stacking technique is applied, combining the predicted probabilities of the individual classifiers specialized in high- and low-represented codes, respectively. Moreover, the stacking classifier using the sigmoid activation seems to slightly outperform the stacking classifier using the softmax activation.

³ CLEFIP-0.54M 2022 (accessed 18/12/2022), https://github.com/ekamater/CLEFIP2011_XML2MySQL
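As a concrete illustration of the stacking step evaluated above, the sketch below passes the concatenated probability vectors of the two baseline classifiers through a two-dense-layer meta-network whose output layer is a softmax, matching the softmax variant. The layer sizes, the ReLU hidden activation and the random (untrained) weights are assumptions made for illustration; the paper does not publish its implementation, and training of the meta-classifier is omitted here.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def stacked_prediction(p_high, p_low, w1, b1, w2, b2):
    """Meta-classifier forward pass: concatenate the probability vectors
    of the high- and low-represented-code classifiers, apply two dense
    layers, and return a single distribution over all codes."""
    x = np.concatenate([p_high, p_low])      # stacked baseline outputs
    h = np.maximum(w1 @ x + b1, 0.0)         # first dense layer (ReLU)
    return softmax(w2 @ h + b2)              # second dense layer (softmax)

# Toy dimensions: 5 subclass codes, 8 hidden units, untrained random weights.
rng = np.random.default_rng(0)
n_codes, hidden = 5, 8
w1 = rng.normal(size=(hidden, 2 * n_codes)); b1 = np.zeros(hidden)
w2 = rng.normal(size=(n_codes, hidden));     b2 = np.zeros(n_codes)
p_high = softmax(rng.normal(size=n_codes))   # output of baseline classifier 1
p_low = softmax(rng.normal(size=n_codes))    # output of baseline classifier 2
combined = stacked_prediction(p_high, p_low, w1, b1, w2, b2)
```

In the sigmoid variant, the output layer would use per-code sigmoids instead of a softmax, so the scores need not sum to one; in either case the predicted main code is simply `combined.argmax()`.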
It is also clear that the proposed method provides better results than those obtained by recent state-of-the-art techniques [14, 18, 19].

7. Conclusions

In this study, a novel ensemble method for patent classification is presented, addressing the imbalanced patent data problem, which is one of the most significant factors reducing accuracy in automated patent classification. The results showed that a proper combination of classifiers can attain significantly improved accuracy compared to baseline classifiers and existing classification techniques. Moreover, combining the knowledge gained from multiple classifiers could address the problem of low patent sample representation for codes, a phenomenon that is relatively common in the patent domain as the IPC/CPC taxonomy evolves, with new codes introduced, codes partitioned into sub-categories, etc.

Acknowledgements

The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number: 10695).

References

[1] M. Salampasis, G. Paltoglou, A. Giahanou, Report on the CLEF-IP 2012 Experiments: Search of Topically Organized Patents. Conference and Labs of the Evaluation Forum, 2012.
[2] E. Perez-Molina, F. Loizides, Novel data structure and visualization tool for studying technology evolution based on patent information: The DTFootprint and the TechSpectrogram. World Patent Information 64 (2021) 102009. doi: https://doi.org/10.1016/j.wpi.2020.102009.
[3] T. Montecchi, D. Russo, Y. Liu, Searching in Cooperative Patent Classification: Comparison between keyword and concept-based search. Advanced Engineering Informatics 27(3) (2013) 335-345. doi: https://doi.org/10.1016/j.aei.2013.02.002.
[4] M. F. Grawe, C. A. Martins, A. G. Bonfante, Automated patent classification using word embedding. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), 2017, pp. 408-411. doi: https://doi.org/10.1109/ICMLA.2017.0-127.
[5] L. Xiao, G. Wang, Y. Zuo, Research on patent text classification based on word2vec and LSTM. In 2018 11th International Symposium on Computational Intelligence and Design (ISCID), 2018, pp. 71-74. doi: https://doi.org/10.1109/ISCID.2018.00023.
[6] S. Li, J. Hu, Y. Cui, J. Hu, DeepPatent: patent classification with convolutional neural networks and word embedding. Scientometrics 117(2) (2018) 721-744. doi: https://doi.org/10.1007/s11192-018-2905-5.
[7] J. Risch, R. Krestel, Domain-specific word embeddings for patent classification. Data Technologies and Applications 53 (2019) 108-122. doi: https://doi.org/10.1108/DTA-01-2019-0002.
[8] H. Feng, W. Qin, H. Wang, Y. Li, G. Hu, A combination of resampling and ensemble method for text classification on imbalanced data. In: Wei, J., Zhang, LJ. (eds), Big Data – BigData 2021. Lecture Notes in Computer Science, volume 12988. Springer, Cham. doi: https://doi.org/10.1007/978-3-030-96282-1_1.
[9] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16 (2002) 321-357. doi: https://doi.org/10.1613/jair.953.
[10] G. E. Batista, A. L. Bazzan, M. C. Monard, Balancing training data for automated annotation of keywords: a case study. In WOB, 2003, pp. 10-18.
[11] B. Krawczyk, M. Koziarski, M. Woźniak, Radial-based oversampling for multiclass imbalanced data classification. IEEE Transactions on Neural Networks and Learning Systems 31(8) (2019) 2818-2831. doi: 10.1109/TNNLS.2019.2913673.
[12] Y. Zhao, A. K. Shrivastava, K. L. Tsui, Imbalanced classification by learning hidden data structure. IIE Transactions 48(7) (2016) 614-628. doi: https://doi.org/10.1080/0740817X.2015.1110269.
[13] C. Cao, Z. Wang, IMCStacking: cost-sensitive stacking learning with feature inverse mapping for imbalanced problems. Knowledge-Based Systems 150 (2018) 27-37. doi: https://doi.org/10.1016/j.knosys.2018.02.031.
[14] E. Kamateri, M. Salampasis, An Ensemble Architecture of Classifiers for Patent Classification. In Proceedings of the 3rd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech), 2022, pp. 6-7. doi: https://doi.org/10.34726/3550.
[15] M. R. Gouvea Meireles, G. Ferraro, S. Geva, Classification and information management for patent collections: a literature review and some research questions. Information Research 21(1) (2016) 7051-29.
[16] K. Benzineb, J. Guyot, Automated patent classification. In M. Lupu, K. Mayer, J. Tait, A. J. Trippe (Eds.), Current Challenges in Patent Information Retrieval, Springer, London, 2011, pp. 239-262. doi: https://doi.org/10.1007/978-3-642-19231-9_12.
[17] E. Kamateri, V. Stamatis, K. Diamantaras, M. Salampasis, Automated Single-Label Patent Classification using Ensemble Classifiers. In 2022 14th International Conference on Machine Learning and Computing (ICMLC), 2022, pp. 324-330. doi: https://doi.org/10.1145/3529836.3529849.
[18] M. Sofean, Deep learning based pipeline with multichannel inputs for patent classification. World Patent Information 66 (2021) 102060. doi: https://doi.org/10.1016/J.WPI.2021.102060.
[19] D. Tikk, G. Biró, A. Törcsvári, A hierarchical online classifier for patent categorization. In Emerging Technologies of Text Mining: Techniques and Applications, 2008, pp. 244-267. doi: https://doi.org/10.4018/978-1-59904-373-9.CH012.