Use of Pseudo Relevance Feedback for Patent Clustering with Fuzzy C-means Noushin Fadaei Thomas Mandl Hildesheim University Hildesheim University Hildesheim, Germany Hildesheim, Germany fadaei@uni-hildesheim.de mandl@uni-hildesheim.de are growing very fast and required to be more or- ganized for such purposes. According to European Abstract Patent Office (EPO), there is a steady increase in this huge information resource in terms of filed patents Patent databases are meaningful resources for since 2010b [Off18]. In 2016, EPO recorded the high- technology trend detection as they collect in- est number of granted patents which went up to 10.1% formation on the recent key innovations; how- in 2017 [Off18]. To manage the detection of subtopics ever the importance of wordings in patents and their potentiality of being a trend in such big data, and use of complex content are remarkable further grouping of patents is inevitable. challenges in key word extraction in the text Patent analysis approaches are either qualitative or mining phase. Moreover patents share infor- quantitative [Hon09]. The focus of this work lies on mation by nature and depending on the crite- qualitative patent analysis, i.e., it uses text-mining rion of classification such as materials or uses, techniques regarding the content of patents unlike one may belong to multiple classes. For clus- quantitative approaches which employ metadata. The tering patents, this work proposes an updat- process typically aims to transform documents to vec- ing fuzzy c-means clustering which employs tors using weighting systems (e.g., TFIDF) for key pseudo relevance feedback originating from in- terms; Clustering methods then exploit similarity mea- formation retrieval in order to improve fea- sures to examine related vectors (e.g., Euclidean dis- tures extracted from the patent collection fol- tance) and group them into clusters. The efficiency lowing feedbacks. The results show a notice- of clustering is highly dependent on the selected key able improvement in clustering after applying terms [TJC04]. While the key words extraction re- pseudo relevance feedback clustering patents quires the contribution of experts in various domains, under topic contact lenses. automatic methods restrict the number of selected terms by applying thresholds for term frequency in 1 Introduction documents (TF) or document frequency (DF) [TLL07] or consider building key phrases for example by creat- The increasing competition in research conducted at ing a co-relation matrix of high frequent terms in docu- universities and other research institutes as well as in ments where the co-occurrence of terms in the dataset industry, further intensified by the increasing global- is reserved [THTL06]. This work adopts pseudo rele- ization, reinforces the importance of identifying new vance feedback [MRS08b] to obtain the key terms that trends at an early stage. According to a study by tend to form the core concept of clusters. Thomson Reuters 70% to 90% of the information cov- One of the most commonly used clustering algo- ered in patents depending on the research area is rithms in text-mining is k-means; however k-means not published anywhere else [Cor07]. This quality of is not the best clustering method for patent analy- patents makes such databases vital resources for find- sis. Patents are very rich documents in terms of pro- ing new trends in the industry. Yet patent databases fessional information and they may cover a range of Copyright © CIKM 2018 for the individual papers by the papers' technologies, applications or use of various materials. authors. Copyright © CIKM 2018 for the volume as a collection Therefore exhaustive clustering confines patents that potentially belong to different classes of a certain crite- by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0). rion such as technology or material [FMS+ 15]. More- to be represented as vectors (X = (x1 , x2 , . . . , xn )2 ≤ over k-means requires the number of clusters and can- c ≤ n). Each vector xk ∈ Rs is built up by s features not cope with outliers [CHPT05]. where the features are the selected key terms from the To avoid low accuracy, fuzzy non-exhaustive clus- dataset. The goal is to organize X into groups that tering methods are exploited for patent clustering, e.g., may share vectors for further patent analysis purposes by [THTL06] and [DD09]. The fuzziness of the clus- like identifying trends in the given topic [HXW+ 12]. tering method provides a likelihood of possession (or membership to cluster) instead of a rigid distinction 2.1 Citations In Text and the overlapping characteristic of clustering allows one patent reflecting a number of claims contributes Citations within the text should indicate the author’s in multiple clusters. Setting a threshold for member- last name and year[?]. Reference style[?] should follow ship controls to what extent of similarity patents may the style that you are used to using, as long as the show up in a cluster. Fuzzy c-means (FCM) [Bez81] is citation style is consistent. the most popular fuzzy clustering algorithms and here we use it to softly partition our patent collection on 3 Our Approach certain topics. Also, the membership matrix enables 3.1 Fuzzy C-Means Algorithm (FCM) [Bez81] FCM to cope with the issue of the outliers as all mem- bership values of one document to all clusters ought to Fuzzy c-means algorithm (FCM) [Bez81] is based on add up to 1; thus, an unrelated document to all clus- the fuzzy membership matrix. The membership de- ters receives insignificant membership values for each scribes the likelihood of each vector (document) xk and every cluster and can be ignored. being a part of cluster (subtopic) ci where 1 ≤ k ≤ n The ultimate goal of this study is to investigate the and 1 ≤ i ≤ c with c being the number of clusters. The main idea of partitioning a patent dataset by means of overall membership of each vector is normalized to 1. a fuzzy clustering which allows for updating through The algorithm starts off with initializing c vectors as relevance feedback. Practically this study will run centroids of the clusters. Then the membership (wik ) with help of patent expert users however for experi- is calculated through the Euclidean distances (dik ) of mental purposes and analysing the validity of the ap- each vector (xk ) to the centroid (Pi ) of the cluster it proach, we organize it with FCM and pseudo rele- belongs to and to the centroids of other clusters. vance feedback. For the evaluation, we developed a Pn m benchmark based on World Intellectual Property Orga- k=1 (wik ) xk Pi = P n m (1) nization’s patent reports on recent technology trends k=1 (wik ) which were edited by experts in the domains. The c reports describe technologies and trends within a do- (b) X 1 main, e.g. contact lenses or robotic arms. These re- Wik = 2 (dik )( b) m−1 (2) j=1 [( (djk )( b) ) ] ports are published at the website of the World Intel- lecutal Property Organization (WIPO) 1 . The selected reports are provided by Gridlogics Technologies Pvt. The FCM membership function is calculated as Ltd and were partially generated by the use of Patent [HXW+ 12]: iNSIGHT Pro. The developed benchmark can be used c −1 for experiments in classification, clustering, trend anal- X ||xj − vi ||A m−1 2 µ(i,j) = [ ( ) ] (3) ysis or other intelligent patent processing systems. t=1 ||xj − vt ||A 2 Problem Statement µ(i,j) represents the membership value of jth patent of the dataset and ith topic whose centroid is vi . ||||A Given the query of the expert user which determines stands for norm function. the scope of the topic or more generally given the In- ternational Patent Classification (IPC) 2 of the topic of The memberships updates centroids vectors until interest, we retrieve the patents of the domain from the the overall distance of the updated centroids is less patent dataset. The set of n patents is then supposed than ε compared to the last set of centroids. The 1 http://www.wipo.int/patentscope/en/programs/patent steps of FCM algorithm are as follows: landscapes/plrdb.html 2 According to World Intellectual Property Organiza- tion(WIPO), The International Patent Classification (IPC), es- tablished by the Strasbourg Agreement 1971, provides for a hi- 1. Set the number of clusters to be found (c) erarchical system of language independent symbols for the clas- sification of patents. 2. Set an Euclidean normalization and fuzziness (m) 3. Initialize of cluster prototype P 0 , set the iterative similarity score to the query. Then the top k docu- counter (b) ments are considered as the feedback for best results. Pseudo relevance feedback considers these documents 4. Obtain membership matrix using Equation 2 the source of relevant key terms to the query. above We make use of this method in FCM so that we 5. Update the centroids using Equation 1 above find the patents that are more similar to the core con- cept of the cluster or centroids. The hypothesis is the 6. Repeat starting from step 2 until ||P (b) − key terms provided by these patents reflect the main P (b−1) || <  idea of the cluster and their corresponding vectors can represent a more manageable dataset for FCM. The The algorithm description and formulations are in- procedure is shown in figure 1. It generates an over- spired by [NNB15]. lapping clustering based on the membership matrix of FCM. As long as a relevance feedback is requested the 3.2 Updating Fuzzy C-Means using Pseudo following procedure runs: Relevance Feedback 1. Rank patents based on their membership to the cluster Parameter initialization (number 2. Obtain the top relevant patents for building the of clusters (c), fuzziness (m) and iterations (b)) features for vectors (k% of the size of the corre- sponding cluster) Initialize centroids 𝑃0 (update 𝑝𝑖 if 3. Update X = (x1 , x2 , . . . , xn )2 ≤ c ≤ n (dataset of called/already initialized) vectors) with new feature 4. Pass X to FCM so it starts updating the mem- Calculate the membership matrix X: set of vectors of dataset bership function by updating membership values X with reduced dimension: 𝑊𝑖𝑘 NO update features after RF 5. Stop if after clustering result of FCM no relevance with the key terms of only top 10% similar vectors to feedback (RF) is requested the final clusters centroids Minimum Ɛ Or Maximum iteration? 4 Experiment YES 4.1 Datasets and Experiments Settings Clustering Result We used freely available queries that are provided by the World Intellectual Property Organization YES (WIPO) to form a gold standard. The result sets Relevance were gained through the European Patent Fulltext feedback? (EPFULL) repository. For this work, we used a set, NO namely contact lenses to be clustered and another set, robotic arms to show an insight of the built gold STOP standard. The code was implemented in Python using Natural Language Toolkit (NLTK) for tokenization and stemming but for removing the stop words, Figure 1: Updating fuzzy c-means using pseudo rele- we used our list of stop words for patent analysis. vance feedback For Fuzzy c-means clustering we adopted the code Relevance feedback is a common concept in infor- provided in Github 3 . The overlapping is controlled mation retrieval that involves the user knowledge to by allowing the patents that are at least 99% similar identify and fetch more similar results to the query. to a member of the cluster inside the cluster as a Pseudo relevance feedback resembles the same con- member. The fuzzifier is set on 1.2, the error on 0.001 cept but the entire process is carried out automati- and the number of iterations on 200. We run the cally. The goal is to identify the relevant key terms method for c=3, c=13 and c=23 and for visualization regarding the query and to expand them in order to in 2D, we have used Principal Component analysis update the query and make it in line with more rele- (PCA). vant documents. The procedure starts with retrieving relevant documents and sorting them based on their 3 https://github.com/holtwashere/PossibilisticCMeans 4.2 Evaluation Metrics 4.3 Gold Standard based on World Intellec- tual Property Organization (WIPO) Re- Evaluation measures used for granulated clusters (e. ports g. k-means clusters) such as Dunn, Mutual Informa- There are a number of technological reports available tion (MI), F-measure, Rand Index and Jaccard are not by WIPO under the universal title of public health/life quite useful for measuring overlapping clusters; for in- science. These reports usually cover two types of stance Dunn gives a higher score to the clustering sys- queries; one results in the main class (usually shares tems that assign the data points to more distant clus- the title with the topic that is reported) and one is ters while the contents of each cluster are pretty close. breaking the main query into pieces; thus produces Considering a criterion like uses, we know that one subclasses. Some reports categorize the foregoing sub- patent might have several usages and such measures classes and provide fewer yet more general subclasses. are not revealing any required information. Purity ex- The following example depicts search strings provided poses the very nature of a cluster: the degree of consis- by WIPO report which results to a main class, namely tency. The higher is the purity the less is random clus- robotic arms followed by a further query that along tering. This is one important characteristic of clusters with the main query leads into a subclass of robotic in the patent clustering task, nevertheless we cannot arms: neglect the drawbacks of Purity: it is highly dependent on the number of clusters. Like Purity, MI is also in- fluenced by the number of clusters, while Normalized 1. Query to the main class robotic arm: Mutual Information (NMI) enables us to compare the clusters with each other (it ranges between 0 and 1) (FT=(robot* or (artificial w/2 intelligence) or and it is not affected greatly by the inaccurate number android or cyborg or humanoid*)) or (TAC= of clusters. For assessing the quality of this clustering, (manipulator* or manipulater* or actuator* we have used Normalized Mutual Information (NMI) or actuater* or drives or joint or joints or by Fred and Jain [FJ03] and Purity [MRS08a]. actuation or (”end effector” or ”end effecter”) or ((pneumatic* or air) w/2 muscle*))) and ((IC= B25J9/02 or B25J9/04 or B25J9/06 or B25J13/02 or B25J13/08 or B25J17 or B25J18) or (UC=901/2 or 901/14 or 901/19 or 901/27 or 901/31 or 901/39 or 700/245 or 700/248 or 700/261)) 2. Query to its subclass Anthropomorphic Robot: (TAC) contains (humanoid or android or anthro- pomorphic* or anthropomorfic*) Figure 2: FCM results on 13 clusters under topic Con- tact Lenses Query guide: FT-Full Text, TAC- Title Abstract Claim, IC- International Class, UC- US Class. w/2 shows the maximum number of intervening unmatched positions doesn’t exceed 2. Depending on the authors of the reports, queries are described in different languages and formats. There- fore the queries had to be adjusted to match Json data type. Using Elasticsearch’s API, we collected data through the EPFULL database and obtained the main class and subsequently the corresponding subclasses. For instance under the topic robotic arms, Figure 3: FCM and RF results on 13 clusters under 519 patents were retrieved, out of which 511 patents topic Contact Lenses were covered by subclasses from which 293 hits belong to the subclasses of the types criterion, 474 hits fit in the subclasses of applications criterion and the [CHPT05] Bernard Chen, R. Harrison, Yi Pan, and Table 1: Evaluation results of clustering of the topic Phang C. Tai. Novel hybrid hierarchical- Contact Lenses k-means clustering method (h-k-means) NMI Purity for microarray analysis. In Proceedings #Clusters FCM FCM&RF FCM FCM&RF of the 2005 IEEE Computational Systems 3 clusters 0.18 0.22 0.76 0.77 Bioinformatics Conference - Workshops, 13 clusters 0.05 0.16 0.78 0.77 CSBW ’05, pages 105–108, Washington, 23 clusters 0.09 0.13 0.76 0.70 DC, USA, 2005. IEEE Computer Society. subclasses of parts criterion cover 701 hits. Patents [Cor07] The Thomson Corporation. Global patent may share different subclasses and some available sources: An overview of international patents in EPFULL may not be covered by any sub- patents. 2007. class. The results of the proposed clustering method [DD09] Türkay Dereli and Alptekin Durmuşoğlu. have been examined against these retrieved subclasses. Classifying technology patents to identify trends: Applying a fuzzy-based clustering approach in the turkish textile industry. 5 Results Technology in Society, 31(3):263 – 272, 2009. The results of clustering under the topic contact lenses shows using pseudo relevance feedback can definitely [FJ03] Ana L N Fred and Anil K. Jain. Ro- help patents picking the better cluster. According to bust data clustering. In Proceedings of table 1, while the Purity results remain more or less the IEEE Computer Society Conference the same the normalized mutual information (NMI) on Computer Vision and Pattern Recog- improved 4% to 11% for all three number of clusters. nition, volume 2, 2003. Purity is significantly dependent on the number of clusters as when the data themes are noticeably fewer [FMS+ 15] Noushin Fadaei, Thomas Mandl, Michael than the number of clusters the chance of having more Schwantner, Mustafa Sofean, Julia M. consistent clusters would raise. However The results of Struß, Katrin Werner, and Christa Purity under the topic Contact Lenses do not change Womser-Hacker. Patent analysis and diversely for different number of clusters and it remain patent clustering for technology trend relatively high. This reflects the acceptable ability of mining. In HIER workshop, Hildesheim, FCM in this clustering. Moreover, visualizing cluster- Germany, July 2015. ing at 13 clusters with FCM (figure 2) and FCM with [Hon09] Soonwoo Hong. The magic of patent infor- influence of RF (figure 3) using PCA, we observe the mation. World Intellectual Property Or- data points are less scattered after using RF. ganization (WIPO). Available via DIA- LOG., December 2009. 6 Future work [HXW+ 12] Ming Huang, Zhixun Xia, Hongbo Wang, The main purpose of this study is to test whether the Qinghua Zeng, and Qian Wang. The range use of relevance feedback in clustering can play an im- of the value for the fuzzifier of the fuzzy portant role in grouping patents. In future we would c-means algorithm. Pattern Recogn. Lett., use explicit relevance feedback of the expert users to 33(16):2280–2284, December 2012. modify the dimension of the vectors with more useful features. We would also make use of lexical resources [MRS08a] Christopher D. Manning, Prabhakar on top of the feedbacks. In future we would also survey Raghavan, and Hinrich Schütze. Intro- the position of feedback vectors as they are selected duction to Information Retrieval. Cam- by users and may not be necessary appear around the bridge University Press, New York, NY, centroid. USA, 2008. [MRS08b] Christopher D. Manning, Prabhakar References Raghavan, and Hinrich Schütze. Introduc- tion to Information Retrieval. Cambridge [Bez81] James C. Bezdek. Pattern Recognition University Press, Cambridge, UK, 2008. with Fuzzy Objective Function Algorithms. Kluwer Academic Publishers, Norwell, [NNB15] Janmenjoy Nayak, Bighnaraj Naik, and MA, USA, 1981. HS Behera. Computational intelligence in data mining. Springer, New Delhi, India, 2015. [Off18] European Patent Office. Epo quality re- port 2017. 2018. [THTL06] Amy J. C. Trappey, Fu-Chiang Hsu, Charles V. Trappey, and Chia-I Lin. De- velopment of a patent document classifi- cation and search platform using a back- propagation network. Expert Syst. Appl., 31:755–765, 2006. [TJC04] Y. H. Tseng, D. W. Juang, and S. H. Chen. Global and local term expansion for text retrieval. In Proceedings of the fourth NTCIR workshop on evaluation of infor- mation retrieval, automatic text summa- rization and question answering, Tokyo, Japan, June 2004. [TLL07] Yuen-Hsien Tseng, Chi-Jen Lin, and Yu-I Lin. Text mining techniques for patent analysis. Inf. Process. Manage., 43(5):1216–1247, September 2007.