=Paper= {{Paper |id=Vol-2482/paper34 |storemode=property |title=Use of Pseudo Relevance Feedback for Patent Clustering with Fuzzy CMeans |pdfUrl=https://ceur-ws.org/Vol-2482/paper34.pdf |volume=Vol-2482 |authors=Noushin Fadaei,Thomas Mandl |dblpUrl=https://dblp.org/rec/conf/cikm/Fadaei018 }} ==Use of Pseudo Relevance Feedback for Patent Clustering with Fuzzy CMeans== https://ceur-ws.org/Vol-2482/paper34.pdf

Use of Pseudo Relevance Feedback for Patent Clustering
with Fuzzy C-means

Noushin Fadaei Thomas Mandl
Hildesheim University Hildesheim University
Hildesheim, Germany Hildesheim, Germany
fadaei@uni-hildesheim.de mandl@uni-hildesheim.de

are growing very fast and required to be more or-
ganized for such purposes. According to European
Abstract Patent Office (EPO), there is a steady increase in this
huge information resource in terms of filed patents
Patent databases are meaningful resources for since 2010b [Off18]. In 2016, EPO recorded the high-
technology trend detection as they collect in- est number of granted patents which went up to 10.1%
formation on the recent key innovations; how- in 2017 [Off18]. To manage the detection of subtopics
ever the importance of wordings in patents and their potentiality of being a trend in such big data,
and use of complex content are remarkable further grouping of patents is inevitable.
challenges in key word extraction in the text Patent analysis approaches are either qualitative or
mining phase. Moreover patents share infor- quantitative [Hon09]. The focus of this work lies on
mation by nature and depending on the crite- qualitative patent analysis, i.e., it uses text-mining
rion of classification such as materials or uses, techniques regarding the content of patents unlike
one may belong to multiple classes. For clus- quantitative approaches which employ metadata. The
tering patents, this work proposes an updat- process typically aims to transform documents to vec-
ing fuzzy c-means clustering which employs tors using weighting systems (e.g., TFIDF) for key
pseudo relevance feedback originating from in- terms; Clustering methods then exploit similarity mea-
formation retrieval in order to improve fea- sures to examine related vectors (e.g., Euclidean dis-
tures extracted from the patent collection fol- tance) and group them into clusters. The efficiency
lowing feedbacks. The results show a notice- of clustering is highly dependent on the selected key
able improvement in clustering after applying terms [TJC04]. While the key words extraction re-
pseudo relevance feedback clustering patents quires the contribution of experts in various domains,
under topic contact lenses. automatic methods restrict the number of selected
terms by applying thresholds for term frequency in
1 Introduction documents (TF) or document frequency (DF) [TLL07]
or consider building key phrases for example by creat-
The increasing competition in research conducted at ing a co-relation matrix of high frequent terms in docu-
universities and other research institutes as well as in ments where the co-occurrence of terms in the dataset
industry, further intensified by the increasing global- is reserved [THTL06]. This work adopts pseudo rele-
ization, reinforces the importance of identifying new vance feedback [MRS08b] to obtain the key terms that
trends at an early stage. According to a study by tend to form the core concept of clusters.
Thomson Reuters 70% to 90% of the information cov-
One of the most commonly used clustering algo-
ered in patents depending on the research area is
rithms in text-mining is k-means; however k-means
not published anywhere else [Cor07]. This quality of
is not the best clustering method for patent analy-
patents makes such databases vital resources for find-
sis. Patents are very rich documents in terms of pro-
ing new trends in the industry. Yet patent databases
fessional information and they may cover a range of
Copyright © CIKM 2018 for the individual papers by the papers' technologies, applications or use of various materials.
authors. Copyright © CIKM 2018 for the volume as a collection
Therefore exhaustive clustering confines patents that
potentially belong to different classes of a certain crite-
by its editors. This volume and its papers are published under
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).
rion such as technology or material [FMS+ 15]. More- to be represented as vectors (X = (x1 , x2 , . . . , xn )2 ≤
over k-means requires the number of clusters and can- c ≤ n). Each vector xk ∈ Rs is built up by s features
not cope with outliers [CHPT05]. where the features are the selected key terms from the
To avoid low accuracy, fuzzy non-exhaustive clus- dataset. The goal is to organize X into groups that
tering methods are exploited for patent clustering, e.g., may share vectors for further patent analysis purposes
by [THTL06] and [DD09]. The fuzziness of the clus- like identifying trends in the given topic [HXW+ 12].
tering method provides a likelihood of possession (or
membership to cluster) instead of a rigid distinction 2.1 Citations In Text
and the overlapping characteristic of clustering allows
one patent reflecting a number of claims contributes Citations within the text should indicate the author’s
in multiple clusters. Setting a threshold for member- last name and year[?]. Reference style[?] should follow
ship controls to what extent of similarity patents may the style that you are used to using, as long as the
show up in a cluster. Fuzzy c-means (FCM) [Bez81] is citation style is consistent.
the most popular fuzzy clustering algorithms and here
we use it to softly partition our patent collection on 3 Our Approach
certain topics. Also, the membership matrix enables
3.1 Fuzzy C-Means Algorithm (FCM) [Bez81]
FCM to cope with the issue of the outliers as all mem-
bership values of one document to all clusters ought to Fuzzy c-means algorithm (FCM) [Bez81] is based on
add up to 1; thus, an unrelated document to all clus- the fuzzy membership matrix. The membership de-
ters receives insignificant membership values for each scribes the likelihood of each vector (document) xk
and every cluster and can be ignored. being a part of cluster (subtopic) ci where 1 ≤ k ≤ n
The ultimate goal of this study is to investigate the and 1 ≤ i ≤ c with c being the number of clusters. The
main idea of partitioning a patent dataset by means of overall membership of each vector is normalized to 1.
a fuzzy clustering which allows for updating through The algorithm starts off with initializing c vectors as
relevance feedback. Practically this study will run centroids of the clusters. Then the membership (wik )
with help of patent expert users however for experi- is calculated through the Euclidean distances (dik ) of
mental purposes and analysing the validity of the ap- each vector (xk ) to the centroid (Pi ) of the cluster it
proach, we organize it with FCM and pseudo rele- belongs to and to the centroids of other clusters.
vance feedback. For the evaluation, we developed a Pn m
benchmark based on World Intellectual Property Orga- k=1 (wik ) xk
Pi = P n m
(1)
nization’s patent reports on recent technology trends k=1 (wik )
which were edited by experts in the domains. The
c
reports describe technologies and trends within a do- (b)
X 1
main, e.g. contact lenses or robotic arms. These re- Wik = 2
(dik )( b) m−1
(2)
j=1 [( (djk )( b) ) ]
ports are published at the website of the World Intel-
lecutal Property Organization (WIPO) 1 . The selected
reports are provided by Gridlogics Technologies Pvt. The FCM membership function is calculated as
Ltd and were partially generated by the use of Patent [HXW+ 12]:
iNSIGHT Pro. The developed benchmark can be used c −1
for experiments in classification, clustering, trend anal- X ||xj − vi ||A m−1
2
µ(i,j) = [ ( ) ] (3)
ysis or other intelligent patent processing systems. t=1
||xj − vt ||A

2 Problem Statement µ(i,j) represents the membership value of jth patent
of the dataset and ith topic whose centroid is vi . ||||A
Given the query of the expert user which determines
stands for norm function.
the scope of the topic or more generally given the In-
ternational Patent Classification (IPC) 2 of the topic of
The memberships updates centroids vectors until
interest, we retrieve the patents of the domain from the
the overall distance of the updated centroids is less
patent dataset. The set of n patents is then supposed
than ε compared to the last set of centroids. The
1 http://www.wipo.int/patentscope/en/programs/patent steps of FCM algorithm are as follows:
landscapes/plrdb.html
2 According to World Intellectual Property Organiza-

tion(WIPO), The International Patent Classification (IPC), es-
tablished by the Strasbourg Agreement 1971, provides for a hi- 1. Set the number of clusters to be found (c)
erarchical system of language independent symbols for the clas-
sification of patents. 2. Set an Euclidean normalization and fuzziness (m)
3. Initialize of cluster prototype P 0 , set the iterative similarity score to the query. Then the top k docu-
counter (b) ments are considered as the feedback for best results.
Pseudo relevance feedback considers these documents
4. Obtain membership matrix using Equation 2 the source of relevant key terms to the query.
above We make use of this method in FCM so that we
5. Update the centroids using Equation 1 above find the patents that are more similar to the core con-
cept of the cluster or centroids. The hypothesis is the
6. Repeat starting from step 2 until ||P (b) − key terms provided by these patents reflect the main
P (b−1) || < idea of the cluster and their corresponding vectors can
represent a more manageable dataset for FCM. The
The algorithm description and formulations are in- procedure is shown in figure 1. It generates an over-
spired by [NNB15]. lapping clustering based on the membership matrix of
FCM. As long as a relevance feedback is requested the
3.2 Updating Fuzzy C-Means using Pseudo following procedure runs:
Relevance Feedback
1. Rank patents based on their membership to the
cluster
Parameter initialization (number 2. Obtain the top relevant patents for building the
of clusters (c), fuzziness (m) and
iterations (b)) features for vectors (k% of the size of the corre-
sponding cluster)

Initialize centroids 𝑃0 (update 𝑝𝑖 if 3. Update X = (x1 , x2 , . . . , xn )2 ≤ c ≤ n (dataset of
called/already initialized)
vectors) with new feature

4. Pass X to FCM so it starts updating the mem-
Calculate the membership matrix X: set of vectors of dataset bership function
by updating membership values
X with reduced dimension:
𝑊𝑖𝑘
NO update features after RF 5. Stop if after clustering result of FCM no relevance
with the key terms of only
top 10% similar vectors to
feedback (RF) is requested
the final clusters centroids
Minimum Ɛ Or
Maximum iteration? 4 Experiment
YES 4.1 Datasets and Experiments Settings
Clustering Result We used freely available queries that are provided
by the World Intellectual Property Organization
YES (WIPO) to form a gold standard. The result sets
Relevance
were gained through the European Patent Fulltext
feedback? (EPFULL) repository. For this work, we used a set,
NO
namely contact lenses to be clustered and another
set, robotic arms to show an insight of the built gold
STOP
standard. The code was implemented in Python using
Natural Language Toolkit (NLTK) for tokenization
and stemming but for removing the stop words,
Figure 1: Updating fuzzy c-means using pseudo rele- we used our list of stop words for patent analysis.
vance feedback For Fuzzy c-means clustering we adopted the code
Relevance feedback is a common concept in infor- provided in Github 3 . The overlapping is controlled
mation retrieval that involves the user knowledge to by allowing the patents that are at least 99% similar
identify and fetch more similar results to the query. to a member of the cluster inside the cluster as a
Pseudo relevance feedback resembles the same con- member. The fuzzifier is set on 1.2, the error on 0.001
cept but the entire process is carried out automati- and the number of iterations on 200. We run the
cally. The goal is to identify the relevant key terms method for c=3, c=13 and c=23 and for visualization
regarding the query and to expand them in order to in 2D, we have used Principal Component analysis
update the query and make it in line with more rele- (PCA).
vant documents. The procedure starts with retrieving
relevant documents and sorting them based on their 3 https://github.com/holtwashere/PossibilisticCMeans
4.2 Evaluation Metrics 4.3 Gold Standard based on World Intellec-
tual Property Organization (WIPO) Re-
Evaluation measures used for granulated clusters (e. ports
g. k-means clusters) such as Dunn, Mutual Informa-
There are a number of technological reports available
tion (MI), F-measure, Rand Index and Jaccard are not
by WIPO under the universal title of public health/life
quite useful for measuring overlapping clusters; for in-
science. These reports usually cover two types of
stance Dunn gives a higher score to the clustering sys-
queries; one results in the main class (usually shares
tems that assign the data points to more distant clus-
the title with the topic that is reported) and one is
ters while the contents of each cluster are pretty close.
breaking the main query into pieces; thus produces
Considering a criterion like uses, we know that one
subclasses. Some reports categorize the foregoing sub-
patent might have several usages and such measures
classes and provide fewer yet more general subclasses.
are not revealing any required information. Purity ex-
The following example depicts search strings provided
poses the very nature of a cluster: the degree of consis-
by WIPO report which results to a main class, namely
tency. The higher is the purity the less is random clus-
robotic arms followed by a further query that along
tering. This is one important characteristic of clusters
with the main query leads into a subclass of robotic
in the patent clustering task, nevertheless we cannot
arms:
neglect the drawbacks of Purity: it is highly dependent
on the number of clusters. Like Purity, MI is also in-
fluenced by the number of clusters, while Normalized
1. Query to the main class robotic arm:
Mutual Information (NMI) enables us to compare the
clusters with each other (it ranges between 0 and 1) (FT=(robot* or (artificial w/2 intelligence) or
and it is not affected greatly by the inaccurate number android or cyborg or humanoid*)) or (TAC=
of clusters. For assessing the quality of this clustering, (manipulator* or manipulater* or actuator*
we have used Normalized Mutual Information (NMI) or actuater* or drives or joint or joints or
by Fred and Jain [FJ03] and Purity [MRS08a]. actuation or (”end effector” or ”end effecter”)
or ((pneumatic* or air) w/2 muscle*))) and
((IC= B25J9/02 or B25J9/04 or B25J9/06 or
B25J13/02 or B25J13/08 or B25J17 or B25J18)
or (UC=901/2 or 901/14 or 901/19 or 901/27
or 901/31 or 901/39 or 700/245 or 700/248 or
700/261))

2. Query to its subclass Anthropomorphic
Robot:
(TAC) contains (humanoid or android or anthro-
pomorphic* or anthropomorfic*)
Figure 2: FCM results on 13 clusters under topic Con-
tact Lenses
Query guide: FT-Full Text, TAC- Title Abstract
Claim, IC- International Class, UC- US Class.
w/2 shows the maximum number of intervening
unmatched positions doesn’t exceed 2.

Depending on the authors of the reports, queries are
described in different languages and formats. There-
fore the queries had to be adjusted to match Json
data type. Using Elasticsearch’s API, we collected
data through the EPFULL database and obtained
the main class and subsequently the corresponding
subclasses. For instance under the topic robotic arms,
Figure 3: FCM and RF results on 13 clusters under 519 patents were retrieved, out of which 511 patents
topic Contact Lenses were covered by subclasses from which 293 hits belong
to the subclasses of the types criterion, 474 hits fit
in the subclasses of applications criterion and the
[CHPT05] Bernard Chen, R. Harrison, Yi Pan, and
Table 1: Evaluation results of clustering of the topic
Phang C. Tai. Novel hybrid hierarchical-
Contact Lenses
k-means clustering method (h-k-means)
NMI Purity for microarray analysis. In Proceedings
#Clusters
FCM FCM&RF FCM FCM&RF of the 2005 IEEE Computational Systems
3 clusters 0.18 0.22 0.76 0.77 Bioinformatics Conference - Workshops,
13 clusters 0.05 0.16 0.78 0.77 CSBW ’05, pages 105–108, Washington,
23 clusters 0.09 0.13 0.76 0.70 DC, USA, 2005. IEEE Computer Society.

subclasses of parts criterion cover 701 hits. Patents [Cor07] The Thomson Corporation. Global patent
may share different subclasses and some available sources: An overview of international
patents in EPFULL may not be covered by any sub- patents. 2007.
class. The results of the proposed clustering method [DD09] Türkay Dereli and Alptekin Durmuşoğlu.
have been examined against these retrieved subclasses. Classifying technology patents to identify
trends: Applying a fuzzy-based clustering
approach in the turkish textile industry.
5 Results Technology in Society, 31(3):263 – 272,
2009.
The results of clustering under the topic contact lenses
shows using pseudo relevance feedback can definitely [FJ03] Ana L N Fred and Anil K. Jain. Ro-
help patents picking the better cluster. According to bust data clustering. In Proceedings of
table 1, while the Purity results remain more or less the IEEE Computer Society Conference
the same the normalized mutual information (NMI) on Computer Vision and Pattern Recog-
improved 4% to 11% for all three number of clusters. nition, volume 2, 2003.
Purity is significantly dependent on the number of
clusters as when the data themes are noticeably fewer [FMS+ 15] Noushin Fadaei, Thomas Mandl, Michael
than the number of clusters the chance of having more Schwantner, Mustafa Sofean, Julia M.
consistent clusters would raise. However The results of Struß, Katrin Werner, and Christa
Purity under the topic Contact Lenses do not change Womser-Hacker. Patent analysis and
diversely for different number of clusters and it remain patent clustering for technology trend
relatively high. This reflects the acceptable ability of mining. In HIER workshop, Hildesheim,
FCM in this clustering. Moreover, visualizing cluster- Germany, July 2015.
ing at 13 clusters with FCM (figure 2) and FCM with [Hon09] Soonwoo Hong. The magic of patent infor-
influence of RF (figure 3) using PCA, we observe the mation. World Intellectual Property Or-
data points are less scattered after using RF. ganization (WIPO). Available via DIA-
LOG., December 2009.
6 Future work
[HXW+ 12] Ming Huang, Zhixun Xia, Hongbo Wang,
The main purpose of this study is to test whether the Qinghua Zeng, and Qian Wang. The range
use of relevance feedback in clustering can play an im- of the value for the fuzzifier of the fuzzy
portant role in grouping patents. In future we would c-means algorithm. Pattern Recogn. Lett.,
use explicit relevance feedback of the expert users to 33(16):2280–2284, December 2012.
modify the dimension of the vectors with more useful
features. We would also make use of lexical resources [MRS08a] Christopher D. Manning, Prabhakar
on top of the feedbacks. In future we would also survey Raghavan, and Hinrich Schütze. Intro-
the position of feedback vectors as they are selected duction to Information Retrieval. Cam-
by users and may not be necessary appear around the bridge University Press, New York, NY,
centroid. USA, 2008.

[MRS08b] Christopher D. Manning, Prabhakar
References Raghavan, and Hinrich Schütze. Introduc-
tion to Information Retrieval. Cambridge
[Bez81] James C. Bezdek. Pattern Recognition University Press, Cambridge, UK, 2008.
with Fuzzy Objective Function Algorithms.
Kluwer Academic Publishers, Norwell, [NNB15] Janmenjoy Nayak, Bighnaraj Naik, and
MA, USA, 1981. HS Behera. Computational intelligence in
data mining. Springer, New Delhi, India,
2015.

[Off18] European Patent Office. Epo quality re-
port 2017. 2018.
[THTL06] Amy J. C. Trappey, Fu-Chiang Hsu,
Charles V. Trappey, and Chia-I Lin. De-
velopment of a patent document classifi-
cation and search platform using a back-
propagation network. Expert Syst. Appl.,
31:755–765, 2006.
[TJC04] Y. H. Tseng, D. W. Juang, and S. H.
Chen. Global and local term expansion for
text retrieval. In Proceedings of the fourth
NTCIR workshop on evaluation of infor-
mation retrieval, automatic text summa-
rization and question answering, Tokyo,
Japan, June 2004.
[TLL07] Yuen-Hsien Tseng, Chi-Jen Lin, and
Yu-I Lin. Text mining techniques for
patent analysis. Inf. Process. Manage.,
43(5):1216–1247, September 2007.