=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_01
|storemode=property
|title=Forecasting Patent Growth by Combining Time-Series Signals Using Covariance Patterns
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_01.pdf
|volume=Vol-2621
|authors=Manajit Chakraborty,Seyed Ali Bahrainian,Fabio Crestani
|dblpUrl=https://dblp.org/rec/conf/circle/ChakrabortyBC20
}}
==Forecasting Patent Growth by Combining Time-Series Signals Using Covariance Patterns==
Manajit Chakraborty (manajit.chakraborty@usi.ch), Seyed Ali Bahrainian (seyed.ali.bahreinian@usi.ch), Fabio Crestani (fabio.crestani@usi.ch)
Università della Svizzera italiana (USI), Switzerland

ABSTRACT

Bibliometrics has previously been employed with patents for technological forecasting. The primary challenge that technological forecasting faces is the early-stage identification of technologies with the potential to have a significant impact on the socio-economic landscape. Bibliographic measures, such as citations, are a good indicator of technological growth. With this intuition, we carry out an exploratory study using various time-series models and topic modeling over patent content to predict the growth or decline of various bibliographic measures for topics in the near future. Intuitively, in order to effectively uncover these citation trends shortly after the patents are issued, we need to look beyond raw citation counts and take into account both the geographical and temporal information. We posit that, instead of using only citation counts for time-series prediction, judicious use of signals from topics generated from documents belonging to various geographical locations can help improve the performance. We carry out experiments on a large collection of patents and present some insightful results and observations.

CCS CONCEPTS

• Information systems → Data analytics.

KEYWORDS

Correlation Analysis, Patent Citations, Prediction, Time series, Topic Modeling, LDA, Citation Growth.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION

A patent is a contract between the inventor or assignee and the state, granting a limited period of time to the inventor to exploit the invention. The reasons for patenting could be myriad, ranging from the elementary need for exclusive rights to a technology or invention to building a positive image of an enterprise. Patents are pivotal for technological innovation in the context where they apply. They can be used to generate revenues, encourage synergistic partnerships, or to create a market advantage and be the basis for technological development.

Patent citations, namely references to prior patent documents and the state-of-the-art included therein, and their frequency are also often used as indicators for the technological and commercial value of a patent [20]. Citations are also used to identify “key” patents, which often vary depending on the nature of the technology. In pharmaceutical technologies, e.g., a patent on one important substance can be determined as a key patent. However, for more involved and complex technologies, such as those used in renewable energy, patents are usually built upon existing technologies. In such cases, it can sometimes be difficult to identify a clear-cut key patent.

Technological forecasting has already been endorsed as an integral element to stay ahead of the curve for corporations and governments [7]. Previous studies, like the one by Acs et al. [1], suggested that patents provide a fairly reliable measure of innovative activity. Citation analysis and especially bibliometrics [3] have been used on citation graphs to identify similar works or to calculate the impact factor of journals, researchers etc. Predicting citation counts for patents is non-trivial and also less useful because citation counts in patents do not change as rapidly as in scholarly papers or other web articles. On the other hand, the change in citation counts, which refers to the rise or decline of the patents of certain categories, could provide us a quantitative as well as a qualitative overview of the patent landscape. It will also indicate which topics, and in turn which technological classes, are supposed to get traction in the upcoming years. Discovering topics from patents and analyzing their evolution over time is beneficial for making important decisions by research institutes, corporations, funding agencies, governments and any other organization involved in production or promotion of intellectual properties. For example, research funding organizations can adjust their granting policies based on insights produced by predictive models in order to favor topics that are trending and gaining increasing attention rather than those that are losing momentum and interest.

https://hbr.org/1967/03/technological-forecasting
https://ec.europa.eu/invest-in-research/pdf/download_en/final_report_hcp.pdf

There are several factors which determine how innovation evolves in a particular geographical location over a period of time, including political, social, environmental and judicial policies among others. While it is nearly impossible to chart all the factors and measure their impact on innovation, investigating how innovation grows irrespective of such influences is still important. While topic models have been used to forecast emerging technologies from the vantage point of technological classes [28], they have not been used in tandem with citations or to chart the citation growth of technology classes. In this paper, we assume that technology classes are represented by a group of patents which can further be delineated by topics drawn from them. We hypothesize that, if we can intelligently leverage the information from both citations and full-text of patents through topic modeling with additional inputs from the geographical regions from where certain topics emerge, it should aid us in predicting the growth of citations in the next time slice with increased accuracy. In light of this, our contributions are two-fold:
• To provide a time-series representation that exploits and infuses the signals arising from patent (text) content, geographical region and accompanying bibliographic measures.
• To improve citation growth prediction using the infused signal with various time-series models.

In order to achieve our first goal, we propose a correlation analysis scheme, which we term CORrelation Analysis using COvariance (CORACO), to harness the best of both worlds from topics derived using LDA [5] from patents and a bibliographic measure, namely citation counts. The second goal is realized by employing regression based time-series models on these infused correlated components. Our experiments are performed on a large open-source patent collection, MAREC (http://www.ifs.tuwien.ac.at/imp/marec.shtml). The efficiency of our approach is corroborated by the significantly improved prediction performance over three different baseline models.

2 RELATED WORKS

It has already been established that statistical analysis of international patent records is a valuable tool for corporate technology analysis and planning. Patents provide a wealth of detailed information, comprehensive coverage of technologies and countries, a relatively standardized level of invention, and long time-series of data [21]. So, it essentially provides us with an indicator to measure technological growth, which in turn could be extrapolated to get a better understanding of the relation and mutual dependence of innovation and economics [16, 22]. One such study to analyze how quantitative R&D and technology indicators may be used to forecast company stock price performance was carried out by Patrick Thomas [26]. On the other hand, full-text analysis of patents using topic modeling has also yielded interesting insights for technological classification, clustering and prediction. With this in mind, the existing literature can be grouped into two broad themes in the context of our research problem:

2.1 Use of Citations for Technological Forecasting

One of the early studies to measure the technological impact based on patent citations was done by Karki [17]. He proposed a host of technological indicators based on citations among patents. Some studies, like the one by Albert et al. [2], have considered only citation counts as indicators of industrially important patents. Zhang et al. [30] proposed to weight 11 indicators of patent value using Shannon entropy, and selected forward citations as one of the most important indicators for technological value. The basic motivation for using citations received as an indicator of quality is that citations indicate some form of knowledge spillovers. As argued by Jaffe et al. [14], citations reflect the fact that either a new technology builds on an existing one, or that they serve a similar purpose.

2.2 Topic Models for Patent Analysis

Latent Dirichlet Allocation (LDA) is a generative topic model which finds latent topics in a text corpus, based on the assumption that authors generally write documents with respect to specific topics [5]. Using the LDA process, a document is represented as a mixture of topics that produce words with certain probabilities. Unlike latent semantic analysis [8], the topics coming from LDA are easier to interpret, because they are represented by combinations of words with contribution probabilities for each topic [29]. Besides, LDA is known as one of the best topic models when dealing with a large corpus and for interpreting the identified latent topics [5].

LDA is known to outperform other dimension-reduction techniques when dealing with a large corpus and for interpreting the identified latent dimensions [5]. Regarding patent-based analysis, studies have applied LDA to the technological trend identification of greenhouse gas reduction technology [18], knowledge organization system development [12], and firms’ technological concentration trends on patent subjects [28]. LDA can identify sub-topics for a technology area composed of many patents, and represent each of the patents in an array of topic distributions. Kim et al. [19] use LDA for visualizing development paths among patents through sensitivity analyses based on semantic patent similarities and citations. Here, the authors use LDA to identify sub-topics of a given technology. Topic models have also been employed for patent classification [27] among other problems. Time-series analysis has been previously used by Holger Ernst [9] to examine the relationship between patent applications and subsequent changes of company performance. We aim to harness the power of both the bibliographic measures and topic modeling to provide us with a more accurate prediction of citation growth using several time-series models.

3 METHODOLOGY

3.1 Topic Extraction

Our first step involved extraction of topics from the collection of patent documents that suitably indicate the latent themes of the collection. For this, we employed LDA with default parameters. Since the document collection is large and we needed to find the best representation of the data using LDA, we performed an optimal topic number estimation. To this end, similar to the method proposed by Griffiths and Steyvers [10], we performed a model selection process. This consists of keeping the LDA Dirichlet hyperparameters (commonly known as 𝛼 and 𝛽) fixed and assigning several values to 𝐾 (parameter for controlling the number of topics). We computed an LDA model for each assignment, and subsequently, we picked the model that satisfies:

<math>\arg\min_{K} \log P(W \mid K)</math>

where 𝑊 indicates all the words in the corpus. We repeated this process for 𝐾 from 100 to 600 in steps of 50 to find the optimal number of topics for all the time slices. We found the optimum value at 𝐾 = 500 topics.

The next step involved measuring the topic strength <math>T_j</math> of each topic <math>t_j</math> per year 𝑦, which can be defined as in Equation 1, where <math>|D_y|</math> denotes the total number of documents in year 𝑦.

<math>T_{j,y} = \sum_{i=1}^{|D_y|} \frac{p(t_j \mid d_i)}{|D_y|} \qquad (1)</math>

The topic probabilities, <math>p(t_j \mid d_i)</math>, are produced by the LDA as scores for each document <math>d_i</math> along with the topics corresponding to <math>d_i</math>.
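
Equation 1 can be illustrated with a short Python sketch. It assumes that the per-document topic probabilities p(t_j | d_i) are already available as one dictionary per document and that every document carries a year. The function name `topic_strength` and the input layout are illustrative choices, not taken from the paper's implementation.

```python
from collections import defaultdict


def topic_strength(doc_topics, doc_years):
    """Average p(t_j | d_i) over all documents of each year (Equation 1).

    doc_topics: list of dicts {topic_id: probability}, one per document
                (hypothetical layout; a topic absent from a dict adds 0).
    doc_years:  list of years, aligned with doc_topics.
    Returns {year: {topic_id: strength}}; topics never seen in a year
    are simply left out of that year's dictionary.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for topics, year in zip(doc_topics, doc_years):
        counts[year] += 1          # running value of |D_y|
        per_topic = sums[year]     # make sure the year exists in the result
        for topic_id, prob in topics.items():
            per_topic[topic_id] += prob
    return {
        year: {t: total / counts[year] for t, total in per_topic.items()}
        for year, per_topic in sums.items()
    }


# Toy data: three documents, two topics, two years.
doc_topics = [{0: 0.7, 1: 0.3}, {0: 0.1, 1: 0.9}, {1: 1.0}]
doc_years = [1980, 1980, 1981]
strength = topic_strength(doc_topics, doc_years)
print(strength)
```

Restricting the input to the documents of one country or continent would give the region-specific strengths in the same way.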

Additionally, we also compute topic strength for each of the six continents (listed in Section 4.2; Antarctica has no patents), <math>T_j^{continent}</math>, and 127 countries, <math>T_j^{country}</math>, in the dataset for each of the years 1980 through 2006. The topic strength distribution helps us gauge the change in the importance of a topic over a given period. This measure is better than topic frequency since it not only considers topic count but also accounts for the potential contribution of a certain topic <math>t_j</math> to some document <math>d_i</math>. For instance, the topic:

• Topic: 2 (Mobile Radio Station)
• Words: 0.236*“station” + 0.158*“mobile” + 0.088*“radio” + 0.052*“stations” + 0.014*“uplink” + 0.013*“downlink” + 0.011*“access” + 0.010*“quality” + 0.010*“cellular” + 0.008*“traffic”

extracted from our collection, which corresponds to “mobile radio stations”, exhibits the topic strength distribution shown in Figure 1. From this figure, we can observe how the topic distribution changes with time depending on the geographical region (in this case, countries). When observed across topics, it also gives a qualitative overview of which countries are more invested in which topics.

Figure 1: Topic Strength Distribution

3.2 Correlated Time-Series Analysis

While topic strength analysis may provide us with some clues regarding how the topics and the corresponding patents related to the topics are changing over a certain period of time, citations provide us with a more tangible resource which helps us measure the change in interest towards certain patents and, by association, certain topics. Hence, our next step was to observe the changes in citation patterns for topics over the 27 year period. The normalized citation count <math>C_{j,y}</math> for topic <math>t_j</math> for year 𝑦 is determined according to Equation 2:

<math>C_{j,y} = \sum_{t_j \to d_i \in D_y} \frac{c_{d_i,y}}{|D_y|} \qquad (2)</math>

where <math>c_{d_i,y}</math> denotes the number of citations received by document <math>d_i</math> in year 𝑦. For the same topic example, “mobile radio station” in the previous section, the corresponding citation count distribution is presented in Figure 2.

Figure 2: Citation Count Distribution

Now that we have two different distributions or signals, one from the text of patents (topic strength) and one bibliometric measure (citations), we would like to judiciously combine them in such a way that it maximizes the accuracy of prediction of citation growth (or decline). Our objective thus translates to quantitatively capturing the correlation between these two signals.

CORACO: In this paper, we propose a method for finding correlation components between two such independent distributions that maximize the commonality of two signals. We call this method CORrelation Analysis with COvariance (CORACO). Thus, our problem can be redefined as a problem of finding two sets of basis vectors, one for <math>\mathbf{a}</math> and the other for <math>\mathbf{b}</math>, such that the projections of the variables onto the covariance matrix of the two signals would be maximized. Here, for simplicity, let us assume that <math>\mathbf{a}</math> and <math>\mathbf{b}</math> are placeholders for <math>T_k</math> and <math>C_k</math> for any given topic <math>t_k</math>, respectively. Let us assume the linear combinations <math>a = \mathbf{a}^T \hat{\mathbf{w}}_a</math> and <math>b = \mathbf{b}^T \hat{\mathbf{w}}_b</math> of the two variables <math>\mathbf{a}</math> and <math>\mathbf{b}</math> respectively, where <math>\hat{\mathbf{w}}_a</math> and <math>\hat{\mathbf{w}}_b</math> are canonical weights. We want to consider the case where only one pair of basis vectors is required, corresponding to the largest correlation
component. This indicates that the function to be maximized is:

<math>
\begin{align}
\rho &= \frac{E[ab]}{\sqrt{E[a^2]\,E[b^2]}} \\
&= \frac{E[\hat{\mathbf{w}}_a^T \mathbf{a}\,\mathbf{b}^T \hat{\mathbf{w}}_b]}{\sqrt{E[\hat{\mathbf{w}}_a^T \mathbf{a}\,\mathbf{a}^T \hat{\mathbf{w}}_a]\,E[\hat{\mathbf{w}}_b^T \mathbf{b}\,\mathbf{b}^T \hat{\mathbf{w}}_b]}} \qquad (3) \\
&= \frac{\mathbf{w}_a^T \mathbf{C}_{ab} \mathbf{w}_b}{\sqrt{\mathbf{w}_a^T \mathbf{C}_{aa} \mathbf{w}_a \; \mathbf{w}_b^T \mathbf{C}_{bb} \mathbf{w}_b}}
\end{align}
</math>

The maximum correlation component can thus be defined as the maximum value that 𝜌 can assume with respect to <math>\mathbf{w}_a</math> and <math>\mathbf{w}_b</math>. The subsequent correlation components are uncorrelated for different solutions, i.e.:

<math>
\begin{align}
E[a_i a_j] &= E[\mathbf{w}_{a_i}^T \mathbf{a}\,\mathbf{a}^T \mathbf{w}_{a_j}] = \mathbf{w}_{a_i}^T \mathbf{C}_{aa} \mathbf{w}_{a_j} = 0 \\
E[b_i b_j] &= E[\mathbf{w}_{b_i}^T \mathbf{b}\,\mathbf{b}^T \mathbf{w}_{b_j}] = \mathbf{w}_{b_i}^T \mathbf{C}_{bb} \mathbf{w}_{b_j} = 0 \qquad \text{for } i \neq j \qquad (4) \\
E[a_i b_j] &= E[\mathbf{w}_{a_i}^T \mathbf{a}\,\mathbf{b}^T \mathbf{w}_{b_j}] = \mathbf{w}_{a_i}^T \mathbf{C}_{ab} \mathbf{w}_{b_j} = 0
\end{align}
</math>

The projections onto <math>\mathbf{w}_a</math> and <math>\mathbf{w}_b</math>, i.e. <math>a</math> and <math>b</math>, describe the underlying “latent” variables. Now, we know that for any two random variables <math>\mathbf{m}</math> and <math>\mathbf{n}</math> with zero mean, the total covariance matrix can be represented as:

<math>
\mathbf{C} = \begin{bmatrix} \mathbf{C}_{mm} & \mathbf{C}_{mn} \\ \mathbf{C}_{nm} & \mathbf{C}_{nn} \end{bmatrix} = E\left[ \begin{pmatrix} \mathbf{m} \\ \mathbf{n} \end{pmatrix} \begin{pmatrix} \mathbf{m} \\ \mathbf{n} \end{pmatrix}^T \right] \qquad (5)
</math>

This is a square matrix where <math>\mathbf{C}_{mm}</math> and <math>\mathbf{C}_{nn}</math> are the intra-set covariance matrices of <math>\mathbf{m}</math> and <math>\mathbf{n}</math> respectively, and <math>\mathbf{C}_{mn} = \mathbf{C}_{nm}^T</math> is the inter-set covariance matrix.

Thus, the correlation components between <math>\mathbf{a}</math> and <math>\mathbf{b}</math> can be found by solving the eigenvalue equations:

<math>
\begin{cases}
\mathbf{C}_{aa}^{-1} \mathbf{C}_{ab} \mathbf{C}_{bb}^{-1} \mathbf{C}_{ba} \hat{\mathbf{w}}_a = \rho^2 \hat{\mathbf{w}}_a \\
\mathbf{C}_{bb}^{-1} \mathbf{C}_{ba} \mathbf{C}_{aa}^{-1} \mathbf{C}_{ab} \hat{\mathbf{w}}_b = \rho^2 \hat{\mathbf{w}}_b
\end{cases} \qquad (6)
</math>

where the eigenvalues <math>\rho^2</math> are the squared correlations and the eigenvectors <math>\hat{\mathbf{w}}_a</math> and <math>\hat{\mathbf{w}}_b</math> are the normalized correlation basis vectors. The number of non-zero solutions to these equations is limited to the smallest dimensionality of <math>\mathbf{a}</math> and <math>\mathbf{b}</math>.

{| class="wikitable"
|+ Table 2: Distribution of patents for top-10 countries
! Country !! No. of docs.
|-
| JP || 159,433
|-
| US || 148,434
|-
| GB || 23,869
|-
| NL || 21,767
|-
| IT || 15,795
|-
| SE || 6,208
|-
| DE || 4,799
|-
| CH || 3,990
|-
| CA || 3,740
|-
| KR || 3,634
|-
| .. || ..
|-
| World || 567,547
|}

In our case, the random variables <math>\mathbf{a}</math> and <math>\mathbf{b}</math> correspond to vectors <math>T_k</math> and <math>C_k</math> for any given topic <math>t_k</math>. The length of both these vectors, <math>T_k</math> and <math>C_k</math>, is 27 since we are observing the values for 27 years (1980-2006). Now, while <math>C_k</math> is fixed, <math>T_k</math> can vary depending on the geographical region. North America, Asia and Europe have the largest share of patents, as depicted in Table 1. Also, from Table 2 we observe the distribution of patents by country. We chose to focus on the top-3 countries for our analysis, since they contribute a large share (58.5%) of the total patents produced in the world. Hence, we need to compute the following four sets of CORACO components:

<math>
\begin{align}
X_{world}, Y_{world} &= \mathrm{CORACO}(T_k^{world}, C_k) \\
X_{JP}, Y_{JP} &= \mathrm{CORACO}(T_k^{JP}, C_k) \\
X_{US}, Y_{US} &= \mathrm{CORACO}(T_k^{US}, C_k) \qquad (7) \\
X_{GB}, Y_{GB} &= \mathrm{CORACO}(T_k^{GB}, C_k)
\end{align}
</math>

{| class="wikitable"
|+ Table 1: Distribution of patents by Continents
! Continent !! No. of docs.
|-
| North America (NA) || 220,153
|-
| Europe (EU) || 162,845
|-
| Australia (OC) || 4,129
|-
| South America (SA) || 664
|-
| Africa (AF) || 890
|-
| Asia (AS) || 178,288
|-
| Antarctica (AN) || 0
|-
| World || 567,547
|}

In Figure 3, we present the infused signals as provided by CORACO for the World for the same topic as in Section 3.1. By World we mean the complete set of patent documents in the collection. The CORACO signals are new projected signals onto the covariance of the original signals. We can observe that CORACO successfully combines the distribution of topic strength of the World with its corresponding citation count distribution such that it minimizes points in the distribution where the covariance is large (e.g. between 1990-1995 in Figure 3). Essentially, it tries to bring both the signals closer on a singular scale to achieve maximum points of similarity. These reinforced signals are then given to time-series models as input for prediction of citation growth. As results in Section 5 will show, using CORACO components instead of raw citation counts indeed improves the performance, thus validating our hypothesis.

Figure 3: CORACO components for <math>T^{world}</math>, <math>C^{world}</math>

3.3 Citation Growth Prediction

The final step is concerned with prediction based on the CORACO components. We argue that using CORACO components as input to the time-series models, instead of using the raw distribution of citation counts for topics, could significantly enhance the performance. This stems from the fact that CORACO components are representations for the correlation between two distributions which should be able to model the commonalities better. With this in mind, we choose to employ three different time-series models:

1. Linear Regression or Autoregression (AR)
2. Moving Average (MA)
3. Simple Exponential Smoothing (SES)

https://www.itl.nist.gov/div898/handbook/pmc/section4/pmc4.htm

Due to the lack of apparent trends or seasonality attributes in the observed variables, we could not use other models such as the Autoregression Moving Average (ARMA), the Autoregressive Integrated Moving Average (ARIMA) or the Seasonal Autoregressive Integrated Moving-Average (SARIMA). It is imperative to mention that our objective is to observe and predict the direction of change in the number of citations (increase or decrease), i.e. the polarity (<math>\Delta C_k</math>) of the citations for the next time window. As stated earlier, we consider one year as a time window. Thus, we are not concerned with predicting the actual number of citations that a topic is supposed to gain in the next time window. This is because the number of citations a patent receives can vary with several exogenous factors such as
niche popularity of the technological area, a continuation of similar research or product by a company or group of companies, a country or region’s sudden interest in a particular technological class etc. The trend of numbers of citations for any patent and thereby any topic is non-decreasing, since citations are accumulative. Therefore, our goal is to predict whether the number of citations for a particular topic is going to increase or decrease in the next time window compared to the last time slice. It then gives us an indication of how popular a topic (and by extrapolation a technological class) is going to be in the near future and allows funding agencies, corporations and governments to adjust their funding strategies accordingly in areas which show a strong upward growth in the next few years. The results and further analysis of the prediction performance are discussed in Section 5.

4 EXPERIMENTAL SETUP

4.1 Dataset

For this study, we used the European Patent (EP) collection from the MAtrixware REsearch Collection (MAREC). MAREC is a static collection of patent applications and granted patents in a unified file format normalized from EP, WO, US, and JP sources, spanning a range from July 1976 to June 2008. The collection contains documents in several languages, the majority being English, German and French, and about half of the documents include full text. In MAREC, the documents from different countries and sources are normalized to a common XML format with a uniform patent numbering scheme and citation format. The standardized fields include dates, countries, languages, references, person names, and companies as well as rich subject classifications. It is a comparable corpus, where many documents are available in similar versions in other languages. In particular, we considered only English language patents of the EP sub-collection. We had to discard a few documents with one or more missing fields such as classification codes, patent citations, applicant country etc. The final dataset amounted to 567,547 documents. The citation network built out of this reduced dataset consisted of 646,537 citations. Admittedly, the citation network is very sparse, which conforms to the norm that patents are not as frequently cited as academic publications [6].

4.2 Preprocessing

The patent collection has mainly two types of documents: (1) Type A (A1, A2 ...): European patent application files. (2) Type B (B1, B2 ...): European patent specification files. Of these, we used A1 and B1 documents since they are the most informative ones and contain the textual content of the patent. Now, it is imperative to mention that in the collection, not all A1 documents have a corresponding B1 document. A1 documents contain the Abstract along with other bibliographic information including citations, while B1 documents contain bibliographic information with the Description and Claims of the patent. Given this, we had to confine our collection to only patents that had both A1 and B1 counterparts. In case any one of them was missing, we did not include the patent in our collection. This step reduced our initial collection of English-language patents from 837,715 to 567,547. After this step, we extracted all relevant bibliographic information (such as application date, grant date, classification codes, applicant name, applicant country etc.) including citations in a separate file.

We combined the title, abstract, description and claims for each patent into a single document which we refer to as ‘full-text’ of the patent in our paper. It should be noted that in this paper, we have
Manajit Chakraborty, Seyed Ali Bahrainian, and Fabio Crestani
used the terms full-text of patents and documents interchangeably. before computing the CORACO components for each case. The
Additional preprocessing steps include stopword removal (using an comparative performance of the four models using four different
extended stopword list of over 800 words combining NLTK stop- geographical regions (World, Japan, USA and Great Britain) are pre-
words and other open source libraries), expunging unintelligible sented in Tables 3 and 4. In Table 3, all the 500 topics are considered,
and non-alphabetic terms and tokenization. The cleaned documents while in Table 4, only topics which do not have ‘UC’ (unchanged)
were then segregated based on sectors of technology (A-H) they as their labels are considered. This is because, for any prediction
belong to and their countries and continents of origin (Table 1). The algorithm, it is difficult to accurately predict the last element in
continent-wise distribution of the patent documents is presented the series for the next time step. Even if the predicted and actual
in Table 1. The documents are also arranged by their year of patent values are close, by our accuracy metric, it would be considered
registration date. So, each time slice is considered as a year. Based as either one of ‘UP’ or ‘DOWN’. We can clearly observe that by
on this distribution, we chose to discard the documents from the eliminating ‘UC’ labeled topics, the performance of all models in-
years 1978, 1979, 2007 and 2008, since there are very few documents cluding baseline improves by a small margin. Also, the number of
in these years. Retaining these documents tends to skew the topic such topics, 38, is quite low (7.6% of the total number of topics).
distribution negatively. The total number of these discarded docu- From this table we can observe that in the best case, with Simple
ments amount to less than 1% of the whole collection. So, essentially Exponential Smoothing, we achieve 78.3% better performance in
our patent dataset consists of patents from the year 1980 through prediction when compared to all four baseline models. Among all
2006. the CORACOs, CORACOGB is the worst performer with Autore-
gression model and still it records a 39.9% improvement. In terms of
4.3 Tools CORACO components, the topic strength distribution of the world
For LDA, we used the open source tool Gensim [24], with default set- is shown to provide synergistic improvement to the prediction of
tings. The time-series models were employed from the statsmodels polarity of the citations. While, among the three countries United
package [25] built in Python. Other Python packages used include States of America, has the biggest influence in improving the pre-
matplotlib [13], SciPy [15], scikit-learn [23], pyconvert-country dictions even though it is not the largest country in terms of patent
etc.. For experiments, we used a Linux based server with Intel(R) production in our dataset. Comparing among time-series models,
Xeon(R) CPU @ 2.70GHz with 32 cores and 256 GB memory. Simple Exponential Smoothing seems to provide the biggest gain
when CORACO signals are provided as input but fails poorly for
5 RESULTS AND ANALYSIS the baseline input. The improvements achieved by the CORACO
models are statistically significant and hence the corresponding
Baseline. For the baseline, we consider the actual citation count
results have been marked with an asterisk in the tables.
distribution, C𝑘 , as input to the three time-series models. So, this
method basically tries to predict the change in citation counts based
on the historical data giving us four baseline models. We feed the Prediction Error Comparison
citation counts for each of the 500 topics as a vector for years 1980- While, our proposed models perform better than baselines, in terms
2005 as training data. The output from the time-series models are of citation growth performance, it is interesting to also compare
then compared with the true labels of change of citation counts the error in prediction of citation count values. In Table 5, we
for the last time slice (2006). The true labels indicate whether the present the Mean Absolute Error (MAE) of the baseline models
citation count has actually risen or fallen from previous time step. as well as our proposed models. It must be noted that since the
The labels are predetermined and can be one of the three types: CORACO components have a different scale compared to baseline
• UP: indicating that next time slice will receive more citations models, we applied min-max normalization on all prediction error
than the current time slice. vectors (for 500 topics). From the table, we can clearly observe
• DOWN : indicating that next time slice will receive less cita- that the error produced by our CORACO approaches are lower
tions than the current time slice. than that of baselines which operate on actual citation counts.
• UC: indicating that next time slice will receive as many cita- In general, CORACOworld gives the best performance similar to
tions as the current time slice. citation growth prediction. So, we can positively conclude that
The label ‘UC’ (unchanged) is a typical case and occurs only for 38 not only does our proposed models perform better with respect
topics. to citation growth prediction accuracy but even the citation count
prediction errors are lower than all three baseline models.
Metric. The metric for calculating the performance of CORACO
based prediction is the ratio of correctly predicted labels against all 6 CONCLUSIONS AND FUTURE WORK
true labels for 500 topics.
Patent analysis delivers comprehensive competitive intelligence
|Correctly Predicted labels for ΔC𝑘𝑛+1 |model
Accuracymodel = about innovators, relevant technologies, and help estimate the
|True label for ΔC𝑘𝑛+1 |
value of patents owned by competitor companies and governments.
The citation counts are integers, while the topic strengths are de-
Patent Citations and Topic Models have been employed separately
scribed by decimal numbers, and also their scales are different.
in existing literature towards forecasting of technological growth.
So, we had to perform min-max normalization on citation counts
In this paper, we proposed a novel approach that leveraged both
https://www.nltk.org/ patent citation counts and topic importance with geographical
https://pypi.org/project/pycountry-convert/ relevance to improve the prediction of patent citation growth in
Forecasting Patent Growth By Combining Time-Series Signals Using Covariance Patterns
Input
Baselines CORACOworld CORACOUS CORACOJP CORACOGB
Model
AR 0.108 0.452∗ 0.446∗ 0.454∗ 0.428
MA 0.086 0.396∗ 0.374∗ 0.390∗ 0.386∗
SES 0.048 0.470∗ 0.416∗ 0.438∗ 0.434∗
Table 3: Accuracy comparison of models with all labels. Best performances are marked in bold. Statistically significant results
are marked with an asterisk (*)
        Input
Model   Baselines   CORACO_world   CORACO_JP   CORACO_US   CORACO_GB
AR      0.139       0.492*         0.476*      0.499*      0.463*
MA      0.193       0.428*         0.414*      0.402*      0.417
SES     0.148       0.605*         0.550*      0.574*      0.570*
Table 4: Accuracy comparison of models without UC labels. Best performances are marked in bold. Statistically significant results are marked with an asterisk (*).
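The SES rows in the tables presumably refer to simple exponential smoothing. As a reminder of what such a forecaster does, here is a minimal plain-Python sketch of the standard one-step-ahead recursion. It is not the authors' implementation: the smoothing factor `alpha`, the initialisation with the first observation, and the example series are all assumptions made for illustration.

```python
from typing import Sequence


def ses_forecast(series: Sequence[float], alpha: float = 0.5) -> float:
    """One-step-ahead forecast by simple exponential smoothing.

    The level is initialised with the first observation and updated as
    level = alpha * x_t + (1 - alpha) * level; the final level is the forecast.
    """
    if not series:
        raise ValueError("series must not be empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    level = float(series[0])
    for x in series[1:]:
        level = alpha * x + (1.0 - alpha) * level
    return level


# Hypothetical yearly signal for one topic (for example, normalized citation counts).
history = [1.0, 2.0, 3.0]
next_value = ses_forecast(history, alpha=0.5)
```

With `alpha = 1.0` the forecast reduces to the last observed value, while smaller values give more weight to older observations.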
        Input
Model   Baselines   CORACO_world   CORACO_JP   CORACO_US   CORACO_GB
AR      0.0371      0.0315         0.0321      0.0329      0.0324
MA      0.0416      0.0363         0.0358      0.0359      0.0357
SES     0.0383      0.0320         0.0330      0.0321      0.0334
Table 5: Prediction Error (MAE) comparison of models. Best performances are marked in bold.
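The two quantities behind this table, min-max normalization of citation counts and the mean absolute error (MAE), can be sketched as follows. This is an illustrative sketch rather than the authors' code. Whether the reported MAE is computed on normalized counts is an assumption, and the example numbers are invented for the illustration.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no scale information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Invented citation counts for one topic over three years.
counts = [2, 4, 6]
normalized = min_max_normalize(counts)

# Invented predictions on the normalized scale.
predictions = [0.0, 1.0, 1.0]
mae = mean_absolute_error(normalized, predictions)
```

Normalizing first puts integer citation counts on the same [0, 1] scale as decimal-valued topic strengths, which is the motivation the text gives for this step.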
the next year. To this end, we proposed a covariance-based correlated time-series method that maximizes the similarity of two distributions. For prediction, we employed three time-series models and compared our approach against three baseline models, also providing a comparative overview of the geographical region’s influence on the prediction. Our results substantiate our hypothesis that the correlated time-series model modifies the signal in a way that outperforms all baseline models using the original time-series vectors.

As part of our future work, we would like to study the impact of our proposed approach on other complex time-series models such as LSTM networks [11]. We will investigate ways to extend our model to higher dimensions so that we can find representations of multiple signals. We would also like to employ dynamic topic models for topic elicitation, such as the one proposed by Bahrainian et al. [4], to account for topic evolution over time. Lastly, we will apply our model to time-series problems beyond patent analysis.

ACKNOWLEDGEMENTS
We thank the anonymous reviewers for their valuable comments. This work was partially supported by The Global Structure for Knowledge Networks project grant under the SNSF National Research Programme 75 “Big Data” (NRP 75).

REFERENCES
[1] Zoltan J Acs, Luc Anselin, and Attila Varga. 2002. Patents and innovation counts as measures of regional production of new knowledge. Research Policy 31, 7 (2002), 1069–1085.
[2] M.B. Albert, D. Avery, F. Narin, and P. McAllister. 1991. Direct validation of citation counts as indicators of industrially important patents. Research Policy 20, 3 (1991), 251–259.
[3] Leonidas Aristodemou and Frank Tietze. 2018. Citations as a measure of technological impact: A review of forward citation-based measures. World Patent Information 53 (2018), 39–44.
[4] Seyed Ali Bahrainian, Ida Mele, and Fabio Crestani. 2018. Predicting Topics in Scholarly Papers. In Advances in Information Retrieval - 40th European Conference on IR Research, ECIR 2018, Grenoble, France, March 26-29, 2018, Proceedings. 16–28.
[5] D.M. Blei, A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 4-5 (2003), 993–1022.
[6] Stefano Breschi, G Tarasconi, C Catalini, L Novella, P Guatta, and H Johnson. 2006. Highly Cited Patents, Highly Cited Publications, and Research Networks. CESPRI-Bocconi University, http://ec.europa.eu/invest-in-research/pdf/download_en/final_report_hcp.pdf (Accessed: 01/07/2019) (2006).
[7] Richard S. Campbell. 1983. Patent trends as a technological forecasting tool. World Patent Information 5, 3 (1983), 137–143.
[8] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science 41, 6 (1990), 391–407.
[9] Holger Ernst. 2001. Patent applications and subsequent changes of performance: evidence from time-series cross-section analyses on the firm level. Research Policy 30, 1 (2001), 143–157.
[10] Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1 (2004), 5228–5235.
[11] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[12] Zheng Hu and Xin-Yan Deng. 2014. Aerodynamic interaction between forewing and hindwing of a hovering dragonfly. Acta Mechanica Sinica 30, 6 (01 Dec 2014), 787–799.
[13] J. D. Hunter. 2007. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 3 (2007), 90–95.
[14] Adam B Jaffe, Manuel Trajtenberg, and Michael S Fogarty. 2000. The meaning of patent citations: Report on the NBER/Case-Western Reserve survey of patentees. Technical Report. National Bureau of Economic Research.
[15] Eric Jones, Travis Oliphant, Pearu Peterson, et al. 2001–. SciPy: Open source scientific tools for Python. http://www.scipy.org/ [Online; accessed 01/07/2019].
[16] Junegak Joung and Kwangsoo Kim. 2017. Monitoring emerging technologies for technology planning using technical keyword based analysis from patent data. Technological Forecasting and Social Change 114 (2017), 281–292.
[17] M.M.S. Karki. 1997. Patent citation analysis: A policy analysis tool. World Patent Information 19, 4 (1997), 269–272.
[18] Gabjo Kim, Sangsung Park, and Dongsik Jang. 2014. Technology Analysis from Patent Data Using Latent Dirichlet Allocation. In Soft Computing in Big Data Processing, Keon Myung Lee, Seung-Jong Park, and Jee-Hyong Lee (Eds.). Springer International Publishing, Cham, 71–80.
[19] Mujin Kim, Youngjin Park, and Janghyeok Yoon. 2016. Generating patent development maps for technology monitoring using semantic patent-topic analysis. Computers & Industrial Engineering 98 (2016), 289–299.
[20] Doug Lichtman and Mark A Lemley. 2007. Rethinking Patent Law’s Presumption of Validity. Stan. L. Rev. 60 (2007), 45.
[21] Mary Ellen Mogee. 1991. Using Patent Data for Technology Analysis and Planning. Research-Technology Management 34, 4 (1991), 43–49.
[22] Eleonora Pantano, Constantinos-Vasilios Priporas, Stefano Sorace, and Gianpaolo Iazzolino. 2017. Does innovation-orientation lead to retail industry growth? Empirical evidence from patent analysis. Journal of Retailing and Consumer Services 34 (2017), 88–94.
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[24] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. ELRA, Valletta, Malta, 45–50.
[25] Skipper Seabold and Josef Perktold. 2010. Statsmodels: Econometric and statistical modeling with python. In 9th Python in Science Conference.
[26] Patrick Thomas. 2001. A relationship between technology indicators and stock market performance. Scientometrics 51, 1 (2001), 319–333.
[27] Subhashini Venugopalan and Varun Rai. 2015. Topic based classification and pattern identification in patents. Technological Forecasting and Social Change 94 (2015), 236–250.
[28] Bo Wang, Shengbo Liu, Kun Ding, Zeyuan Liu, and Jing Xu. 2014. Identifying technological topics and institution-topic distribution probability for patent competitive intelligence analysis: a case study in LTE technology. Scientometrics 101, 1 (01 Oct 2014), 685–704.
[29] Chong Wang and David M. Blei. 2011. Collaborative Topic Modeling for Recommending Scientific Articles. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Diego, California, USA) (KDD ’11). ACM, New York, NY, USA, 448–456.
[30] Yi Zhang, Yue Qian, Ying Huang, Ying Guo, Guangquan Zhang, and Jie Lu. 2017. An entropy-based indicator system for measuring the potential of patents in technological innovation: rejecting moderation. Scientometrics 111, 3 (01 Jun 2017), 1925–1946.