=Paper=
{{Paper
|id=Vol-3745/paper2
|storemode=property
|title=Automated Identification of Emerging Technologies: Open Data Approach
|pdfUrl=https://ceur-ws.org/Vol-3745/paper2.pdf
|volume=Vol-3745
|authors=Ljiljana Dolamic,Julian Jang-Jaccard,Alain Mermoud,Vincent Lenders
|dblpUrl=https://dblp.org/rec/conf/eeke/DolamicJML24
}}
==Automated Identification of Emerging Technologies: Open Data Approach==
Ljiljana Dolamic<sup>1,†</sup>, Julian Jang-Jaccard<sup>1,*,†</sup>, Alain Mermoud<sup>1,†</sup> and Vincent Lenders<sup>1,†</sup>

<sup>1</sup> Cyber-Defence Campus, armasuisse Science and Technology, Thun, Switzerland

Joint Workshop of the 5th Extraction and Evaluation of Knowledge Entities from Scientific Documents and the 4th AI + Informetrics (EEKE-AII2024), April 23-24, 2024, Changchun, China and Online

<sup>*</sup> Corresponding author. <sup>†</sup> These authors contributed equally.

Email: ljiljana.dolamic@ar.admin.ch (L. Dolamic); julian.jang-jaccard@ar.admin.ch (J. Jang-Jaccard); alain.mermoud@ar.admin.ch (A. Mermoud); vincent.lenders@ar.admin.ch (V. Lenders)

ORCID: 0000-0002-0656-5315 (L. Dolamic); 0000-0002-1002-057X (J. Jang-Jaccard); 0000-0001-6471-772X (A. Mermoud); 0000-0002-2289-3722 (V. Lenders)

© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

'''Abstract'''

Identifying emerging technologies and forecasting their trends is pivotal for stakeholders and decision-makers across academia, industry, and government agencies. Current strategies for tracking technology trends often rely on proprietary closed datasets and on the insights of human domain experts. These approaches are expensive, manual, and time-consuming. In this study, we introduce an automated method for identifying emerging trends through a quantitative approach that utilizes extensive publicly available data, including patents, publications, and Wikipedia Pageview statistics. Our method proposes four criteria (novelty, growth, impact, and coherence) to automatically score technologies, based on a mathematical foundation. This approach enables the monitoring of tech trends across various sectors in an automated manner, without the need for domain experts. The results obtained through rigorous evaluation, benchmarked against similar reports from leading market research firms, illustrate a low recall rate paired with high precision, affirming the reliability of our proposed method. Furthermore, our method identifies emerging technologies not present in similar market reports, highlighting its unique capabilities.

'''Keywords'''

technology monitoring, emerging technologies, attributes of emergence, scientometrics, open source data, machine learning, informetrics, natural language processing

===1. Introduction===

Understanding emerging technologies is crucial for various entities, including industry, academia, and government agencies. It can shape strategic decisions, improve competitive positions, and create opportunities for technology strategies. Owing to these considerations, there is a substantial need for identifying emerging technologies, prompting widespread media coverage on the topic and leading market research firms like Gartner and Forrester to offer services promising deeper insights.

Despite the common and widespread use of the term 'emerging technologies,' there is no single standard agreement on what constitutes the term. This lack of a clear definition makes it challenging to develop a scientifically sound methodology to identify emerging technologies. Gartner's renowned Hype Cycle for Emerging Technologies, while intuitive, cannot serve as an underlying model and has been criticized in the literature as unscientific, inconsistent, generic, and subjective [1]. Other market research firms, such as Forrester and IHS Markit, also produce annual reports on emerging technologies, yet the methodology for identifying these technologies remains unclear.

Research in the area of identifying emerging technologies primarily relies on qualitative methods, expert systems, and survey-based approaches. For quantitative methods, researchers have utilized open datasets and S-curve models to identify technology emergence [2, 3, 4, 5]. S-curve models, based on logistic or Gompertz growth concepts, provide a solid mathematical foundation. However, most studies focus on specific predetermined sets of technologies, making it challenging to devise a general method for identifying emerging technologies [6].

In this paper, we introduce a novel approach for identifying emerging technologies based on their coverage in publicly available data sources, including patents, publications, and Wikipedia Pageview statistics. Unlike previous studies, we have not preselected any specific set of technologies. Our method is transparent, does not require expert input, and gives reproducible results for any technology.

The remainder of this paper is organized as follows: Section 2 provides a survey of existing research. In Section 3, we offer a description of the data used. Section 4 outlines the proposed methodology. We present the evaluation results in Section 5. The limitations of our proposed method are discussed in Section 6. Finally, Section 7 concludes the paper with future work.

===2. Related Work===

Definitions for the term 'emerging technologies' in the literature often overlap but are based on distinct characteristics. For example, some authors (e.g., [7, 8, 9, 10, 11]) emphasize the potential impact of the technology on the economy or society, covering both evolutionary change and disruptive innovations. Others, like Boon [12], prioritize uncertainty about a technology's future evolution. Some researchers combine both potential and uncertainty aspects [13, 14], while others underline novelty and growth [15].

The myriad of characteristics chosen to define emerging technologies has given rise to diverse scientometric approaches for measurement [16, 17], none of which rests on a standardized definition of the underlying concept of emergence. A comprehensive analysis by Rotolo, Hicks, and Martin [18] explores existing research on the definition of emerging technologies, aggregating comparable approaches. They identify five main characteristics that commonly appear across the studied research: radical novelty, relatively fast growth, coherence, prominent impact, and uncertainty. We adopt this definition as a foundational framework for our study.

Predicting emerging technologies often relies on publicly available datasets, commonly leveraging patents such as those from the United States Patent and Trademark Office (USPTO), Global Patent Index (GPI), and Thomson Innovation. Numerous publications advocate for the use of bibliometric methods to extract data and identify emerging technologies, followed by deploying growth models for prediction. In the work of Daim et al. [19], bibliometric methods, US patent analysis, and S-curves were employed for forecasting technologies such as fuel cells, food safety, and optical storage. Similarly, Ranaei et al. [3] used expert interviews to fit data acquired by text-mining patents into growth curve models for predicting hybrid cars and fuel cells. Text-mining on patents and fitting to S-curves were also proposed in [20], and Bengisu et al. [21] found correlations between patent and publication data extracted by scientometric methods for 20 technologies, deploying S-curves for forecasting. S-curve models for predicting emerging technologies were also proposed by [2, 22].

In recent times, artificial intelligence has regained significant attention, leading to the use of machine learning to model and predict emerging technologies. Kyebambe and Hwang [23, 24] employed supervised learning on citation graphs from USPTO data to automatically label and forecast emerging technologies. Similarly, Zhou [25] applied supervised deep learning on worldwide patent data, with training sets labeled based on Gartner's Hype Cycle.

===3. Data===

We primarily use three different datasets: patent data from USPTO, publication data from arXiv, and statistical data from Wikipedia Pageviews.

'''Patents from PatentsView''' (https://patentsview.org/): Patent information provides valuable insights into the latest innovations, trends, and competitive landscapes within various industries. We utilize PatentsView to acquire patent information from the USPTO for granted patents since 1976. As of December 5, 2023, there are over 8 million records of granted patents available for free download for further analysis. Figure 1 provides a glimpse of the top 200 locations worldwide for patents granted by the USPTO since 2013. We utilize a subset of around 6.6 million patent records for our study.

Figure 1: Top 200 locations by patent count for granted patents during 2013 - 2023 (Source from [26])

'''Publications from arXiv''' (https://arxiv.org/): We employ arXiv as a primary publication source, taking advantage of its free distribution model for open-access scholarly articles. The repository hosts over 2.4 million publications spanning computer science and diverse scientific disciplines since 1993. Figure 2 displays the number of submissions to arXiv since August 1991. Our study focuses on a subset of approximately 1.4 million arXiv publications.

Figure 2: Number of arXiv submissions since 1991 (Source from [27])

'''Wikipedia Pageview Statistics''' (https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics): In addition, we incorporate Wikipedia Pageview statistics, which indicate the number of visitors to a Wikipedia article within a specified time frame. This offers insight into real-time public interest and engagement, serving as a dynamic and accessible indicator of emerging trends and technologies. Figure 3 illustrates an example of monthly pageview statistics for the keyword 'deep learning'. Leveraging the Wikipedia API, we retrieved the monthly views for 50,954 technology-related articles.

Figure 3: Number of Pageviews of the topic 'deep learning' during Jan 2023 - Jan 2024 (Source from [28])

===4. Methodology===

In this section, we outline our methodology, and Figure 4 offers a comprehensive overview of the entire process.

Figure 4: Overview of the Proposed Methodology

The proposed method starts by classifying each Wikipedia article as either technology-related or not, using a binary classification approach that we term technology classification.

Once this classification is established, we extract abstracts from USPTO patents and scholarly arXiv publications. These abstracts are annotated using the DBpedia tool (https://www.dbpedia.org/), which aligns the text with Wikipedia articles. This annotation process aims to link the abstract content to relevant Wikipedia entries. To reduce noise, we eliminate annotations occurring fewer than 5 times and those not aligned with the technology classification.

The resulting filtered annotations, all within the technology classification, serve as the basis for constructing time series. The count of mentions for each technology <math>t \in T</math> per year is summed across each data source <math>d \in D</math>, reflecting the increasing occurrences of patents and publications over time. Mathematically, this can be represented as:

:<math>\text{TotalCount}(t) = \sum_{d \in D} \text{count}(t, d)</math>

where <math>\text{count}(t, d)</math> is the count of mentions for technology <math>t</math> in data source <math>d</math>. We then compute relative counts in relation to the total number of technology mentions per year, represented as:

:<math>\text{RelativeCount}(t) = \frac{\text{TotalCount}(t)}{\text{Total technology mentions per year}}</math>

Furthermore, monthly Wikipedia Pageviews are obtained for all technologies and transformed into time series. These time series, along with Wikipedia categories, contribute to the computation of four scores (Novelty, Growth, Impact, and Coherence), each derived from the definitions provided by [18]. Finally, we aggregate and normalize these four scores to generate an emergence score for each technology.

====4.1. Technology Classification====

The output of annotated abstracts from patents and publications contains noise, as each annotation refers to a Wikipedia article that is not necessarily related to technology.

To address this issue, we devised a two-step methodology named 'technology classification,' which selects the relevant technology articles from Wikipedia.

'''Step 1: Cleaning and Selecting Relevant Categories'''

Each Wikipedia article is linked to categories, forming a complex graph with parent-child relationships. The edges between categories are loosely defined as "is related to," often connecting different Wikipedia articles from non-technology areas. This loose linkage limits the reliability of extracting only technology articles using these graph-based relationships.

To address this, we first clean up the directed categories graph by removing hidden categories, admin and user pages. Furthermore, we apply regular expression filters to eliminate categories not related to technologies, such as companies, people names, brands, currencies, and countries.

Additionally, we utilize Wikipedia's Main Topic Classifications (MTC), encompassing categories like Technology, Business, Arts, Health, etc. Subsequently, we calculate the shortest path from each category in the filtered graph to each of the 28 MTC and retain the articles with the smallest distance to Technology, Science, or Engineering concepts. This resulted in 7,876 technology classification candidates, still containing some categories that may not belong to technology. By having a human domain expert manually go through the 7,876 technology classification candidates, we ultimately create a list of 1,356 technology categories.

Succinctly, this process can be written as the pseudocode in Algorithm 1.

Algorithm 1: Cleaning and Selecting Relevant Categories
<pre>
procedure CleanUpDirectedGraph
    Remove hidden categories, admin and user pages from the directed categories graph
    Apply regular expression filters to eliminate irrelevant categories
        (e.g., companies, people names, brands, currencies, and countries)
end procedure
procedure UtilizeMainTopicClassifications
    Use Main Topic Classifications (MTC) encompassing categories like
        Technology, Business, Arts, Health, etc.
    Calculate the shortest path for each category in the filtered graph to MTC
end procedure
procedure FilterByDistanceToMTC
    Retain articles with the smallest distance to Technology, Science,
        or Engineering concepts within MTC
end procedure
</pre>

'''Step 2: Technology Classification using SVM'''

The overall process of machine learning-based training to obtain the final technology classification is detailed in Algorithm 2.

To create an input dataset for the Support Vector Machine (SVM), which serves as our classifier, we extract abstracts from Wikipedia articles identified within the technology categories established in Step 1. The abstracts from all Wikipedia pages directly linked to a technology category are concatenated, stemmed, and then subjected to TF-IDF-based weighting. This process generates a weighted bag-of-words for each technology category. Subsequently, feature reduction is applied to form usable feature vectors. Optimal results were observed using mutual information-based feature reduction, targeting a vector length of 1000. Distances to each MTC topic are appended to this vector, producing the final feature vectors used as input.

To address the imbalance in class distribution caused by our small training set of 1,356 positive samples, we employ oversampling, using Borderline-SMOTE [29], to increase the size of the input samples. The list of technologies identified through SVM training is considered the final list pertaining to technology.

This final list is subsequently used to filter annotations from patents and publications.

Algorithm 2: Technology Classification using SVM
<pre>
procedure CreateDataset
    Extract abstracts from Wikipedia articles in identified technology categories
    Concatenate and stem abstracts, apply TF-IDF-based weighting
    Perform feature reduction for usable feature vectors
    Append distances to each MTC topic to create final feature vectors
end procedure
procedure HandleClassImbalance
    Employ Borderline-SMOTE for oversampling
end procedure
procedure FinalizeTechnologyList
    Use SVM training outcome as the final list of technologies
end procedure
</pre>

====4.2. Emergence Score====

'''Novelty Score:''' Novelty in emerging technologies signifies their distinctive newness, pioneering concepts, breakthrough advancements, and creative problem-solving, distinguishing them from existing solutions and suggesting transformative potential [15, 18].

In our study, we define novelty for a technology based on increased mentions in recent years. For instance, if a particular technology has a significant portion of references occurring in the last few years, it receives a high novelty score. To implement this, we considered the time span of the last 10 years and calculated the percentage of annotations for each year. Weights decreasing linearly from 10 (most recent year) to 1 (oldest year) were assigned, thereby giving higher weight to more recent years. Technologies for which the majority of annotations occurred more than 10 years ago are considered not to meet the novelty criterion and are consequently discarded.

To express this more mathematically, we first define the yearly time series <math>X_{t,d}</math> using Eq. 1:

:<math>X_{t,d} = \{ X_{t,d,y} : y \in Y \} \qquad (1)</math>

where:
* <math>X_{t,d,y}</math> is the number of times technology <math>t</math> is referenced in dataset <math>d</math> during year <math>y</math>.
* <math>y \in Y</math> denotes the year within the specified range.

Thus, the total number of occurrences of a technology <math>t \in T</math> in a dataset <math>d \in D</math> over all years <math>y \in Y</math> is given by Eq. 2:

:<math>\text{Total}(t,d) = \sum_{y \in Y} X_{t,d,y} \qquad (2)</math>

where:
* <math>\text{Total}(t,d)</math> denotes the total count of mentions or occurrences of technology <math>t</math> in dataset <math>d</math>.
* <math>\sum_{y \in Y}</math> signifies the summation over all years <math>y</math> within the specified range <math>Y</math>.

The novelty score <math>\text{Novelty}(t)</math> of a technology <math>t \in T</math> is then expressed as Eq. 3:

:<math>\text{Novelty}(t) = \sum_{d \in D} \sum_{y \in Y} \left( \frac{X_{t,d,y}}{\text{Total}(t,d)} \right) \times 100 \times w_y \qquad (3)</math>

where:
* <math>\text{Novelty}(t)</math> represents the novelty score for technology <math>t</math>.
* <math>w_y</math> is a weight assigned to each year based on Eq. 4.
* <math>\sum_{d \in D} \sum_{y \in Y}</math> denotes double summation over all datasets <math>D</math> and years <math>Y</math>.

The weight for each year is computed from its relative position within the given range. It increases linearly with the year's distance from the earliest year, providing a higher weight to more recent years, as in Eq. 4:

:<math>w_y = y + 1 - \min_{y' \in Y} y' \qquad (4)</math>

where:
* <math>y</math> denotes the specific year for which the weight is calculated.
* <math>\min_{y' \in Y} y'</math> signifies the minimum value among all years in the defined range <math>Y</math>.

'''Growth Score:''' Emerging technologies exhibit relatively fast growth rates compared to non-emerging technologies [18]. The growth rate of a technology, assessed through growth curves in patents and publications, has been studied extensively [30, 31, 32]. Using the concept of growth curves, we employ a two-step approach to compute the growth score of a technology.

In Step 1, we apply regression techniques to fit the number of yearly technology mentions to four different curve models: Linear, Quadratic, Gaussian, and Exponential. (We utilize Apache Commons SimpleRegression and OLSMultipleLinearRegression for the linear and quadratic models. The same regression tools are used with the logarithm of the data points to derive the exponential and Gaussian models, respectively.) We select the model with the highest R-squared (<math>R^2</math>) measure [33] and compute the slope of the curve based on the regression coefficients. We assume that the sign of the slope determines whether the trend is increasing or decreasing. Subsequently, based on the best-fitting model and the slope, we assign the technology to one of the classes defined in Table 1 to compute the model_score.

{| class="wikitable"
|+ Table 1: Curve models and growth scores
! Curve model !! model_score
|-
| Exponential increase/decrease || +/- 1.00
|-
| Quadratic increase/decrease || +/- 0.75
|-
| Gaussian increase/decrease || +/- 0.05
|-
| Linear increase/decrease || +/- 0.25
|-
| Nothing fits || 0.00
|}

In Step 2, the slope of the technology growth curve <math>\text{Slope}(t,d)</math> is calculated by taking the difference between the absolute counts of the last and the first year and dividing it by the number of years elapsed, as depicted in Eq. 5. This equation quantifies the rate of change in technology mentions over time for a specific technology <math>t</math> within a dataset <math>d</math>.

:<math>\text{Slope}(t,d) = \frac{\text{Count}(t,d,Y_{\text{final}}) - \text{Count}(t,d,Y_{\text{begin}})}{Y_{\text{final}} - Y_{\text{begin}}} \qquad (5)</math>

where:
* <math>Y_{\text{final}}</math> represents the final year for which the counts are considered.
* <math>Y_{\text{begin}}</math> represents the initial year for which the counts are considered.
* <math>\text{Count}(t,d,Y_{\text{final}})</math> denotes the absolute count of mentions of the technology <math>t</math> in the dataset <math>d</math> during the final year.
* <math>\text{Count}(t,d,Y_{\text{begin}})</math> denotes the absolute count of mentions of the technology <math>t</math> in the dataset <math>d</math> during the initial year.

Subsequently, all calculated slope values are normalized to the range [0.0;1.0] using Eq. 6, where <math>\text{Norm\_slope}(t,d)</math> represents the normalized slope.

:<math>\text{Norm\_slope}(t,d) = \frac{\text{Slope}(t,d) - \min(\text{Slope}(T,d))}{\max(\text{Slope}(T,d)) - \min(\text{Slope}(T,d))} \qquad (6)</math>

where:
* <math>\text{Slope}(t,d)</math> denotes the slope of the growth curve for technology <math>t</math> in dataset <math>d</math>.
* <math>\min(\text{Slope}(T,d))</math> represents the minimum slope value among all technologies in dataset <math>d</math>.
* <math>\max(\text{Slope}(T,d))</math> represents the maximum slope value among all technologies in dataset <math>d</math>.

This normalization process facilitates comparative analysis across different technologies and datasets.

The technology's final growth score is then computed by combining the model score, which is determined by the best-fitting growth curve model, and the slope score, which reflects the rate of change in the technology's mentions over time, using Eq. 7.

:<math>\text{Growth}(t) = \sum_{d \in D} \left( \text{Model\_score}(t,d) + \text{Norm\_slope}(t,d) \right) \qquad (7)</math>

where:
* <math>\text{Model\_score}(t,d)</math> denotes the model_score for the specified technology <math>t</math> in the given dataset <math>d</math>.
* <math>\text{Norm\_slope}(t,d)</math> denotes the normalized slope for the specified technology <math>t</math> in the given dataset <math>d</math>.
* <math>\sum_{d \in D}</math> indicates the summation across all datasets <math>D</math> for the specified technology.

'''Impact Score:''' Wikipedia Pageviews represent the number of times a particular article has been accessed on the Wikipedia website, providing insights into the level of public interest and engagement with specific topics or content. We leverage Wikipedia Pageview statistics to compute the impact score of a technology. We use monthly views to gather more data points. After extracting the monthly views, denoted as <math>w</math>, we apply a moving average filter to smooth the time series. This filter averages each data point with the two preceding and two succeeding months (a five-month window), reducing noise and revealing underlying trends; see Eq. 8.

:<math>MA_i = \frac{w_{i-2} + w_{i-1} + w_i + w_{i+1} + w_{i+2}}{5} \qquad (8)</math>

The smoothed data <math>MA_i</math> then replaces <math>d</math> in the two-step approach used for the growth score. We classify the trends into the same five classes (as seen in Table 1).

:<math>\text{Impact}(t) = \text{Model\_score}(t, MA_i) + \text{Norm\_slope}(t, MA_i) \qquad (9)</math>

Eq. 9 represents the calculation of the impact score <math>\text{Impact}(t)</math> for a technology <math>t</math>. It combines the model score <math>\text{Model\_score}(t, MA_i)</math> and the normalized slope score <math>\text{Norm\_slope}(t, MA_i)</math> obtained from the moving average <math>MA_i</math> of Wikipedia Pageviews. This score reflects both the growth pattern and the temporal trends in Wikipedia Pageviews, providing a comprehensive assessment of the technology's impact.

'''Coherence Score:''' In our study, we consider coherence as the persistence of a technology over time, as referred to by [18]. When identifying emerging technologies, we assume that the presence of a category on Wikipedia signifies a thematic grouping that brings together related technological concepts. The coherence within such categories is established through shared characteristics, applications, and underlying principles of the technologies they encompass. This alignment allows for consistent trends to emerge within the category over time, reflecting the collective evolution of technologies. Wikipedia categorization serves as a valuable indicator of how various technologies within a category develop in tandem, providing insights into the overarching trends and advancements in related technological domains.

To compute the coherence score, we begin by collecting all unique categories from Wikipedia, forming what we refer to as the 'Category Set.' Subsequently, we perform a mapping process, converting plural category names to their singular counterparts, and then matching them with articles sharing identical names. The coherence score is then computed with Eq. 10:

:<math>\text{Coherence}(t) = \begin{cases} 0.5, & \text{if } t \in \text{Category Set} \\ 0, & \text{otherwise} \end{cases} \qquad (10)</math>

In other words, if the technology <math>t</math> is part of the Category Set, the coherence score is 0.5; otherwise, it is 0. This expression reflects the coherent presence of a technology within a specific thematic category.

'''Emergence Score:''' To calculate the emergence score, we take a weighted sum of the novelty, growth, impact, and coherence scores. We then normalize the result to the range [0.0;1.0], as shown in Eq. 11.

:<math>\text{Emergence}(t) = \text{Norm}\left[ n \cdot \text{Novelty}(t) + g \cdot \text{Growth}(t) + i \cdot \text{Impact}(t) + c \cdot \text{Coherence}(t) \right] \qquad (11)</math>

We introduce the control variables <math>n</math>, <math>g</math>, <math>i</math>, and <math>c</math> to empirically manage the impact of biases arising from data imbalance, aiming to achieve the highest precision.

'''Technology Class and Technology Class Score:''' Wikipedia contributors often create multiple articles that closely relate to one another, such as those on Machine Learning, Deep Learning, and Artificial Neural Networks. To establish connections between these closely related technologies, we employ Wikidata properties such as 'subclass of,' 'part of,' 'instance of,' or 'said to be the same as.' We refer to such a group of related technologies as a 'Technology Class.' The Technology Class score (TCs) is the maximum emergence score among the technologies in the set of related technologies, as shown in Eq. 12:

:<math>\text{TCs} = \max_{t \in EC} \text{Emergence}(t) \qquad (12)</math>

===5. Evaluation===

For patents, we gathered the abstracts of 6,647,699 patents from PatentsView. From this dataset, we derived 112,199 unique annotations, of which 77,995 had more than 5 occurrences. Similarly, for publications, we collected the abstracts of 1,425,558 research papers from arXiv. Within this dataset, we identified 111,627 unique annotations with technology classification, and among them, 65,162 articles had occurrences exceeding 5 times. Our proposed technology classification method identifies 50,954 technologies from the 4,996,310 Wikipedia articles we utilized in our study.

====5.1. Results====

In this section, we discuss the observations obtained after applying our proposed methodology to the public dataset discussed earlier.

'''Individual Scores:''' Table 2 displays the top 20 technologies with the highest novelty, growth, and impact scores. Notably, technologies related to Artificial Intelligence (AI)† appear among the top 20 across all scores, including Deep Learning and Convolutional Neural Network (CNN) for novelty, and Artificial Intelligence, Machine Learning, and Artificial Neural Network for impact; all except CNN correspond to categories in Wikipedia and are considered coherent.

In the top 20 novel technologies, alongside AI-related technologies, there are notable mentions of vehicle-related technologies such as Multirotor, Autonomous Car, and Vehicle-to-everything. Nanosheet closes the novelty list, being the only technology not related to either computer science or vehicle technology. Communication ranks first in the list of the top 20 technologies according to the growth score, with Communication-related technologies like Wireless and Data Transmission being other fast-growing terms. The list also includes older technologies that receive continuous or renewed interest, such as Lidar or Rechargeable Battery. Apart from vehicle-related technologies like Unmanned Aerial Vehicle and Autonomous Car, this list is completed by the Internet of Things and Quantum Computing.

{| class="wikitable"
|+ Table 2: Top 20 Technologies in Novelty, Growth, and Impact scores
! Novelty !! Growth !! Impact
|-
| Smart City || Communication || URL
|-
| Deep Learning† || Wireless || LED Lamp
|-
| POWER8 || Pixel || Machine Learning†
|-
| Vehicle To Everything || Web Server || Artificial Neural Network†
|-
| Data Science || Convolutional Neural Network† || Neural Coding
|-
| Knowledge Graph || Data Transmission || Robot Locomotion
|-
| Internet of Things || Mathematical Optimization || HTTP Cookie
|-
| Return-Oriented Programming || Stator || Blockchain
|-
| Smartwatch || Rechargeable Battery || Artificial Intelligence†
|-
| Multirotor || Radio-Frequency Identification || Computer Science
|-
| Ransomware || Unmanned Aerial Vehicle || Sustainable Energy
|-
| Row Hammer || Internet of Things || BNC Connector
|-
| Software-Defined Networking || Quantum Computing || Electron Backscatter Diffraction
|-
| Convolutional Neural Network† || Computer Data Storage || Slurry Pump
|-
| Virtual Reality Headset || Object Detection || Cryptocurrency
|-
| High Efficiency Video Coding || Lidar || Precision and Recall
|-
| Cyber-Physical System || Transfer Learning† || XLR Connector
|-
| Insider Threat || Unsupervised Learning† || Phishing
|-
| Autonomous Car || HVAC || QR Code
|-
| Nanosheet || Autonomous Car || PDF
|}

'''Overall Score:''' Table 3 presents the overall top 20 technologies after combining the individual scores.

Deep Learning emerges as the top technology in our methodology, with Convolutional Neural Network (CNN) also making the list as a sub-category of Deep Learning. As anticipated, Machine Learning is present, alongside the Internet of Things, both demonstrating coherence and ranking in the top 20 for impact and novelty, respectively. Cyberattack holds a high position, accompanied by various technologies related to Computer security, forming the second group in the result list. Key-Value Database, the simplest form of NoSQL databases, secures the seventh spot in the top 20 emerging technologies. Communication and Smartphone, technologies that have garnered attention for years, are also on the list. We observe the inclusion of technologies such as Autonomous Car, Knowledge Graph, and 5G in the top 20 scored technologies.

Our findings align well with similar observations made by Zhou et al. [34] and Daim et al. [35], returning four Convergence Emerging Technologies (CET) in the top five results, with the fifth (CNN) being a sub-class of Deep Learning.

{| class="wikitable"
|+ Table 3: Overall Top 20 Technologies
! Technology
|-
| Deep Learning†
|-
| Autonomous Car
|-
| Internet of Things
|-
| Convolutional Neural Network (CNN)†
|-
| Machine Learning†
|-
| Ransomware
|-
| Key-Value Database
|-
| Shard (Database Architecture)
|-
| Cyberattack
|-
| Knowledge Graph
|-
| Augmented Reality
|-
| Smartphone
|-
| Communication
|-
| Side-Channel Attack
|-
| Cloud Gaming
|-
| 5G
|-
| Data Science
|-
| Return Oriented Programming
|-
| Lidar
|-
| Push Technology
|}

Table 4 displays the top 20 technology classes identified from the top 100 technologies based on the emergence score. This method of presenting results enhances the visibility of other technologies, such as Virtual Assistant or Exoskeleton.

{| class="wikitable"
|+ Table 4: Overall Top 20 Technology Classes
! Technology Classes
|-
| Artificial Intelligence
|-
| Autonomous Driving
|-
| Internet of Things
|-
| Computer Security
|-
| Database
|-
| Knowledge Graph
|-
| Augmented, Virtual, Mixed Reality
|-
| Connectivity
|-
| Telecommunication
|-
| Cloud and Virtualization
|-
| Data Science
|-
| Optical Instrument
|-
| Virtual Assistant
|-
| Exoskeleton
|-
| Computer Vision
|-
| Satellite Imagery
|-
| Heterogeneous Computing
|-
| Distributed Computing
|-
| Medical Device
|-
| 3D Printing
|}

====5.2. Benchmarking====

To benchmark the compatibility of our proposed emergence scoring with other similar works, we compiled the union set of emerging technologies identified by leading technology analysts, including Gartner, Forrester, IHS Markit, and the World Economic Forum (WEF). Gartner predicted 35 technologies in its technology hype cycle, Forrester predicted 12, IHS Markit 8, and WEF 10 emerging technologies. Upon merging the overlapping technologies from these four lists, we derived a consolidated list of 36 unique technology classes, which we use as ground truth. Table 5 provides an overview of these classes.

Notably, the majority of technologies in this table appear to belong to the Computer Science-related domain, with 72% of them being linked to it. Technologies marked with '†' are those we were unable to directly map to a Wikipedia article or category. Additionally, articles judged as non-technologies by the SVM classifier are also marked in the table.

It is worth mentioning that Wikipedia articles on Augmented, Mixed, and Virtual Reality are collectively presented, following Forrester's proposal to consider them as a single technology class.

{| class="wikitable"
|+ Table 5: Evaluation Set: Technology classes based on Gartner, Forrester, IHS Markit and WEF
! Technology Classes
|-
| Tissue Engineering
|-
| Unmanned Aerial Vehicle
|-
| Smartdust
|-
| Artificial Intelligence
|-
| 4D Printing
|-
| Ontology (Information Science)
|-
| Neuromorphic Engineering
|-
| Exoskeleton
|-
| Edge Computing
|-
| Autonomous Driving
|-
| Self-Healing System Technology†
|-
| Volumetric Display
|-
| 5G
|-
| Quantum Computing
|-
| Platform as a Service
|-
| Application Specific Integrated Circuits
|-
| Autonomous Robot
|-
| Mobile Robot
|-
| Brain Computer Interface
|-
| Internet of Things
|-
| Biochip
|-
| Digital Twin
|-
| Nanotechnology
|-
| Virtual Assistant
|-
| Lithium-Silicon Battery
|-
| Blockchain
|-
| Augmented, Virtual, Mixed Reality
|-
| E-textiles
|-
| Cloud Computing
|}

Table 6 illustrates the performance metrics of Average Precision (AP) and Recall (R) for the top 20 technologies (T) and Technology Classes (TC) identified in the evaluation set.

In the 'base' run, all control variables in Eq. 11 are set to 1. Additionally, alongside the 'max_prec' parameter set, we present the average precision and recall of the Computer Science technology class (max_prec_cs). Within the top 20 technologies with the highest emergence score, only one non-technology result was observed. The average precision (AP) was 0.72 for the base run. However, all the relevant concepts from this subset relate to only 6 out of the 36 technologies mentioned before, resulting in a recall (R) of 0.16. By changing the control variables for the max_prec run, where non-Computer Science technology does not grow and have entries in Wikipedia articles, we were able to increase [...]

[...] our evaluation set but are present in our technology result set, ranked 4,897 and 12,421, respectively. To address this bias, we split the result set as well as the evaluation set into distinct domains (CS, Nanotechnology, Medicine, etc.). This approach allowed us to navigate around the bias. The third row (CS TC) of Table 6 provides the average precision and recall when only results related to the Computer Science field are considered, as this class is predominant in our result/evaluation sets. Although this approach results in only a 10% increase in average precision, the increase in recall rises to 30%.

===7. Conclusion===

This paper presents an automated method for identifying emerging technologies using publicly available data. Our approach is applicable across various technology sectors without the need for human domain experts, as it relies on a clear mathematical foundation.

We propose an emergence scoring system based on novelty, growth, impact, and coherence scores. Novelty and growth scores are computed from time series data of annotations applied to USPTO patents and arXiv publications. The impact score is derived from the Wikipedia Pageview time series, while the coherence score utilizes Wikipedia categories.

To assess the effectiveness of our proposed methods, we compiled an evaluation set of 36 emerging technologies by amalgamating lists from prominent market research firms like Gartner and Forrester Research.
The evaluation un- Computer Vision veiled a low recall (0.16) in identifying emerging technolo- Ubiquitous Video† gies. Natural Language Generation This research lays the groundwork for further investi- Switched Fabric gations, including the development of a methodology to Personalized Medicine determine the more fine-grained stages of emergence (e.g., Cell Encapsulation Gene drive pre-emergence, emergence, post-emergence) for a particular technology within different timeframes. Our study can be enhanced by incorporating the Ope- Table 6 nAlex concept 6 , which has gained more popularity com- Average Precision (AP) and Recall (R) of Technologies (T) and pared to the now-defunct DBpedia concepts. Additionally, Technology Classes (TC) we plan to employ more advanced deep learning models Parameters Classes AP R instead of the SVM model, as mentioned in [36, 37], specifi- base T 0.72 0.16 cally a combination of LSTM and Transformer [38, 39], to T 0.81 0.19 conduct more efficient time series analysis. This will be max_prec TC 0.72 0.28 performed using a larger publication dataset than arXiv, CS TC 0.79 0.36 such as the one available on OpenAlex 7 . Additionally, since max_prec_cs CS TC 0.90 0.36 our methodology still requires a certain degree of manual intervention, such as inspecting Wikipedia categories and adjusting bias variables, we want to explore techniques that both AP (0.81) and R (0.19). In this setting, the control can minimize these manual components to enhance scala- variables were chosen to facilitate the maximum precision bility and reduce potential subjectivity. (e.g., g, n, i, and c set to 1, 0.3, 0.1, and 0.3, respectively). Acknowledgments 6. Limitations We extend our thanks to the developers at Trivo Sys- A bias is evident when examining the results of identified tems—Pratiksha Jain, Himanshu Jain, and Marc Liechti—for emerging technologies toward Computer Science, as no- their work on the Technology Market Monitoring 1.0 project. 
ticed within the evaluation set, with 70% of technologies We appreciate their valuable contributions to shaping the within the top 100 results belonging to this domain. This initial stage of our study. We also extend our thanks to ar- bias complicates the exploration of trends in other domains. masuisse Science and Technology for supporting the study. Taking chemistry as an example, the International Union of Pure and Applied Chemistry (IUPAC) issued a list of emerg- ing technologies for this domain, containing, among others, 6 https://docs.openalex.org/api-entities/concepts 3D bioprinting or Flow chemistry, none of which figure in 7 https://openalex.org/ 31 References Measuring technological convergence in encryption technologies with proximity indices: A text min- [1] O. Dedehayir, M. Steinert, The hype cycle model: A re- ing and bibliometric analysis using openalex, arXiv view and future directions, Technological Forecasting preprint arXiv:2403.01601 (2024). and Social Change 108 (2016) 28–41. [18] D. Rotolo, D. Hicks, B. R. Martin, What is an emerging [2] G. Intepe, T. Koc, The use of s curves in technology technology?, Research policy 44 (2015) 1827–1843. forecasting and its application on 3d tv technology, [19] T. U. Daim, G. Rueda, H. Martin, P. Gerdsri, Forecast- International Journal of Industrial and Manufacturing ing emerging technologies: Use of bibliometrics and Engineering 6 (2012) 2491–2495. patent analysis, Technological forecasting and social [3] S. Ranaei, M. Karvonen, A. Suominen, T. Kässi, Fore- change 73 (2006) 981–1012. casting emerging technologies of low emission vehicle, [20] D. Kucharavy, E. Schenk, R. De Guio, Long-run fore- in: Proceedings of PICMET’14 Conference: Portland casting of emerging technologies with logistic models International Center for Management of Engineering and growth of knowledge, in: 19th CIRP design con- and Technology; Infrastructure and Service Integra- ference, 2009, p. 277. tion, IEEE, 2014, pp. 2924–2937. 
[21] M. Bengisu, R. Nekhili, Forecasting emerging technolo- [4] J. W. Z. Sossa, F. P. Marro, B. A. Alzate, F. M. V. Salazar, gies with the aid of science and technology databases, A. F. A. Patiño, S-curve analysis and technology life cy- Technological Forecasting and Social Change 73 (2006) cle. application in series of data of articles and patents, 835–844. Revista ESPACIOS| Vol. 37 (Nº 07) Año 2016 (2016). [22] M. Nieto, F. Lopéz, F. Cruz, Performance analysis of [5] S. Kar, A. K. Kar, M. P. Gupta, Understanding the s- technology using the s curve model: the case of digital curve of ambidextrous behavior in learning emerging signal processing (dsp) technologies, Technovation 18 digital technologies, IEEE Engineering Management (1998) 439–457. Review 49 (2021) 76–98. [23] M. N. Kyebambe, G. Cheng, Y. Huang, C. He, Z. Zhang, [6] R. Adner, R. Kapoor, Innovation ecosystems and Forecasting emerging technologies: A supervised the pace of substitution: Re-examining technology learning approach through patent analysis, Technolog- s-curves, Strategic management journal 37 (2016) 625– ical Forecasting and Social Change 125 (2017) 236–244. 648. [24] S.-Y. Hwang, D.-J. Shin, J.-J. Kim, Systematic review on [7] A. L. Porter, J. D. Roessner, X.-Y. Jin, N. C. Newman, identification and prediction of deep learning-based cy- Measuring national ‘emerging technology’capabilities, ber security technology and convergence fields, Sym- Science and Public Policy 29 (2002) 189–200. metry 14 (2022) 683. [8] B. R. Martin, Foresight in science and technology, [25] Y. Zhou, F. Dong, Z. Li, J. Du, Y. Liu, L. Zhang, Forecast- Technology analysis & strategic management 7 (1995) ing emerging technologies with deep learning and data 139–168. augmentation: convergence emerging technologies vs [9] N. Corrocher, F. Malerba, F. Montobbio, The emer- non-convergence emerging technologies (2017). gence of new technologies in the ICT field: main ac- [26] P. USPTO, Locations that drive innovation, 2023. 
URL: tors, geographical distribution and knowledge sources, https://datatool.patentsview.org/, accessed: December Technical Report, Department of Economics, Univer- 9, 2023. sity of Insubria, 2003. [27] arXiv, Monthly submissions, 2024. URL: https://arxiv. [10] M. Halaweh, Emerging technology: What is it, Journal org/stats/monthly_submissions, accessed: February 5, of technology management & innovation 8 (2013) 108– 2024. 115. [28] P. Analysis, Comparison of pageviews across multi- [11] S.-C. Hung, Y.-Y. Chu, Stimulating new industries ple pages, 2023. URL: https://pageviews.wmcloud.org/, from emerging technologies: challenges for the public accessed: February 12, 2024. sector, Technovation 26 (2006) 104–110. [29] H. Han, W.-Y. Wang, B.-H. Mao, Borderline-smote: a [12] W. Boon, E. Moors, Exploring emerging technologies new over-sampling method in imbalanced data sets using metaphors–a study of orphan drugs and phar- learning, in: International conference on intelligent macogenomics, Social science & medicine 66 (2008) computing, Springer, 2005, pp. 878–887. 1915–1927. [30] B. Andersen, The hunt for s-shaped growth paths in [13] S. Cozzens, S. Gatchair, J. Kang, K.-S. Kim, H. J. Lee, technological innovation: a patent study, Journal of G. Ordóñez, A. Porter, Emerging technologies: quan- evolutionary economics 9 (1999) 487–526. titative identification and measurement, Technology [31] M. Meyer, Patent citation analysis in a novel field Analysis & Strategic Management 22 (2010) 361–376. of technology: An exploration of nano-science and [14] B. C. Stahl, What does the future hold? a critical view nano-technology, Scientometrics 51 (2001) 163–183. of emerging information and communication technolo- [32] G. S. Day, P. J. Schoemaker, Avoiding the pitfalls of gies and their social consequences, in: Researching the emerging technologies, California management re- Future in Information Systems: IFIP WG 8.2 Working view 42 (2000) 8–33. 
Conference, Turku, Finland, June 6-8, 2011. Proceed- [33] D. S. Moore, Introduction to the Practice of Statistics, ings, Springer, 2011, pp. 59–76. WH Freeman and company, 2009. [15] H. Small, K. W. Boyack, R. Klavans, Identifying emerg- [34] Y. Zhou, F. Dong, Y. Liu, Z. Li, J. Du, L. Zhang, Forecast- ing topics in science and technology, Research policy ing emerging technologies using data augmentation 43 (2014) 1450–1467. and deep learning, Scientometrics 123 (2020) 1–29. [16] W. Glänzel, B. Thijs, Using ‘core documents’ for detect- [35] T. Daim, K. K. Lai, H. Yalcin, F. Alsoubie, V. Kumar, ing and labelling new emerging topics, Scientometrics Forecasting technological positioning through technol- 91 (2012) 399–416. ogy knowledge redundancy: Patent citation analysis [17] A. Tavazzi, D. P. David, J. Jang-Jaccard, A. Mermoud, of iot, cybersecurity, and blockchain, Technological 32 Forecasting and Social Change 161 (2020) 120329. [36] Y. Zhang, C. Zhang, P. Mayr, A. Suominen, Y. Ding, An editorial of “ai+ informetrics”: Robust models for large-scale analytics, Information Processing and Man- agement (2023) 103495. [37] W. Xu, J. Jang-Jaccard, A. Singh, Y. Wei, F. Sabrina, Im- proving performance of autoencoder-based network anomaly detection on nsl-kdd dataset, IEEE Access 9 (2021) 140136–140146. [38] Y. Wei, J. Jang-Jaccard, W. Xu, F. Sabrina, S. Camtepe, M. Boulic, Lstm-autoencoder-based anomaly detection for indoor air quality time-series data, IEEE Sensors Journal 23 (2023) 3787–3800. [39] Y. Wei, J. Jang-Jaccard, F. Sabrina, W. Xu, S. Camtepe, A. Dunmore, Reconstruction-based lstm-autoencoder for anomaly-based ddos attack detection over multivariate time-series data, arXiv preprint arXiv:2305.09475 (2023). 33
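Illustrative sketch. The Technology Class score of Eq. 12 and the Average Precision and Recall figures reported in Table 6 can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy emergence scores, the class memberships, and the ground-truth set are all hypothetical, and Average Precision is assumed to be the mean of precision@k over the ranks at which a relevant item appears, since the paper does not spell out its exact definition.

```python
from typing import Dict, List, Set


def technology_class_score(emergence: Dict[str, float], members: Set[str]) -> float:
    """Eq. 12: maximum emergence score among the technologies of one class."""
    # Members without a score are skipped; an empty class scores 0.0 (assumption).
    return max((emergence[t] for t in members if t in emergence), default=0.0)


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant item is found."""
    hits = 0
    precisions = []
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(precisions) if precisions else 0.0


def recall(ranked: List[str], relevant: Set[str]) -> float:
    """Share of the ground-truth items that appear in the ranked result list."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


# Hypothetical emergence scores and class memberships, for illustration only.
emergence = {
    "Deep Learning": 0.90,
    "Convolutional Neural Network": 0.80,
    "HTTP Cookie": 0.88,
    "Autonomous Car": 0.85,
    "Lidar": 0.40,
}
classes = {
    "Artificial Intelligence": {"Deep Learning", "Convolutional Neural Network"},
    "Web Technology": {"HTTP Cookie"},
    "Autonomous Driving": {"Autonomous Car", "Lidar"},
}

# Score every class with Eq. 12, then rank classes by descending score.
class_scores = {name: technology_class_score(emergence, m) for name, m in classes.items()}
ranked_classes = sorted(class_scores, key=class_scores.get, reverse=True)

# Hypothetical ground truth standing in for the analyst-derived evaluation set.
ground_truth = {"Artificial Intelligence", "Autonomous Driving", "Quantum Computing", "Blockchain"}

print(ranked_classes)
print(round(average_precision(ranked_classes, ground_truth), 2))
print(round(recall(ranked_classes, ground_truth), 2))
```

Taking the maximum rather than the mean lets a single strongly emerging technology surface its whole class, which is consistent with the observation that class-level results make technologies such as Virtual Assistant or Exoskeleton more visible.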