Temporal Analysis of Scientific Literature to Find Grand Challenges and Saturated Problems

Kritika Agrawal (kritika.agrawal@research.iiit.ac.in)

Vikram Pudi (vikram@iiit.ac.in)

Data Sciences and Analytics Center, Kohli Center on Intelligent Systems, IIIT Hyderabad, India

As scientific communities grow and evolve, new techniques emerge and old ones decline. The vast body of research publications available online addresses many interesting problems. Over time, some fields have been studied thoroughly and their research problems solved to a great extent. However, a few difficult research problems remain unsolved and continue to interest many researchers. In this paper, we aim to find research fields that are saturated and research fields that have yet to be explored. We first extract research problems in a semi-supervised manner, using a proven bootstrapping framework, from the scientific literature of the last fifty years. We then show how a simple statistics-based model on top of the extracted research problems can find the saturated fields and grand challenges in any domain of computer science.

scientific data extraction, temporal analysis, unsupervised learning

A consistently thriving global research community has, over decades, produced a colossal number of research papers that are published online. This makes it crucial to organize this information systematically, so that upcoming researchers can navigate it efficiently and continue to push the boundaries of scientific research. Such an organization of intellectual information will not only speed up further research but also give researchers a better holistic view of how research has developed and the directions in which it is evolving. One of the first steps we take as researchers is to figure out which problems to focus on, and a structured analysis of the present state of research helps researchers identify critical problems and gives insight into how those problems developed over time. This makes it easier to see, for example, whether a particular problem has had no improvement in the recent past or has matured into a thriving application. Analysis is the foundation for organizing the cumulative knowledge garnered by the research community over decades, and this paper deals with this first step in that direction.

[1] first proposed a task that classifies scientific terms from 474 abstracts of the ACL anthology [2] into three aspects: domain, technique, and focus. They applied template-based bootstrapping to the titles and abstracts of articles, using handcrafted dependency-based features. Building on this study, [3] improved performance by introducing hand-designed features into the bootstrapping framework. Both works studied the influence of different scientific communities over time. However, they were limited to the field of computational linguistics. We propose a method for temporal analysis of the scientific literature of the complete computer science domain.

A recent challenge on Scientific Information Extraction (ScienceIE) [4] provided a dataset of 500 scientific paragraphs with keyphrase annotations for three categories (TASK, PROCESS, and MATERIAL) across three scientific domains: Computer Science, Material Science, and Physics. This prompted many supervised and semi-supervised techniques in this field. Although all these techniques can help extract the important concepts of a research paper in a particular domain, we need more general and scalable methods that can summarize the complete research community and support time-based analysis. For this, we used a DBLP dataset that spans over fifty years and covers a wide variety of computer science fields.

As the first step of time-based analysis, we aim to find saturated fields and grand challenges. We define saturated fields as those research problems that have been studied to a great extent, with little left to achieve in them. Grand challenges, on the other hand, are defined as those problems that researchers have tried to solve over a long period of time and that are still worked on extensively.

Definitions

Saturated Problems: Problems that were very actively studied in the past and are now solved to a great extent. Example: part-of-speech tagging in NLP.

Grand Challenges: Problems that were defined in the past and are still worked on extensively. Example: machine translation in NLP. Research during the 1980s typically relied on translation through some variety of intermediary linguistic representation involving morphological, syntactic, and semantic analysis. More recently, research has focused on moving from domain-specific systems to domain-independent translation systems.

Approach

Identifying Aim and Method

Our approach is based on a proven method from [5]. Given a document, we classify its phrases as Aim or Method. This approach builds on the observation that sentences containing a phrase of a given concept type are semantically similar across research papers. To capture this semantic similarity, we use a k-nearest-neighbour classifier on top of state-of-the-art [6] domain-based word embeddings. We start by extracting features from a small set of annotated examples and use a bootstrapping framework [7] to extract new features from the unlabeled dataset. Finally, after some iterations, we have a set of phrases classified as Aim or Method for each research paper in the dataset.
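The nearest-neighbour step can be illustrated with a minimal sketch. It assumes that sentence vectors have already been computed; the toy 3-dimensional vectors below stand in for the domain-based embeddings, and the names `cosine_similarity`, `knn_label`, and the choice `k=3` are hypothetical, not taken from [5]. The bootstrapping loop itself is not shown.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_label(query, labelled, k=3):
    """Return the majority label among the k labelled vectors most similar to query."""
    ranked = sorted(
        labelled,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy seed set: hypothetical vectors, not real embeddings.
seeds = [
    ([0.9, 0.1, 0.0], "Aim"),
    ([0.8, 0.2, 0.1], "Aim"),
    ([0.7, 0.3, 0.0], "Aim"),
    ([0.1, 0.9, 0.2], "Method"),
    ([0.0, 0.8, 0.3], "Method"),
    ([0.2, 0.7, 0.1], "Method"),
]

print(knn_label([0.85, 0.15, 0.05], seeds))
print(knn_label([0.05, 0.90, 0.20], seeds))
```

In a bootstrapping setting, phrases labelled with high agreement among their neighbours would typically be added to the seed set before the next iteration; that policy is an assumption here rather than a description of the original system.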

Merging of phrases which mean the same: We group the papers according to the conference in which they were published. Then, for all papers in the same group, we cluster their extracted phrases by running DBSCAN [8] over vector space representations of these phrases. The clusters are created based on lexical similarity, which is captured by the cosine distance between phrase embeddings [5]. A cluster i belonging to conference c1 and a cluster j belonging to conference c2 are merged if they have any phrase in common. Finally, we get clusters such that the phrases in each cluster have the same meaning.
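The cross-conference merge can be read as finding connected components among clusters that share a phrase. The sketch below illustrates that reading with a small union-find structure. It assumes the per-conference DBSCAN clusters are already available as sets of phrases; the function name `merge_clusters`, the sample phrases, and the choice to merge transitively are assumptions made for illustration.

```python
def merge_clusters(clusters):
    """Merge phrase clusters that share at least one phrase (transitively)."""
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the representative, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # phrase -> index of the first cluster that contains it
    for index, cluster in enumerate(clusters):
        for phrase in cluster:
            if phrase in first_seen:
                union(first_seen[phrase], index)
            else:
                first_seen[phrase] = index

    merged = {}
    for index, cluster in enumerate(clusters):
        merged.setdefault(find(index), set()).update(cluster)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in merged.values())


# Hypothetical clusters from two conferences plus an unrelated one.
clusters = [
    {"machine translation", "statistical machine translation"},
    {"machine translation", "neural machine translation"},
    {"pos tagging", "part-of-speech tagging"},
]
result = merge_clusters(clusters)
for group in result:
    print(group)
```

Whether merging should be transitive (A shares a phrase with B, B with C, so all three merge) is not stated in the text; a stricter pairwise policy would need a different implementation.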

Time based Analysis models

From the first step, we have the research problems that have been studied as "AIM" over the last fifty years. We also have the techniques ("METHOD") used to solve these problems over these years. For each research field p, we first extract its data and count the number of papers published on it in each year from 1971 to 2013.
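The per-year counting step can be sketched as follows. The input representation, a list of `(year, aims)` pairs with one pair per paper, and the function name `yearly_counts` are assumptions for illustration; the actual data layout is not described in the text.

```python
from collections import Counter, defaultdict


def yearly_counts(papers, start_year=1971, end_year=2013):
    """Map each problem to a Counter {year: number of papers with that problem as aim}."""
    counts = defaultdict(Counter)
    for year, aims in papers:
        if start_year <= year <= end_year:
            for problem in set(aims):  # count each paper once per problem
                counts[problem][year] += 1
    return counts


# Hypothetical toy input: (publication year, extracted aim phrases).
papers = [
    (1975, ["speech recognition"]),
    (1975, ["speech recognition", "loop optimization"]),
    (2012, ["speech recognition"]),
    (1960, ["speech recognition"]),  # outside the analysed range, ignored
]
counts = yearly_counts(papers)
print(dict(counts["speech recognition"]))
```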

Finding Saturated Problems:

- The count vs. year plot for such problems should show a steep decline in recent years.
- Based on exploratory data analysis, we came up with the following rules for finding saturated problems from the data collected above.
- We list a problem p as a saturated problem if the following hold, where T1 is the first year in which the problem appeared in the literature and T2 is the last year in which it appeared:
  * The count of p appearing as an aim in T2 is less than the count of p appearing as an aim in T1.
  * The peak of the count vs. year plot occurred well before 2013.
- Suppose problem p1 has its peak at time t1 and problem p2 has its peak at time t2. Then p1 is a better candidate for a saturated problem than p2 if the difference between T2 of p1 and t1 is larger than the difference between T2 of p2 and t2.
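The rules above can be sketched in code under some stated assumptions. The text does not say how far before 2013 the peak must lie, so the `min_peak_gap` parameter (10 years here) is a hypothetical choice, as are the function names and the toy counts.

```python
def saturation_gap(counts, last_data_year=2013, min_peak_gap=10):
    """Return T2 minus the peak year if the saturated rules hold, else None.

    counts: dict {year: number of papers with the problem as aim}.
    """
    years = sorted(year for year, n in counts.items() if n > 0)
    if not years:
        return None
    t1, t2 = years[0], years[-1]
    # max() keeps the first maximum, so ties resolve to the earliest year.
    peak_year = max(years, key=lambda y: counts[y])
    if counts[t2] >= counts[t1]:
        return None  # count in T2 must be below count in T1
    if last_data_year - peak_year < min_peak_gap:
        return None  # peak is not far enough in the past (assumed threshold)
    return t2 - peak_year


def rank_saturated(problem_counts):
    """List (problem, gap) pairs, best saturated candidates first."""
    scored = []
    for problem, counts in problem_counts.items():
        gap = saturation_gap(counts)
        if gap is not None:
            scored.append((problem, gap))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical yearly counts for three problems.
data = {
    "problem a": {1975: 5, 1980: 12, 1990: 6, 2005: 1},
    "problem b": {1985: 4, 1995: 9, 2010: 2},
    "problem c": {1990: 2, 2012: 8},
}
ranking = rank_saturated(data)
print(ranking)
```

Here "problem c" is rejected because its count in T2 is not below its count in T1, and "problem a" ranks above "problem b" because more years separate its peak from its last appearance.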

Finding Grand Challenges:

- The count vs. year plot for such problems should start in early years and remain consistent over time. Peaks should occur in recent years as well as in earlier years.
- Based on exploratory data analysis, we came up with the following rules for finding grand challenges from the data collected above.
- We list a problem p as a grand challenge if the following hold, where T1 is the first year in which the problem appeared in the literature and T2 is the last year in which it appeared:
  * T1 is before 2000 and T2 is after 2010. The time span between T1 and T2 is more than 10 years.
  * The count of p appearing as an aim in T2 is more than some threshold. This rules out edge cases where only a few occurrences appear in recent years.
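The eligibility rules above, together with the ranking by span and year-to-year consistency that the paper applies to the surviving candidates, can be sketched as follows. Several details are assumptions rather than the authors' choices: the threshold value `min_last_count=5`, the use of absolute year-to-year differences (signed differences would telescope to a single value), and the `+ 1` in the denominator that avoids division by zero. All names are hypothetical.

```python
def is_grand_challenge(counts, min_last_count=5):
    """Check the eligibility rules on a dict {year: count as aim}."""
    years = sorted(year for year, n in counts.items() if n > 0)
    if not years:
        return False
    t1, t2 = years[0], years[-1]
    return (
        t1 < 2000
        and t2 > 2010
        and (t2 - t1) > 10
        and counts[t2] > min_last_count  # assumed threshold value
    )


def grand_challenge_score(counts):
    """Score grows with the span in years and shrinks with year-to-year variation."""
    years = sorted(year for year, n in counts.items() if n > 0)
    span = years[-1] - years[0]
    # Assumption: absolute differences between consecutive observed years.
    variation = sum(abs(counts[b] - counts[a]) for a, b in zip(years, years[1:]))
    return span / (1 + variation)


# Hypothetical yearly counts.
steady = {1975: 10, 1990: 11, 2005: 10, 2012: 12}
bursty = {1980: 2, 1995: 40, 2011: 6}
recent = {2005: 20, 2012: 30}

for name, counts in [("steady", steady), ("bursty", bursty), ("recent", recent)]:
    if is_grand_challenge(counts):
        print(name, round(grand_challenge_score(counts), 3))
    else:
        print(name, "not eligible")
```

With these toy counts, the steady problem scores higher than the bursty one, and the recent problem is excluded because it first appears after 2000.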

We rank these problems based on the following criteria:

  * To capture the fact that the longer a problem spans over the years, the more likely it is to be a grand challenge, we propose the rank to be directly proportional to the number of years it spans.
  * To capture the fact that the count needs to be consistent over the years, we propose the rank to be inversely proportional to $\sum_{i=1}^{n} (\mathrm{count}[i] - \mathrm{count}[i-1])$, where i iterates over all the years in which a problem p occurs.

All experiments were done on the DBLP citation network, version 7. We chose the DBLP dataset to get a wide variety of research papers from different domains over a long time period. It has 2,244,021 papers and 4,354,534 citation relationships. After pruning out some papers and cleaning the data, we were left with 332,793 papers and 1,508,560 citation links. These papers range from 1936 to 2013. However, for the period 1936-1971, too few papers were available for time-based analysis, so we pruned the data further and worked on papers from 1971 to 2013.

For the results, we extracted the top 100 problems in both categories. We represent our results as word clouds [9], where the font size and color of each word reflect the rank of that problem as extracted by our algorithm.

Discussion of Results:

1. Speech recognition has a rich history that precedes the Internet era. In 1952, three Bell Labs researchers built "Audrey", which recognized formants in the power spectrum of each word. Investment in research in this area grew during the 1970s, with DARPA funding work on speech understanding, and IEEE speech groups were set up. In the 1990s, CMU-led research produced the Sphinx system, which dominated the DARPA 1992 evaluation. In 2011, Apple released Siri. From 2012 there was a major breakthrough in research, and HMM models, which had been the industry standard until then, were replaced by DNNs. In 2014, end-to-end training emerged as a new paradigm within DNN-based speech recognition. In 2016, CMU and Google jointly introduced the idea of "attention" in training. In the past three years, there has been work on language-agnostic ASR, and notable improvements have continued. With the growing importance of digital assistants, industry support has further accelerated steady improvements to date. Clearly, it is a field under very active development, and it is no surprise that our model identifies this problem as a grand challenge.

2. Human-computer interaction (HCI) is defined as a discipline concerned with the design and evaluation of interactive computing systems for human use. HCI surfaced in the 1980s with the advent of personal computing, just as machines such as the Apple Macintosh and the IBM PC 5150 started turning up in homes and offices. HCI soon became the subject of intense academic investigation. Initially, HCI researchers focused on how easy computers are to learn and use. The focus has since broadened to supporting the vision of personalized, adaptive, responsive, and proactive services, with adaptation and personalization methods and techniques that need to consider how to incorporate AI and big data [10].

3. In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of a shortest computer program that produces the object as output. Research on this started in the 1970s and is still going on.

4. Solving the facility location problem exactly is known to be hard, and many approximation algorithms exist. No new research has been done on this problem, so clearly it is a saturated problem.

5. A one-way function is easy to compute on every input but hard to invert. The existence of true one-way functions is an open conjecture. In practice, however, many functions, such as those based on the discrete logarithm, are assumed to work well, since no polynomial-time algorithm is known to invert them.

6. Loop optimization is the process of increasing the execution speed and reducing the overhead of loops. This problem is largely solved, and many modern compilers already use loop optimization techniques such as fission, fusion, inversion, and parallelisation.

Fig. 2. Word Cloud for Saturated Problems

Conclusions and Next Steps

In this paper, we present a temporal analysis of scientific literature that extracts saturated problems and grand challenges. We propose this as the first step towards time-based analysis. We plan to extend this analysis by finding the transition time for problems, where the transition time is defined as the time period in which a problem starts occurring as a method instead of an aim.

References

1. Sonal Gupta and Christopher Manning. Analyzing the dynamics of research by extracting key aspects of scientific papers. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 1–9, Chiang Mai, Thailand, November 2011. Asian Federation of Natural Language Processing.

2. Amjad Abu Jbara and Dragomir R. Radev. The ACL anthology network corpus as a resource for NLP-based bibliometrics. 2013.

3. Chen-Tse Tsai, Gourab Kundu, and Dan Roth. Concept-based analysis of scientific literature. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 1733–1738, New York, NY, USA, 2013. Association for Computing Machinery.

4. Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and Andrew McCallum. SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 546–555, Vancouver, Canada, August 2017. Association for Computational Linguistics.

5. Kritika Agrawal, Aakash Mittal, and Vikram Pudi. Scalable, semi-supervised extraction of structured information from scientific literature. In Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications, pages 11–20, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.

6. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.

7. Sonal Gupta and Christopher Manning. Improved pattern learning for bootstrapped entity extraction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 98–108, Ann Arbor, Michigan, June 2014. Association for Computational Linguistics.

8. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD '96, pages 226–231. AAAI Press, 1996.

9. Heimerl, Lohmann, Lange, and Ertl. Word cloud explorer: Text analytics based on word clouds. In 2014 47th Hawaii International Conference on System Sciences, pages 1833–1842, 2014.

10. Chairs: Constantine Stephanidis, Gavriel Salvendy; Members of the Group: Margherita Antona, Jessie Y. C. Chen, Jianming Dong, Vincent G. Duffy, Xiaowen Fang, Cali Fidopiastis, Gino Fragomeni, Limin Paul Fu, Yinni Guo, Don Harris, Andri Ioannou, Kyeong-ah (Kate) Jeong, Shin'ichi Konomi, Heidi Krömker, Masaaki Kurosu, James R. Lewis, Aaron Marcus, Gabriele Meiselwitz, Abbas Moallem, Hirohiko Mori, Fiona Fui-Hoon Nah, Stavroula Ntoa, Pei-Luen Patrick Rau, Dylan Schmorrow, Keng Siau, Norbert Streitz, Wentao Wang, Sakae Yamamoto, Panayiotis Zaphiris, and Jia Zhou. Seven HCI grand challenges. International Journal of Human–Computer Interaction, 35(14):1229–1269, 2019.