=Paper= {{Paper |id=Vol-2741/paper-08 |storemode=property |title=Temporal Analysis of Scientific Literature to Find Grand Challenges and Saturated Problems |pdfUrl=https://ceur-ws.org/Vol-2741/paper-08.pdf |volume=Vol-2741 |authors=Kritika Agrawal,Vikram Pudi |dblpUrl=https://dblp.org/rec/conf/sigir/AgrawalP20 }} ==Temporal Analysis of Scientific Literature to Find Grand Challenges and Saturated Problems== https://ceur-ws.org/Vol-2741/paper-08.pdf
  Temporal Analysis of Scientific Literature to
Find Grand Challenges and Saturated Problems

                         Kritika Agrawal and Vikram Pudi
                kritika.agrawal@research.iiit.ac.in, vikram@iiit.ac.in

    Data Sciences and Analytics Center, Kohli Center on Intelligent Systems IIIT,
                                 Hyderabad, India



        Abstract. As scientific communities grow and evolve, there is emer-
        gence of new techniques and decline of old ones. The tremendous amount
        of research publications available online aims to solve a lot of interest-
        ing problems. With time, some of the fields have been studied well and
        research problems solved to a great extent. However, there are few diffi-
        cult research problems which are yet not solved completely and interests
        a lot of researchers. In this paper, we aim to find research fields which
        are saturated and research fields which need to be explored yet. We first
        extract research problems in a semi supervised manner using a proven
        bootstrap framework from scientific literature of the last fifty years. We
        show how a simple statistics based model on top of the research prob-
        lems extracted can find the saturated fields and grand challenges in any
        domain of computer science.

        Keywords: scientific data extraction · temporal analysis · unsupervised
        learning


1     Introduction and Related Work

A consistently thriving global research community has over decades produced
a colossal amount of research papers that are published online, which makes it
crucial to organize this huge bulk of information systematically so that upcoming
researchers can navigate through efficiently and continue to push boundaries of
scientific research. Such an organization over intellectual information will not
only boost the rate of further research work but also augment researchers with
a better holistic view of development in research and the directions in which
it is evolving into. One of the first elementary steps we take as researchers is
to figure out which problems to focus on solving, and structured analysis on
present research status will help researchers identify critical problems and also
give insight about how they developed across time. Due to this it will be easier
to realize if particular problem has got no recent improvement in the recent past
    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0). BIRDS 2020, 30 July
    2020, Xi’an, China (online).


                                           47
and has moved into a thriving application and so on. Analysis is the foundation
to organization of cumulative knowledge garnered by the research community in
decades, and this paper deals with this first step in direction.
    [1] first proposed a task that defines scientific terms for 474 abstracts from
the ACL anthology [2] into three aspects: domain, technique, and focus. They
applied template-based bootstrapping on title and abstract of articles to tackle
the problem. They used handcrafted dependency based features. Based on this
study, [3] improved the performance by introducing hand- designed features to
the bootstrapping framework. They both tried to study the influence of different
scientific communities over the period of time. However, their work was limited
to the computational linguistics field. We propose a method for temporal analysis
of scientific literature of complete computer science domain.
    A recent challenge on Scientific Information Extraction (ScienceIE) [4] pro-
vided a dataset consisting of 500 scientific paragraphs with keyphrase annota-
tions for three categories: TASK, PROCESS, MATERIAL across three scientific
domains, Computer Science, Material Science, and Physics. This invited many
supervised and semi-supervised techniques in this field. Although all these tech-
niques can help extract important concepts of a research paper in a particular
domain, we need more general and scalable methods which can summarize the
complete research community and help in time based analysis. For this we used a
DBLP dataset which spans over fifty years and cover a wide variety of computer
science fields.
   As the first step of time based analysis, we aim to find saturated fields and
grand challenges. We define saturated fields as those research problems which
have been studied to a great extent and nothing much is left to achieve in
them. On the other hand grand challenges are defined as those problems which
have been tried to solve over a large period of time and are still worked upon
extensively.




2   Definitions


Saturated Problems: Problems which were very actively studied in the yester-
years and are now solved to a great extent. Example, parts of speech tagging in
NLP.
     Grand Challenges: Problems which were defined in yesteryears and are
still worked upon extensively. Example, machine translation in NLP. Research
during the 1980s typically relied on translation through some variety of interme-
diary linguistic representation involving morphological, syntactic, and semantic
analysis. In current times, research has focused on moving from domain specific
systems to domain independent translation systems.


                                       48
3     Approach
3.1   Identifying Aim and Method
Our approach is based on a proven method followed by [5] .Given a document,
we classify its phrases as Aim or Method. This approach is built on the ob-
servation that the semantics of the sentence of a research article containing a
phrase belonging to any of the concept type is similar across research papers.
To capture this semantic similarity, we use k nearest neighbour classifier on top
of state-of-the-art [6] domain based word embeddings. We start by extracting
features from a small set of annotated examples and used bootstrapping frame-
work [7] for extracting new features from unlabeled dataset. Finally, after some
iterations, we have a set of phrases classified as Aim or Method for each research
paper present in the dataset.

Merging of phrases which mean the same: We group the papers according
to the conference in which they were published. Then ∀ papers in the same
group, we cluster their extracted phrases by running DBSCAN [8] over vector
space representations of these phrases. The clusters are created based on lexical
similarity which is captured by cosine distance between phrase embeddings. [5]
A cluster i belonging to conference c1 and a cluster j belonging to conference c2
are merged if they have any common phrase. Finally we get clusters such that
phrases in each cluster have the same meaning.

3.2   Time based Analysis models
From the first step, we have research problems which have been studied as “AIM”
for the last fifty years. We also have techniques“METHOD” used to solve these
problems over these years. We first extract data for each research field, p, and
find the number of times paper published on them for each of the years in the
range 1971 to 2013.

Finding Saturated Problems:
 – Count vs year plot for such problems should show a steep decline in the
   current years.
 – Based on exploratory data analysis we came up with the following rules for
   finding saturated problems from the data collected above
 – We list a problem p as a saturated problem if:
     • T1 is the first year when the problem appeared in the literature. T2 is
       the last time when the problem appeared in the literature.
     • Count of p appearing as aim in T2 should be less than the count of p
       appearing as aim in T1
     • Peak of count vs year plot should have occured much before 2013.
     • Suppose problem p1 has peak at time t1 and problem p2 has peak at
       time t2 . P1 is a better candidate for saturated problem than p2 if the
       difference between T2 of p1 and t1 is more than the difference between
       T2 of p2 and t2 .

                                       49
Finding Grand Problems:
 – Count vs year plot for such problems should start from yester years and
   be consistent over the time. Peaks should be current years as well as yester
   years.
 – Based on exploratory data analysis we came up with the following rules for
   finding grand challenges from the data collected above
     • We list a problem p as a grand challenge if:
         ∗ T1 is the first year when the problem appeared in the literature. T2
           is the last time when the problem appeared in the literature.
         ∗ T1 for problem p to be classified as a grand challenge should be
           before 2000 and T2 after 2010. Time span between T1 and T2 should
           be more than 10 years.
         ∗ Count of p appearing as an aim in T2 should be more than some
           threshold. This is to rule out the edge cases where there is occurrence
           of few counts in current years.
     • We rank these problems based on the following formula:
         ∗ To capture the fact that more the span of the problem over the years,
           more likely it is a grand challenge; we propose rank to be directly
           proportional to the number of years it spans to.
         ∗ To capture the fact the count needs to be consistent  Pover    the years;
                                                                    n
           we propose rank to be inversely proportional to          i=1 (count[i] −
           count[i − 1]) where i iterates over all the years in which a problem p
           occurs.
                                                     n
                          Rank(p) ∝ Pn                                           (1)
                                        i=1 (count[i] − count[i − 1])
           Where i iterates over all the years in which problem p occurs,starting
           from the second entry and n is the total number of years.


4     Experiments and Results
4.1   Dataset
All experiments were done on DBLP citation network version 7. We chose DBLP
dataset to get a wide variety of research papers from different domains over a
large time period. It has 2,244,021 papers and 4,354,534 citation relationships.
After pruning out some papers and data cleaning we came up with 332,793
papers having 1,508,560 citation links. These papers range from 1936 to 2013.
However for the period 1936- 1971, the number of papers available were relatively
very less for time based analysis. So we pruned the data further and worked on
papers from 1971 to 2013.

4.2   Finding Grand challenges and Saturated Problems:
We got a total of 555,383 problems in the first step. Out of these, our algorithm
classified 599 as saturated problems and 1052 as grand challenges. To analyse the


                                        50
results, we extracted top 100 problems in both the categories. We represent our
results as word clouds [9] where the font and color of each word is proportional
to rank of that problem as extracted by our algorithm.


              Grand Challenges              Saturated Problems
              speech recognition                         disk arrays
              computer vision                    schema integration
              kolmogorov complexity             abductive reasoning
              real-time applications            reconfigurable mesh
              human-computer interaction      loop transformations
              query language             non-monotonic reasoning
              automatic parallelization             claw-free graphs
              stereo vision               facility location problem
              java                                 one-way function
              xml                                     robot learning

           Table 1. Top 10 Grand Challenges and Saturated Problems.




 – Discussion of Results:
   1. Speech recognition has a rich history that precedes Internet era. In 1952,
      three bell lab researchers made “Audrey” which recognized formats in
      power spectrum of each word. Investment in research in this area am-
      plified during 1970s with DARPA marking funding for understanding
      speech. IEEE speech groups were setup. In 1990s CMU led research
      funded Sphinx system which dominated DARPA 1992 evaluation. In
      2005 Siri came into life under Apple. From 2012 there was a major
      breakthrough in research and HMM models which were industry stan-
      dard till then were replaced by DNN. In 2014 end-to-end speech training
      was new paradigm that caught winds within DNN. In 2016 CMU and
      Google collectively introduced idea of “Attention” in training. In past
      three years there has been work on language agnostic ASR and more
      notable improvements kept on pressing. With importance of digital as-
      sistance, industry support has further expedited constant improvements
      every month over month till date. Clearly its a field with surreal ac-
      tive development and its not a surprise that our Model has correctly
      predicted this model as a Grand Challenge.
   2. Human Computer interaction is defined as a discipline concerned with
      the design and evolution of interactive computing systems for human
      use. HCI surfaced in the 1980s with the advent of personal computing,
      just as machines such as the Apple Macintosh, IBM PC 5150 started
      turning up in homes and offices. HCI soon became the subject of intense
      academic investigation. Initially, HCI researchers focused on how easy
      computers are to learn and use which has now also included to support
      the vision of personalized, adaptive, responsive, and proactive services,


                                       51
   adaptation and personalization methods and techniques that will need
   to consider how to incorporate AI and big data [10].
3. In algorithmic information theory, the Kolmogorov complexity of an ob-
   ject, such as a piece of text, is the length of a shortest computer program
   that produces the object as output. Research on this started in 1970s
   and is still going on.
4. The exact solution of facility location problem is known to be hard. And
   there are many approximation algorithms. No new research have been
   done on this problem. So clearly it is a saturated problem.
5. A one-way function is easy to compute on every input, but hard to
   invert. Although, The existence of true one-way functions is an open
   conjecture. In practice many functions such as those based on discrete
   Log are assumed to be work well since no polynomial time algorithm is
   known to invert them.
6. Loop optimization is the process of increasing execution speed and re-
   ducing overhead of loops. This problem is fairly solved and many modern
   compilers already use loop optimization techniques like Fission, Fusion,
   Inversion, Parallelisation etc.




                Fig. 1. Word Cloud for Grand Challenges


                                   52
                     Fig. 2. Word Cloud for Saturated Problems



5    Conclusions and Next Steps

In this paper, we show the temporal analysis of scientific literature by extracting
saturated problems and grand challenges. We propose this as the first step to-
wards time based analysis. We plan to further do time based analysis by finding
transition time for problems where transition time is defined as the time period
where a problem starts occurring as method instead of aim.


References

 1. Sonal Gupta and Christopher Manning. Analyzing the dynamics of research by
    extracting key aspects of scientific papers. In Proceedings of 5th International Joint
    Conference on Natural Language Processing, pages 1–9, Chiang Mai, Thailand,
    November 2011. Asian Federation of Natural Language Processing.
 2. Amjad Abu Jbara and Dragomir R. Radev. The acl anthology network corpus as
    a resource for nlp-based bibliometrics. 2013.
 3. Chen-Tse Tsai, Gourab Kundu, and Dan Roth. Concept-based analysis of sci-
    entific literature. In Proceedings of the 22nd ACM International Conference on
    Information & Knowledge Management, CIKM ’13, page 1733–1738, New York,
    NY, USA, 2013. Association for Computing Machinery.
 4. Isabelle Augenstein, Mrinal Das, Sebastian Riedel, Lakshmi Vikraman, and An-
    drew McCallum. SemEval 2017 task 10: ScienceIE - extracting keyphrases and
    relations from scientific publications. In Proceedings of the 11th International
    Workshop on Semantic Evaluation (SemEval-2017), pages 546–555, Vancouver,
    Canada, August 2017. Association for Computational Linguistics.


                                           53
 5. Kritika Agrawal, Aakash Mittal, and Vikram Pudi. Scalable, semi-supervised ex-
    traction of structured information from scientific literature. In Proceedings of the
    Workshop on Extracting Structured Knowledge from Scientific Publications, pages
    11–20, Minneapolis, Minnesota, June 2019. Association for Computational Lin-
    guistics.
 6. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-
    training of deep bidirectional transformers for language understanding. CoRR,
    abs/1810.04805, 2018.
 7. Sonal Gupta and Christopher Manning. Improved pattern learning for boot-
    strapped entity extraction. In Proceedings of the Eighteenth Conference on Com-
    putational Natural Language Learning, pages 98–108, Ann Arbor, Michigan, June
    2014. Association for Computational Linguistics.
 8. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based
    algorithm for discovering clusters in large spatial databases with noise. In Proceed-
    ings of the Second International Conference on Knowledge Discovery and Data
    Mining, KDD’96, page 226–231. AAAI Press, 1996.
 9. F. Heimerl, S. Lohmann, S. Lange, and T. Ertl. Word cloud explorer: Text analytics
    based on word clouds. In 2014 47th Hawaii International Conference on System
    Sciences, pages 1833–1842, 2014.
10. Chairs Constantine Stephanidis, Gavriel Salvendy, Members of the Group
    Margherita Antona, Jessie Y. C. Chen, Jianming Dong, Vincent G. Duffy, Xi-
    aowen Fang, Cali Fidopiastis, Gino Fragomeni, Limin Paul Fu, Yinni Guo, Don
    Harris, Andri Ioannou, Kyeong ah (Kate) Jeong, Shin’ichi Konomi, Heidi Krömker,
    Masaaki Kurosu, James R. Lewis, Aaron Marcus, Gabriele Meiselwitz, Abbas
    Moallem, Hirohiko Mori, Fiona Fui-Hoon Nah, Stavroula Ntoa, Pei-Luen Patrick
    Rau, Dylan Schmorrow, Keng Siau, Norbert Streitz, Wentao Wang, Sakae Ya-
    mamoto, Panayiotis Zaphiris, and Jia Zhou. Seven HCI grand challenges. Inter-
    national Journal of Human–Computer Interaction, 35(14):1229–1269, 2019.




                                           54