<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Temporal Analysis of Scientific Literature to Find Grand Challenges and Saturated Problems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kritika Agrawal</string-name>
          <email>kritika.agrawal@research.iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vikram Pudi</string-name>
          <email>vikram@iiit.ac.in</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Sciences and Analytics Center, Kohli Center on Intelligent Systems, IIIT</institution>
          ,
          <addr-line>Hyderabad</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>47</fpage>
      <lpage>54</lpage>
      <abstract>
        <p>As scientific communities grow and evolve, new techniques emerge and old ones decline. The tremendous number of research publications available online addresses many interesting problems. Over time, some fields have been studied well and their research problems solved to a great extent. However, a few difficult research problems remain unsolved and continue to interest many researchers. In this paper, we aim to find research fields which are saturated and research fields which are yet to be explored. We first extract research problems in a semi-supervised manner, using a proven bootstrapping framework, from the scientific literature of the last fifty years. We then show how a simple statistics-based model on top of the extracted research problems can find the saturated fields and grand challenges in any domain of computer science.</p>
      </abstract>
      <kwd-group>
        <kwd>scientific data extraction</kwd>
        <kwd>temporal analysis</kwd>
        <kwd>unsupervised learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>A consistently thriving global research community has, over decades, produced
a colossal number of research papers that are published online. This makes it
crucial to organize this bulk of information systematically so that upcoming
researchers can navigate it efficiently and continue to push the boundaries of
scientific research. Such an organization of intellectual information will not
only boost the rate of further research but also give researchers
a more holistic view of how research has developed and the directions in which
it is evolving. One of the first elementary steps we take as researchers is
to figure out which problems to focus on, and a structured analysis of the
present state of research will help researchers identify critical problems and
give insight into how those problems developed over time. It then becomes easier
to recognize whether a particular problem has seen no improvement in the recent past,
has moved into a thriving application, and so on. Analysis is the foundation
for organizing the cumulative knowledge garnered by the research community over
decades, and this paper deals with a first step in this direction.</p>
      <p>
        Gupta and Manning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] first proposed a task that categorizes scientific terms from 474 abstracts of
the ACL Anthology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] into three aspects: domain, technique, and focus. They
applied template-based bootstrapping to the titles and abstracts of articles,
using handcrafted dependency-based features. Building on this
study, Tsai et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] improved performance by introducing hand-designed features to
the bootstrapping framework. Both works studied the influence of different
scientific communities over time. However, their analysis was limited
to the field of computational linguistics. We propose a method for temporal analysis
of the scientific literature of the complete computer science domain.
      </p>
      <p>
        A recent challenge on Scientific Information Extraction (ScienceIE) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
provided a dataset consisting of 500 scientific paragraphs with keyphrase
annotations for three categories (TASK, PROCESS, and MATERIAL) across three scientific
domains: Computer Science, Material Science, and Physics. This invited many
supervised and semi-supervised techniques in this field. Although all these
techniques can help extract the important concepts of a research paper in a particular
domain, we need more general and scalable methods that can summarize the
complete research community and support time-based analysis. For this we used a
DBLP dataset, which spans over fifty years and covers a wide variety of computer
science fields.
      </p>
      <p>As the first step of time-based analysis, we aim to find saturated fields and
grand challenges. We define saturated fields as those research problems which
have been studied to a great extent and in which little is left to achieve.
Grand challenges, on the other hand, are defined as those problems which
researchers have tried to solve over a long period of time and which are still
worked upon extensively.</p>
    </sec>
    <sec id="sec-2">
      <title>Definitions</title>
      <p>Saturated Problems: Problems which were very actively studied in the
past and are now solved to a great extent. An example is part-of-speech tagging in
NLP.</p>
      <p>Grand Challenges: Problems which were defined in the past and are
still worked upon extensively. An example is machine translation in NLP. Research
during the 1980s typically relied on translation through some variety of
intermediary linguistic representation involving morphological, syntactic, and semantic
analysis. More recently, research has focused on moving from domain-specific
systems to domain-independent translation systems.</p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <sec id="sec-3-1">
        <title>Identifying Aim and Method</title>
        <p>
          Our approach is based on a proven method followed by [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Given a document,
we classify its phrases as Aim or Method. This approach is built on the
observation that the semantics of a sentence in a research article that contains a
phrase belonging to one of the concept types is similar across research papers.
To capture this semantic similarity, we use a k-nearest-neighbour classifier on top
of state-of-the-art [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] domain-based word embeddings. We start by extracting
features from a small set of annotated examples and use a bootstrapping
framework [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] to extract new features from the unlabeled dataset. Finally, after some
iterations, we have a set of phrases classified as Aim or Method for each research
paper present in the dataset.
        </p>
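        <p>The classification step can be illustrated with a minimal sketch of a k-nearest-neighbour vote over cosine similarity. The three-dimensional vectors, the seed labels, and the names cosine_similarity and knn_label are hypothetical and chosen only for illustration; the actual system uses contextual word embeddings and bootstrapped features that are not reproduced here.</p>
        <preformat preformat-type="code">
```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_label(query, labeled, k=3):
    """Majority label among the k labeled vectors most similar to query.

    labeled is a list of (vector, label) pairs.
    """
    ranked = sorted(
        labeled,
        key=lambda pair: cosine_similarity(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy embeddings (assumption): real vectors would come from a language model.
seed_examples = [
    ([0.9, 0.1, 0.0], "Aim"),
    ([0.8, 0.2, 0.1], "Aim"),
    ([0.1, 0.9, 0.2], "Method"),
    ([0.0, 0.8, 0.3], "Method"),
]

label = knn_label([0.85, 0.15, 0.05], seed_examples, k=3)
```
        </preformat>
        <p>In a bootstrapping setting, phrases labelled with high agreement among their neighbours could be added to the seed set before the next iteration; how confident a vote must be is a design choice that this sketch leaves open.</p>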
      </sec>
      <sec id="sec-3-2">
        <title>Merging of phrases which mean the same</title>
        <p>
          We group the papers according to the conference in which they were published. Then, for papers in the same
group, we cluster their extracted phrases by running DBSCAN [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] over vector
space representations of these phrases. The clusters are created based on lexical
similarity, which is captured by the cosine distance between phrase embeddings [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
A cluster i belonging to conference c1 and a cluster j belonging to conference c2
are merged if they have any phrase in common. Finally, we obtain clusters such that
the phrases in each cluster have the same meaning.
        </p>
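        <p>The cross-conference merge can be sketched with a union-find structure over the per-conference clusters. This is only an illustration under stated assumptions: the function name merge_clusters and the example phrases are hypothetical, each cluster is represented as a plain set of phrase strings, and merging is treated as transitive (if A shares a phrase with B and B shares one with C, all three are merged), which the text does not state explicitly.</p>
        <preformat preformat-type="code">
```python
def merge_clusters(clusters):
    """Merge phrase clusters that share at least one phrase.

    clusters: list of sets of phrases, pooled over all conferences.
    Returns a list of merged sets.
    """
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the representative, halving paths on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_owner = {}  # phrase -> index of the first cluster containing it
    for idx, cluster in enumerate(clusters):
        for phrase in cluster:
            if phrase in first_owner:
                parent[find(idx)] = find(first_owner[phrase])
            else:
                first_owner[phrase] = idx

    merged = {}
    for idx, cluster in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(cluster)
    return list(merged.values())


# Hypothetical clusters from two conferences plus one unrelated cluster.
example = [
    {"pos tagging", "part of speech tagging"},
    {"part of speech tagging", "part-of-speech tagging"},
    {"machine translation"},
]
merged = merge_clusters(example)
```
        </preformat>
        <p>Here the first two clusters share a phrase and collapse into one, while the third is left untouched.</p>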
      </sec>
      <sec id="sec-3-3">
        <title>Time-based Analysis Models</title>
        <p>From the first step, we have the research problems which have been studied as "AIM"
over the last fifty years. We also have the techniques ("METHOD") used to solve these
problems over these years. We first extract the data for each research field p and
find the number of papers published on it in each of the years in the
range 1971 to 2013.</p>
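        <p>As a rough sketch of this counting step, the snippet below builds a per-problem, per-year count table. The input format (pairs of a publication year and the list of canonical aim phrases of one paper) and the name yearly_counts are assumptions made for illustration.</p>
        <preformat preformat-type="code">
```python
from collections import defaultdict


def yearly_counts(papers, first_year=1971, last_year=2013):
    """Count, for each problem, the papers per year that list it as an aim.

    papers: iterable of (year, aims) pairs, where aims is a collection of
    canonical problem names (hypothetical input format).
    Returns {problem: {year: count}}; years with no papers are not stored.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for year, aims in papers:
        if year not in range(first_year, last_year + 1):
            continue  # outside the analysed period
        for problem in set(aims):  # count each paper once per problem
            counts[problem][year] += 1
    return {problem: dict(by_year) for problem, by_year in counts.items()}


# Hypothetical input: one tuple per paper.
papers = [
    (1980, ["speech recognition"]),
    (1980, ["speech recognition", "machine translation"]),
    (1965, ["speech recognition"]),
    (2013, ["machine translation", "machine translation"]),
]
counts = yearly_counts(papers)
```
        </preformat>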
      </sec>
      <sec id="sec-3-4">
        <title>Finding Saturated Problems:</title>
        <p>The count versus year plot for such problems should show a steep decline in
recent years. Based on exploratory data analysis, we came up with the following rules for
finding saturated problems from the data collected above. Let T1 be the first year in which
a problem p appeared in the literature and T2 the last year in which it appeared.
We list p as a saturated problem if:</p>
        <list list-type="bullet">
          <list-item><p>The count of p appearing as an aim in T2 is less than the count of p
appearing as an aim in T1.</p></list-item>
          <list-item><p>The peak of the count versus year plot occurred well before 2013.</p></list-item>
        </list>
        <p>Suppose problem p1 has its peak at time t1 and problem p2 has its peak at
time t2. Then p1 is a better candidate for a saturated problem than p2 if the
difference between T2 of p1 and t1 is greater than the difference between
T2 of p2 and t2.</p>
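        <p>The saturation rules can be sketched as a small scoring function. The name saturation_score and the min_gap parameter, which makes "much before 2013" concrete, are hypothetical; the text does not give a numeric cut-off, so the value used here is only an assumption.</p>
        <preformat preformat-type="code">
```python
def saturation_score(counts, latest_year=2013, min_gap=10):
    """Return T2 - peak_year if the problem looks saturated, else None.

    counts: {year: number of papers with the problem as aim}.
    min_gap: assumed minimum distance (in years) between the peak and
    latest_year for the peak to count as "much before" latest_year.
    A larger returned value means a better saturated-problem candidate.
    """
    years = sorted(y for y, c in counts.items() if c > 0)
    if not years:
        return None
    t1, t2 = years[0], years[-1]
    # max() keeps the first maximum, so ties resolve to the earliest year.
    peak_year = max(years, key=lambda y: counts[y])
    declined = counts[t1] > counts[t2]
    peaked_early = latest_year - peak_year >= min_gap
    if declined and peaked_early:
        return t2 - peak_year
    return None


# Hypothetical count series for two problems.
fading = saturation_score({1975: 5, 1980: 20, 1990: 8, 2005: 1})
growing = saturation_score({1990: 2, 2012: 30})
```
        </preformat>
        <p>Sorting the candidate problems by this score in descending order gives the ordering described by the pairwise comparison of p1 and p2.</p>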
      </sec>
      <sec id="sec-3-5">
        <title>Finding Grand Challenges:</title>
        <p>The count versus year plot for such problems should start in the early years and
remain consistent over time, with peaks in recent years as well as in earlier
years. Based on exploratory data analysis, we came up with the following rules for
finding grand challenges from the data collected above. Let T1 be the first year in which
a problem p appeared in the literature and T2 the last year in which it appeared.
We list p as a grand challenge if:</p>
        <list list-type="bullet">
          <list-item><p>T1 is before 2000 and T2 is after 2010.</p></list-item>
          <list-item><p>The time span between T1 and T2 is more than 10 years.</p></list-item>
          <list-item><p>The count of p appearing as an aim in T2 is more than some
threshold. This rules out edge cases in which the problem occurs only
a few times in recent years.</p></list-item>
        </list>
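        <p>A minimal sketch of this filter follows. The function name is_grand_challenge and the default value of min_recent_count are hypothetical; the text only says "some threshold" and does not give a number.</p>
        <preformat preformat-type="code">
```python
def is_grand_challenge(counts, min_recent_count=5):
    """Apply the grand-challenge filter to one problem.

    counts: {year: number of papers with the problem as aim}.
    min_recent_count: assumed threshold for the count in the last year T2.
    """
    years = sorted(y for y, c in counts.items() if c > 0)
    if not years:
        return False
    t1, t2 = years[0], years[-1]
    return (
        2000 > t1                          # first seen before 2000
        and t2 > 2010                      # still seen after 2010
        and t2 - t1 > 10                   # spans more than 10 years
        and counts[t2] > min_recent_count  # not just a few stray papers
    )


# Hypothetical count series.
long_running = is_grand_challenge({1975: 3, 1995: 12, 2012: 20})
faded = is_grand_challenge({1975: 3, 1995: 12, 2012: 1})
recent_only = is_grand_challenge({2005: 10, 2013: 40})
```
        </preformat>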
        <p>
          We rank these problems using the following two criteria:
          <list list-type="bullet">
            <list-item><p>The longer a problem spans over the years, the more likely it is
to be a grand challenge; we therefore propose the rank to be directly
proportional to the number of years the problem spans.</p></list-item>
            <list-item><p>The count needs to be consistent over the years;
we therefore propose the rank to be inversely proportional to
<inline-formula><tex-math>\sum_{i=1}^{n} (count[i] - count[i-1])</tex-math></inline-formula>,
where i iterates over all the years in which the problem p occurs.</p></list-item>
          </list>
        </p>
        <p>
          All experiments were done on the DBLP citation network, version 7. We chose the DBLP
dataset to get a wide variety of research papers from different domains over a
large time period. It has 2,244,021 papers and 4,354,534 citation relationships.
After pruning some papers and cleaning the data, we were left with 332,793
papers having 1,508,560 citation links. These papers range from 1936 to 2013.
However, for the period 1936 to 1971, too few papers were available
for time-based analysis, so we pruned the data further and worked on
papers from 1971 to 2013.
        </p>
        <p>
          For the results, we extracted the top 100 problems in both categories. We represent our
results as word clouds [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] where the font and color of each word are proportional
to the rank of that problem as extracted by our algorithm.
        </p>
        <p>
          Discussion of Results:
          <list list-type="order">
            <list-item><p>Speech recognition has a rich history that precedes the Internet era. In 1952,
three Bell Labs researchers built "Audrey", which recognized formants in the
power spectrum of each word. Investment in research in this area
amplified during the 1970s, with DARPA funding work on speech
understanding. IEEE speech groups were set up. In the 1990s, CMU-led research
produced the Sphinx system, which dominated the DARPA 1992 evaluation. In
2011 Apple launched Siri. From 2012 there was a major
breakthrough in research: HMM-based models, the industry
standard until then, were replaced by deep neural networks (DNNs). In 2014, end-to-end training
emerged as a new paradigm within DNN-based speech recognition. In 2016 CMU and
Google collectively introduced the idea of "Attention" in training. In the past
three years there has been work on language-agnostic ASR, and
notable improvements have continued. With the growing importance of digital
assistants, industry support has further accelerated these improvements
to date. Clearly it is a field with very
active development, and it is not a surprise that our model has correctly
identified this problem as a grand challenge.</p></list-item>
            <list-item><p>Human-computer interaction (HCI) is defined as a discipline concerned with
the design and evaluation of interactive computing systems for human
use. HCI surfaced in the 1980s with the advent of personal computing,
just as machines such as the Apple Macintosh and the IBM PC 5150 started
turning up in homes and offices. HCI soon became the subject of intense
academic investigation. Initially, HCI researchers focused on how easy
computers are to learn and use; the field has since expanded to support
the vision of personalized, adaptive, responsive, and proactive services,
with adaptation and personalization methods and techniques that will need
to consider how to incorporate AI and big data [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].</p></list-item>
            <list-item><p>In algorithmic information theory, the Kolmogorov complexity of an
object, such as a piece of text, is the length of a shortest computer program
that produces the object as output. Research on this started in the 1970s
and is still going on.</p></list-item>
            <list-item><p>The exact solution of the facility location problem is known to be hard, and
there are many approximation algorithms. No new research has been
done on this problem, so it is clearly a saturated problem.</p></list-item>
            <list-item><p>A one-way function is easy to compute on every input but hard to
invert. The existence of true one-way functions is, however, an open
conjecture. In practice, many functions, such as those based on the discrete
logarithm, are assumed to work well since no polynomial-time algorithm is
known to invert them.</p></list-item>
            <list-item><p>Loop optimization is the process of increasing execution speed and
reducing the overhead of loops. This problem is largely solved, and many modern
compilers already use loop optimization techniques such as fission, fusion,
inversion, and parallelisation.</p></list-item>
          </list>
        </p>
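        <p>Returning to the ranking rule for grand challenges given earlier in this section, the two proportionality criteria can be combined into a single score, sketched below under several assumptions. The name grand_challenge_rank is hypothetical. The span is taken as the number of calendar years from first to last occurrence. The sketch uses the absolute year-to-year change, which is an assumption: the formula as printed has no absolute value, and without one the sum telescopes to the difference between the last and first counts. A smoothing constant is added to the denominator to avoid division by zero for perfectly flat series.</p>
        <preformat preformat-type="code">
```python
def grand_challenge_rank(counts, smoothing=1.0):
    """Score a problem: longer span and steadier counts give a higher score.

    counts: {year: number of papers with the problem as aim}.
    smoothing: assumed constant that keeps the denominator positive.
    """
    years = sorted(y for y, c in counts.items() if c > 0)
    if not years:
        return 0.0
    span = years[-1] - years[0] + 1
    series = [counts[y] for y in years]
    # Total absolute change between consecutive occurrence years
    # (absolute value is an assumption, see the lead-in).
    variation = sum(abs(b - a) for a, b in zip(series, series[1:]))
    return span / (variation + smoothing)


# Hypothetical count series with the same span but different stability.
steady = grand_challenge_rank({1980: 10, 1990: 10, 2012: 10})
bursty = grand_challenge_rank({1980: 1, 1990: 50, 2012: 2})
```
        </preformat>
        <p>With equal spans, the steadier series receives the higher score, which matches the intent that consistent attention over many years marks a grand challenge.</p>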
        <p>Fig. 2. Word Cloud for Saturated Problems</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Next Steps</title>
      <p>In this paper, we present a temporal analysis of scientific literature that extracts
saturated problems and grand challenges. We propose this as the first step
towards time-based analysis. We plan to extend this analysis by finding the
transition time for problems, where transition time is defined as the time period
in which a problem starts occurring as a method instead of an aim.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>Sonal</given-names> <surname>Gupta</surname></string-name>
          and
          <string-name><given-names>Christopher</given-names> <surname>Manning</surname></string-name>
          .
          <article-title>Analyzing the dynamics of research by extracting key aspects of scientific papers</article-title>
          .
          <source>In Proceedings of 5th International Joint Conference on Natural Language Processing</source>
          , pages
          <fpage>1</fpage>
          –
          <lpage>9</lpage>
          ,
          <publisher-loc>Chiang Mai, Thailand</publisher-loc>
          ,
          <year>November 2011</year>
          .
          <publisher-name>Asian Federation of Natural Language Processing</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>Amjad</given-names> <surname>Abu Jbara</surname></string-name>
          and
          <string-name><given-names>Dragomir R.</given-names> <surname>Radev</surname></string-name>
          .
          <article-title>The ACL anthology network corpus as a resource for NLP-based bibliometrics</article-title>
          .
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>Chen-Tse</given-names> <surname>Tsai</surname></string-name>
          ,
          <string-name><given-names>Gourab</given-names> <surname>Kundu</surname></string-name>
          , and
          <string-name><given-names>Dan</given-names> <surname>Roth</surname></string-name>
          .
          <article-title>Concept-based analysis of scientific literature</article-title>
          .
          <source>In Proceedings of the 22nd ACM International Conference on Information &amp; Knowledge Management, CIKM '13</source>
          , pages
          <fpage>1733</fpage>
          –
          <lpage>1738</lpage>
          ,
          <publisher-loc>New York, NY, USA</publisher-loc>
          ,
          <year>2013</year>
          .
          <publisher-name>Association for Computing Machinery</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><given-names>Isabelle</given-names> <surname>Augenstein</surname></string-name>
          ,
          <string-name><given-names>Mrinal</given-names> <surname>Das</surname></string-name>
          ,
          <string-name><given-names>Sebastian</given-names> <surname>Riedel</surname></string-name>
          ,
          <string-name><given-names>Lakshmi</given-names> <surname>Vikraman</surname></string-name>
          , and
          <string-name><given-names>Andrew</given-names> <surname>McCallum</surname></string-name>
          .
          <article-title>SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications</article-title>
          .
          <source>In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)</source>
          , pages
          <fpage>546</fpage>
          –
          <lpage>555</lpage>
          ,
          <publisher-loc>Vancouver, Canada</publisher-loc>
          ,
          <year>August 2017</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Kritika</given-names> <surname>Agrawal</surname></string-name>
          ,
          <string-name><given-names>Aakash</given-names> <surname>Mittal</surname></string-name>
          , and
          <string-name><given-names>Vikram</given-names> <surname>Pudi</surname></string-name>
          .
          <article-title>Scalable, semi-supervised extraction of structured information from scientific literature</article-title>
          .
          <source>In Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications</source>
          , pages
          <fpage>11</fpage>
          –
          <lpage>20</lpage>
          ,
          <publisher-loc>Minneapolis, Minnesota</publisher-loc>
          ,
          <year>June 2019</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>Jacob</given-names> <surname>Devlin</surname></string-name>
          ,
          <string-name><given-names>Ming-Wei</given-names> <surname>Chang</surname></string-name>
          ,
          <string-name><given-names>Kenton</given-names> <surname>Lee</surname></string-name>
          , and
          <string-name><given-names>Kristina</given-names> <surname>Toutanova</surname></string-name>
          .
          <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>CoRR</source>
          , abs/1810.04805,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><given-names>Sonal</given-names> <surname>Gupta</surname></string-name>
          and
          <string-name><given-names>Christopher</given-names> <surname>Manning</surname></string-name>
          .
          <article-title>Improved pattern learning for bootstrapped entity extraction</article-title>
          .
          <source>In Proceedings of the Eighteenth Conference on Computational Natural Language Learning</source>
          , pages
          <fpage>98</fpage>
          –
          <lpage>108</lpage>
          ,
          <publisher-loc>Ann Arbor, Michigan</publisher-loc>
          ,
          <year>June 2014</year>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><given-names>Martin</given-names> <surname>Ester</surname></string-name>
          ,
          <string-name><given-names>Hans-Peter</given-names> <surname>Kriegel</surname></string-name>
          ,
          <string-name><given-names>Jörg</given-names> <surname>Sander</surname></string-name>
          , and
          <string-name><given-names>Xiaowei</given-names> <surname>Xu</surname></string-name>
          .
          <article-title>A density-based algorithm for discovering clusters in large spatial databases with noise</article-title>
          .
          <source>In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, KDD'96</source>
          , pages
          <fpage>226</fpage>
          –
          <lpage>231</lpage>
          .
          <publisher-name>AAAI Press</publisher-name>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name><given-names>F.</given-names> <surname>Heimerl</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Lohmann</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Lange</surname></string-name>
          , and
          <string-name><given-names>T.</given-names> <surname>Ertl</surname></string-name>
          .
          <article-title>Word cloud explorer: Text analytics based on word clouds</article-title>
          .
          <source>In 2014 47th Hawaii International Conference on System Sciences</source>
          , pages
          <fpage>1833</fpage>
          –
          <lpage>1842</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
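          10.
          <string-name><given-names>Constantine</given-names> <surname>Stephanidis</surname></string-name>,
          <string-name><given-names>Gavriel</given-names> <surname>Salvendy</surname></string-name>,
          <string-name><given-names>Margherita</given-names> <surname>Antona</surname></string-name>,
          <string-name><given-names>Jessie Y. C.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>Jianming</given-names> <surname>Dong</surname></string-name>,
          <string-name><given-names>Vincent G.</given-names> <surname>Duffy</surname></string-name>,
          <string-name><given-names>Xiaowen</given-names> <surname>Fang</surname></string-name>,
          <string-name><given-names>Cali</given-names> <surname>Fidopiastis</surname></string-name>,
          <string-name><given-names>Gino</given-names> <surname>Fragomeni</surname></string-name>,
          <string-name><given-names>Limin Paul</given-names> <surname>Fu</surname></string-name>,
          <string-name><given-names>Yinni</given-names> <surname>Guo</surname></string-name>,
          <string-name><given-names>Don</given-names> <surname>Harris</surname></string-name>,
          <string-name><given-names>Andri</given-names> <surname>Ioannou</surname></string-name>,
          <string-name><given-names>Kyeong ah (Kate)</given-names> <surname>Jeong</surname></string-name>,
          <string-name><given-names>Shin'ichi</given-names> <surname>Konomi</surname></string-name>,
          <string-name><given-names>Heidi</given-names> <surname>Krömker</surname></string-name>,
          <string-name><given-names>Masaaki</given-names> <surname>Kurosu</surname></string-name>,
          <string-name><given-names>James R.</given-names> <surname>Lewis</surname></string-name>,
          <string-name><given-names>Aaron</given-names> <surname>Marcus</surname></string-name>,
          <string-name><given-names>Gabriele</given-names> <surname>Meiselwitz</surname></string-name>,
          <string-name><given-names>Abbas</given-names> <surname>Moallem</surname></string-name>,
          <string-name><given-names>Hirohiko</given-names> <surname>Mori</surname></string-name>,
          <string-name><given-names>Fiona Fui-Hoon</given-names> <surname>Nah</surname></string-name>,
          <string-name><given-names>Stavroula</given-names> <surname>Ntoa</surname></string-name>,
          <string-name><given-names>Pei-Luen Patrick</given-names> <surname>Rau</surname></string-name>,
          <string-name><given-names>Dylan</given-names> <surname>Schmorrow</surname></string-name>,
          <string-name><given-names>Keng</given-names> <surname>Siau</surname></string-name>,
          <string-name><given-names>Norbert</given-names> <surname>Streitz</surname></string-name>,
          <string-name><given-names>Wentao</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Sakae</given-names> <surname>Yamamoto</surname></string-name>,
          <string-name><given-names>Panayiotis</given-names> <surname>Zaphiris</surname></string-name>, and
          <string-name><given-names>Jia</given-names> <surname>Zhou</surname></string-name>
          .
          <article-title>Seven HCI grand challenges</article-title>
          .
          <source>International Journal of Human–Computer Interaction</source>
          ,
          <volume>35</volume>
          (
          <issue>14</issue>
          ):
          <fpage>1229</fpage>
          –
          <lpage>1269</lpage>
          ,
          <year>2019</year>
          .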
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>