=Paper= {{Paper |id=Vol-2345/paper12 |storemode=property |title=Finding Temporal Trends of Scientific Concepts |pdfUrl=https://ceur-ws.org/Vol-2345/paper12.pdf |volume=Vol-2345 |authors=Michael Färber,Adam Jatowt |dblpUrl=https://dblp.org/rec/conf/ecir/FarberJ19 }} ==Finding Temporal Trends of Scientific Concepts== https://ceur-ws.org/Vol-2345/paper12.pdf
                                         BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




 Finding Temporal Trends of Scientific Concepts

                        Michael Färber1 and Adam Jatowt2
        1
            Department of Computer Science, University of Freiburg, Germany
                        michael.faerber@cs.uni-freiburg.de
              2
                Department of Social Informatics, Kyoto University, Japan
                            adam@kuis.db.kyoto-u.ac.jp



       Abstract. Science evolves very rapidly, and researchers have studied the
       evolution of coarse-grained research topics. However, to our knowledge,
       no analysis of the temporal trends of fine-grained scientific concepts
       has been performed based on papers’ full texts. For this paper, we
       extract noun phrases as concepts from all computer science papers of
       arXiv.org. We then identify positive and negative trends by means of
       simple linear regression, Mann-Kendall test, and Theil-Sen estimate.
       In our experiments, we obtain noteworthy findings about trends using
       the Mann-Kendall test, while the Theil-Sen estimate and simple
       linear regression lead to many non-scientific concepts. Our findings
       are potentially relevant for both ordinary researchers and researchers
       working in bibliometrics and scientometrics.


Keywords: trend detection, scholarly data, bibliometrics, time series


1    Motivation
Science evolves very rapidly and the increasing number of researchers and
scientific publications worldwide in various disciplines reinforce this effect [1,2,3].
We argue that this phenomenon of scientific evolution is worth investigating
in more detail [4,5,6,7,8,9,10]. Specifically, ordinary researchers, as well as
researchers working in bibliometrics and scientometrics, might be interested in
the answers to the following questions:
Q1: Positive and negative trends: Which scientific concepts have become
    common in recent years and which scientific concepts have become less
    common? Which concepts have maintained their relevance over time?
Q2: Replacement of concepts: Which concepts have recently been supplanted
    by other concepts?
    In this work, we target those questions by extracting noun phrases from
a corpus of scientific papers, namely the contents of all computer science
papers of arXiv.org [11]. Then, positive and negative trends over time in the
set of extracted noun phrases are identified, contributing to answering Q1.
Furthermore, concepts that have replaced other concepts over time (reflected
in the usage statistics) are identified, contributing to answering Q2.




                                         132
                                            BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




2    Trend Detection
To find positive and negative trends in time series data, a variety of algorithms
are available [12,13]. In the following, we focus on those we used in our analysis
(see Sec. 3).
    We consider years as intervals of the time series. Furthermore, we use the
normalized relative frequency of the concepts as the basis for our calculations.
Formally, let D = {d1 , ..., d|D| } be our document corpus. Let ci be a concept
in the concept set C occurring nci times in the corpus D. Let Dci be the set
of documents in which ci occurs at least once. Then, the normalized relative
frequency of ci on the document level is defined as the ratio of documents
containing ci with respect to all documents: rfci = |Dci |/|D|.
    Simple Linear Regression. For this basic trend detection method, we
calculate for a given concept ci the difference between the relative frequencies
rfci ,k , rfci ,l of two time periods k, l (e.g., year 2007 and 2017): d = rfci ,k −rfci ,l .
    Mann-Kendall τ . To obtain statistically significant trends in time series
data, the Mann-Kendall test [14,13] is commonly used. This test can be applied
as a non-parametric test for monotonic trends. The Mann-Kendall statistic can
be used as indication whether a trend exists statistically and whether it is
positive or negative. More formally, the null hypothesis of Kendall’s τ is that
there is no trend (H0 : τ = 0). The alternative hypothesis is that there is a trend
(H1 : τ 6= 0). Given that we have the concepts in a temporal order, let Gi be the
number of data points after yi that are greater than yi . Let Li be the number of
data points after yi that are smaller than yi . Then, the Kendall’s τ coefficient is
calculated as
                                      τ = 2S/n(n − 1)
and S is thereby defined as
                                         n−1
                                         X
                                    S=         (Gi − Li )
                                         i=1

and corresponds to the the sum of the differences between Gi and Li along the
time series. Since we are dealing with a sufficiently large number of time slots n,
we can assume normal distribution for the test statistic z [13,15] and write
                                           τ
                           z=p
                                  2(2n + 5)/9n(n − 1)
   Theil-Sen Trend Line. The Theil-Sen estimate [16] can be used to estimate
the slope of a trend. It can be considered a non-parametric alternative to the
parametric ordinary least squares regression line. A Theil-Sen line models how
the median value changes linearly with time [13]. Formally, let
                               yj − yi
                      Bn = {           : xi 6= xj , 1 ≤ i < j ≤ n}
                               xj − xi

The Theil-Sen estimator βˆn is then defined as the median of all slopes in Bn ,
i.e., βˆn = med(Bn) with med standing for the median.




                                            133
                                       BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




3     Trend Detection of Scientific Concepts
We now describe our approach for extracting concepts from scientific papers and
identifying positive and negative trends.1

3.1   Data Set and Extracting Concepts
We use the arXiv CS data set [11] as our database. This data set contains the
plaintexts of all papers hosted at arXiv.org in the field of computer science from
the early beginnings of arXiv.org until December 2017. As corpora covering the
contents of research papers are rare, and the usage of arXiv.org has become
increasingly common in recent years, we believe that this data set is a valid
basis for concept evolution analyses. In total, the data set covers about 90,000
papers, resulting in about 16 million plaintext sentences. Note that in this data
set, formulas have been replaced by placeholders for easier text processing.
    Given the papers’ fulltexts, we are interested in the concepts mentioned in
these papers. For this paper, we use case-insensitive noun phrases as concept
representations. Thus, we extract noun phrases from the papers’ fulltexts.
Our approach uses an extended rule set of [18] (in total, eight rules) on the
part-of-speech tags assigned by the Stanford parser.
    Given the 15.5M sentences from the initial data set, we obtained 10.67M
unique noun phrases (76.7M non-unique noun phrases). When ordering
the extracted noun phrases by absolute frequency, we can observe that
domain-specific concepts, which are in the focus of this paper, occur particularly
in the mid range, while functional words and phrases common for writing papers
(e.g., ”number,” ”section,” ”figure”) appear at the top.2 We use this observation
to filter out irrelevant concepts during trend detection.

3.2   Sparsity and Thresholds
The set of extracted noun phrases still contains many irrelevant, non-scientific
noun phrases. Processing all of them would result in large databases, unnecessary
trend calculations, and declined querying performance of indices. Thus, we
analyzed the effectiveness of several filtering methods (following the similar
procedure of [15]): (1) each concept needs to appear in at least 100 documents
within the whole corpus; (2) each concept needs to appear in at least three
different years; (3) the combination of methods 1 and 2. Table 1 shows the
results. We can observe that using threshold 1 (i.e., each concept must appear in
at least 100 documents) allows a considerable decrease in the number of concepts.
However, threshold 2 (i.e., the number of years in which each concept needs to
appear) also seems to be very effective. Ultimately, we followed [15] and used
the combination of (1) and (2).
1
  See https://github.com/michaelfaerber/scholarly-trends for our source code
  and [17] for a demonstration system based on our trend detection approach.
2
  The data set of extracted noun phrases is available at https://github.com/
  michaelfaerber/scholarly-trends.




                                       134
                                       BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




      Table 1: Frequencies of noun phrases when applying various thresholds.
Threshold                                  Frequency Percent of All Noun Phrases
< 100 occurrences                            9,769,907                         99.51%
≥ 100 occurrences                               47,871                          0.49%
occur in < 3 years                           8,945,295                         91.11%
occur in ≥ 3 years                             872,600                          8.89%
> 100 occurrences and occur in ≥ 3 years        47,759                          0.49%
> 100 occurrences and occur in < 3 years           112                       0.00014%



Table 2: Top 15 positively trending Table 3: Top 15 negativley trending
noun phrases by simple linear regression noun phrases by simple linear
(2007-2017).                             regression (2007-2017).
            Noun phrase     # Docs                       Noun phrase    # Docs
            experiments      26150                       proof           31854
            dataset          15111                       theorem         31378
            training         13708                       definition      34978
            table            36883                       fact            53356
            methods          29840                       course          15154
            performance      39505                       lemma           22527
            data             37760                       elements        26122
            features         17831                       condition       20904
            accuracy         16638                       case            66931
            datasets         10002                       whose           39785
            models           23789                       us              52096
            images           12065                       length          26846
            model            40009                       notation        21878
            training data     8444                       construction    18790
            method           38301                       sense           24674




3.3     Applying Trend Detection Methods

Given the set of filtered noun phrase series, we apply the trend detection
algorithms outlined in Sec. 2, namely the simple linear regression, the
Mann-Kendall test, and the Theil-Sen estimate. We thereby obtained the
following findings:
    Simple Linear Regression. We list the top 15 positively and negatively
trending noun phrases using the simple linear regression in Table 2 and 3.3 Very
general concepts (e.g., ”experiments,” ”dataset,” and ”training”) show a strong
increase in the usage over time in our data set. This might be surprising, but can
be partially explained by the fact that our considered concepts are from 2007
and 2017; rather general concepts remain over such a long time span. Given
the negatively trending concepts, it becomes apparent that concepts concerning
formalism and theories were much more important in 2007 than in 2017. Overall,
we can state that the simple linear regression leads partially to already relevant
3
    The full list is available online at https://github.com/michaelfaerber/
    scholarly-trends.




                                       135
                                           BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




Table 4: Top 15 positively trending noun Table 5: Top 15 negatively trending
phrases by Mann-Kendall test.            noun phrases by Mann-Kendall test.
Noun phrase                 Mann-     # Docs      Noun phrase          Mann-     # Docs
                            Kendall z                                  Kendall z
training data               4.20       8444       block length         -4.20        937
high accuracy               4.20       2182       bits                 -4.20       9297
regularizer                 4.20       1257       transmitted codeword -4.20        409
pixels                      4.20       5817       course               -4.20      15154
supplementary material      4.20       1409       capacity             -4.20       8716
liu                         4.20       1028       spin glasses         -4.20        109
ground truth                4.20       4349       shannon              -4.20       1912
synthetic images            4.20        278       bit                  -4.20       7945
gpus                        4.20       1389       message              -4.20       8146
hours                       4.20       2277       codes                -4.20       6015
cloud                       4.20       1555       real numbers         -4.20       2940
gradient                    4.20       6963       cdma systems         -4.20         94
higher accuracy             4.20       1206       alphabet size        -4.20        570
millions                    4.20       2963       intermediate nodes   -4.05        725
machine learning techniques 4.20        776       codeword             -4.05       3347




findings about trends of concepts, although many abstract concepts are also
found to be trending.
    Mann-Kendall test. Table 4 and Table 5 list the top 15 positively and
negatively trending noun phrases using the Mann-Kendall test (see Sec. 2).
Out of all 47,759 indexed noun phrases, 19,525 of them are found to have
a statistically significant (positive or negative) trend over the years (using
Kendall’s τ |z| > 3 as in [15]). This value might appear high. However, note that
we have applied a strong filter for obviously irrelevant concepts (see Sec. 3.2).
    Considering all trending noun phrases, we can recognize that the
Mann-Kendall test appears to be a reasonable trend detection method for our
case. We obtained noteworthy findings concerning the trending concepts:

 – Among the positively trending concepts are many machine
   learning-associated concepts, such as ”gradient,” ”deep neural networks,”
   ”convolutional neural networks,” ”convolutional layer,” and ”gpus.” The
   metrics ”ROC” and ”AUC” (capitalized for better readability) are also
   trending.
 – ”One-shot learning” and ”data science” are identified as positively trending
   and render the general orientation of computer science research in recent
   years.
 – Negatively trending noun phrases are particularly from the area of formal
   (i.e., theoretical) computer science, such as the area of information theory.
   Representative, negatively trending concepts are ”block length,” ”bits,”
   ”shannon,” and ”message,” but also ”decision problem” and ”Turing
   machine.” It is quite obvious that arXiv was predominated by theoretical
   computer science, while nowadays machine learning is the predominant field.
   Ultimately, this means that our database is, to some extent, unbalanced.
   However, we believe that it is acceptable, as it reflects the general orientation
   of computer science research overall over the years.




                                           136
                                            BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




Table 6: Top 15 positive trending noun Table 7: Top 15 negative trending noun
phrases by Theil-Sen slope.            phrases according by Theil-Sen slope.
Noun Phrase Theil-Sen slope Total # Docs.         Noun Phrase         Theil-Sen slope # Docs.
experiments            2.55        26150          theorem                      -2.55   31378
performance            2.50        39505          proof                        -2.32   31854
table                  2.44        36883          definition                   -1.76   34978
dataset                2.24        15111          lemma                        -1.65   22527
data                   2.12        37760          fact                         -1.40   53356
methods                2.04        29840          course                       -1.38   15154
training               1.95        13708          notion                       -1.33   18398
features               1.89        17831          case                         -1.24   66931
accuracy               1.70        16638          length                       -1.21   26846
parameters             1.61        33835          elements                     -1.18   26122
datasets               1.56        10002          us                           -1.18   52096
method                 1.52        38301          following theorem            -1.16   14283
experiment             1.45        14829          construction                 -1.16   18790
work                   1.41        55240          sense                        -1.13   24674
images                 1.41        12065          condition                    -1.09   20904




 – Also, the concepts ”DBScan” and ”LDA” have been used with increasing
   frequency (statistically proven) and have remained on a stable level in recent
   years. This may appear surprising, as those concepts are believed to have
   been established for a long time and therefore might be expected to decrease.
 – ”Quantum computing” and ”PageRank” have not been identified as trending
   but show a strong increase and then a plateau when being visualized over
   time. These concepts have a very volatile time series.
 – The programming language ”Scala” was on the rise and then became stable,
   while ”Python” is still increasing in recent years.

   Theil-Sen Estimate. Table 6 and Table 7, respectively, list the top positive
and negative trending noun phrases according to the Theil-Sen’s estimate (see
Sec. 2 for its definition). We can observe that using Theil-Sen leads to many
very generic concepts in the lists of trending concepts, such as ”experiments”
and ”dataset.” Thus, this trend detection method can be used to generate an
upper ontology instead of showing trends of specific scientific concepts.


4    Related Work

Trend Detection Based on Scientific Papers. Various papers presenting
approaches and demonstration systems deal with the evolution of research topics
over time [4,5,6,7,8,9,10]. Apart from the visualization frameworks for paper
collections (e.g., via maps or hierarchical views) [4,5], the approach-describing
papers differ from our paper as follows: (1) the authors cluster topics and, thus,
rather consider high-level concepts [5,6,9]; (2) they do not apply content-based
methods, but methods based on graphs and networks, such as the citation
information [10,9,8] and the author information [7]; (3) they use purely the
papers’ titles or abstracts but no papers’ full texts [5,7,8], which makes it hard
to cover also long-tail concepts.




                                            137
                                         BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




    Information Extraction from Scientific Papers. In the past, several
kinds of information extraction techniques have been applied to scientific
papers, ranging from noun phrase extraction over entity linking to relation
extraction. Noteworthy in this context are also the SemEval tasks based on
scientific papers as data sets (see SemEval 2010 Task 5 ‘Automatic Keyphrase
Extraction from Scientific Articles” [19] and SemEval 2017 Task 10: “ScienceIE –
Extracting Keyphrases and Relations from Scientific Publications” [20]). While
the extraction of words and bigrams has already been applied to papers [7,8],
no paper dedicated to the analysis of scientific phrases in the papers’ full texts
has been presented to our knowledge.
    Time-Series Analysis and Trend Detection. Among the most frequently
used methods for trend detection are the Mann-Kendall test and Sen’s slope [13].
Related to our work is the analysis of Daniel et al. [15] concerning trending
multi-word expressions in the Google Books data set. However, the domain
of books differs from our domain-specific use case. Furthermore, multi-word
expressions cover not only noun phrases, but also proverbs, greetings, etc.


5   Conclusion

In this paper, we have presented an analysis concerning positively and
negatively trending scientific concepts. We identified statistically trending
concepts included in all computer science papers of arXiv.org based on several
trend detection methods. We thereby found that the Mann-Kendall test performs
well for this task, while the simple regression and Theil-Sen estimate have
deficits, such as detecting rather general, non-scientific concepts. Based on the
trending concepts, we not only found that arXiv.org has a strong orientation
towards machine learning and deep learning, but we also identified rather
surprising usage patterns.
    For the future, we plan to consider various scientific disciplines based on
the new arXiv data set presented in [21]. Moreover, we plan to perform a deeper
linguistic analysis of the arXiv papers’ content. For instance, extracting, storing,
and testing scientific hypotheses [22] might be a worthy task.


References

 1. Bornmann, L., Mutz, R.: Growth rates of modern science: A bibliometric
    analysis based on the number of publications and cited references. Journal of
    the Association for Information Science and Technology 66(11) (2015) 2215–2222
 2. Fortunato, S., Bergstrom, C.T., Börner, K., Evans, J.A., Helbing, D., Milojević,
    S., Petersen, A.M., Radicchi, F., Sinatra, R., Uzzi, B., et al.: Science of science.
    Science 359(6379) (2018) eaao0185
 3. Ware, M., Mabe, M.: The STM Report: An overview of scientific and scholarly
    journal publishing. (2015)
 4. Zhang, C., Li, Z., Zhang, J.: A survey on visualization for scientific literature
    topics. J. Visualization 21(2) (2018) 321–335




                                         138
                                          BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval




 5. Wang, X., Cheng, Q., Lu, W.: Analyzing evolution of research topics with
    NEViewer: a new method based on dynamic co-word networks. Scientometrics
    101(2) (2014) 1253–1271
 6. Salatino, A.A., Osborne, F., Motta, E.: AUGUR: Forecasting the Emergence of
    New Research Topics. In: Proceedings of the 18th ACM/IEEE on Joint Conference
    on Digital Libraries. JCDL’18 (2018) 303–312
 7. Bolelli, L., Ertekin, S., Giles, C.L.: Topic and Trend Detection in Text Collections
    Using Latent Dirichlet Allocation.         In: Proceedings of the 31th European
    Conference on Information Retrieval. (2009) 776–780
 8. Jo, Y., Lagoze, C., Giles, C.L.: Detecting Research Topics via the Correlation
    between Graphs and Texts.           In: Proceedings of the 13th ACM SIGKDD
    International Conference on Knowledge Discovery and Data Mining. KDD’07
    (2007) 370–379
 9. Small, H., Boyack, K.W., Klavans, R.: Identifying emerging topics in science and
    technology. Research Policy 43(8) (2014) 1450–1467
10. Popescul, A., Flake, G.W., Lawrence, S., Ungar, L.H., Giles, C.L.: Clustering and
    Identifying Temporal Trends in Document Databases. In: Proceedings of IEEE
    Advances in Digital Libraries 2000. ADL’00 (2000) 173–182
11. Färber, M., Thiemann, A., Jatowt, A.: A High-Quality Gold Standard for
    Citation-based Tasks. In: Proceedings of the Eleventh International Conference
    on Language Resources and Evaluation. LREC’18 (2018)
12. Gray, K.L.: Comparison of Trend Detection Methods. PhD thesis, University of
    Montana, Department of Mathematical Sciences, Missoula, MT, USA (2007)
13. Interstate Technology and Regulatory Council: Groundwater Statistics and
    Monitoring Compliance. Statistical Tools for the Project Life Cycle. (2013)
14. Gilbert, R.O.: Statistical Methods for Environmental Pollution Monitoring. John
    Wiley & Sons (1987)
15. Daniel, T., Last, M.: Exploring Long-Term Temporal Trends in the Use of
    Multiword Expressions. In: Proceedings of the 12th Workshop on Multiword
    Expressions. MWE@ACL’16 (2016)
16. Sen, P.K.: Estimates of the regression coefficient based on Kendall’s tau. Journal
    of the American statistical association 63(324) (1968) 1379–1389
17. Färber, M., Nishioka, C., Jatowt, A.: ScholarSight: Visualizing Temporal Trends of
    Scientific Concepts. In: Proceedings of the 19th ACM/IEEE on Joint Conference
    on Digital Libraries. JCDL’19 (2019)
18. Zhao, S., Li, C., Ma, S., Ma, T., Ma, D.: Combining POS Tagging, Lucene Search
    and Similarity Metrics for Entity Linking. In: Proceedings of the 14th International
    Conference on Web Information Systems Engineering. WISE’13 (2013) 503–509
19. Kim, S.N., Medelyan, O., Kan, M., Baldwin, T.: SemEval-2010 Task 5: Automatic
    Keyphrase Extraction from Scientific Articles. In: Proceedings of the 5th
    International Workshop on Semantic Evaluation. SemEval@ACL’10 (2010) 21–26
20. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval
    2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific
    Publications. In: Proceedings of the 11th International Workshop on Semantic
    Evaluation. SemEval@ACL’17 (2017) 546–555
21. Saier, T., Färber, M.: Bibliometric-Enhanced arXiv: A Data Set for Paper-Based
    and Citation-Based Tasks. In: Proceedings of the 8th International Workshop on
    Bibliometric-enhanced Information Retrieval. BIR’19 (2019)
22. Baker, N.C., Hemminger, B.M.: Mining connections between chemicals, proteins,
    and diseases extracted from Medline annotations.            Journal of Biomedical
    Informatics 43(4) (2010) 510–519




                                         139