=Paper= {{Paper |id=Vol-2345/paper12 |storemode=property |title=Finding Temporal Trends of Scientific Concepts |pdfUrl=https://ceur-ws.org/Vol-2345/paper12.pdf |volume=Vol-2345 |authors=Michael Färber,Adam Jatowt |dblpUrl=https://dblp.org/rec/conf/ecir/FarberJ19 }} ==Finding Temporal Trends of Scientific Concepts== https://ceur-ws.org/Vol-2345/paper12.pdf

BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval

Finding Temporal Trends of Scientific Concepts

Michael Färber¹ and Adam Jatowt²
¹ Department of Computer Science, University of Freiburg, Germany
michael.faerber@cs.uni-freiburg.de
² Department of Social Informatics, Kyoto University, Japan
adam@kuis.db.kyoto-u.ac.jp

Abstract. Science evolves very rapidly, and researchers have studied the
evolution of coarse-grained research topics. However, to our knowledge,
no analysis of the temporal trends of fine-grained scientific concepts
has been performed based on papers’ full texts. For this paper, we
extract noun phrases as concepts from all computer science papers of
arXiv.org. We then identify positive and negative trends by means of
simple linear regression, Mann-Kendall test, and Theil-Sen estimate.
In our experiments, we obtain noteworthy findings about trends using
the Mann-Kendall test, while the Theil-Sen estimate and simple
linear regression lead to many non-scientific concepts. Our findings
are potentially relevant for both ordinary researchers and researchers
working in bibliometrics and scientometrics.

Keywords: trend detection, scholarly data, bibliometrics, time series

1 Motivation
Science evolves very rapidly, and the increasing number of researchers and
scientific publications worldwide in various disciplines reinforces this effect [1,2,3].
We argue that this phenomenon of scientific evolution is worth investigating
in more detail [4,5,6,7,8,9,10]. Specifically, ordinary researchers, as well as
researchers working in bibliometrics and scientometrics, might be interested in
the answers to the following questions:
Q1: Positive and negative trends: Which scientific concepts have become
common in recent years and which scientific concepts have become less
common? Which concepts have maintained their relevance over time?
Q2: Replacement of concepts: Which concepts have recently been supplanted
by other concepts?
In this work, we target those questions by extracting noun phrases from
a corpus of scientific papers, namely the contents of all computer science
papers of arXiv.org [11]. Then, positive and negative trends over time in the
set of extracted noun phrases are identified, contributing to answering Q1.
Furthermore, concepts that have replaced other concepts over time (reflected
in the usage statistics) are identified, contributing to answering Q2.


2 Trend Detection
To find positive and negative trends in time series data, a variety of algorithms
are available [12,13]. In the following, we focus on those we used in our analysis
(see Sec. 3).
We consider years as intervals of the time series. Furthermore, we use the
normalized relative frequency of the concepts as the basis for our calculations.
Formally, let $D = \{d_1, \ldots, d_{|D|}\}$ be our document corpus. Let $c_i$ be a concept
in the concept set $C$ occurring $n_{c_i}$ times in the corpus $D$. Let $D_{c_i}$ be the set
of documents in which $c_i$ occurs at least once. Then, the normalized relative
frequency of $c_i$ on the document level is defined as the ratio of documents
containing $c_i$ with respect to all documents: $rf_{c_i} = |D_{c_i}| / |D|$.
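The document-level relative frequency can be illustrated with a minimal Python sketch. It assumes that each paper is given as a pair of publication year and set of extracted concepts, and that the ratio is computed separately per year (which is how the yearly values $rf_{c_i,k}$ are read here). The function name `relative_frequencies` and the input layout are hypothetical and not taken from the authors' code.

```python
from collections import defaultdict


def relative_frequencies(docs):
    """Return {concept: {year: rf}} for an iterable of (year, concepts) pairs.

    rf is the share of documents of a year that mention the concept at
    least once (document-level frequency, not the raw occurrence count).
    """
    docs_per_year = defaultdict(int)
    docs_with_concept = defaultdict(lambda: defaultdict(int))
    for year, concepts in docs:
        docs_per_year[year] += 1
        # A set ensures each document counts at most once per concept.
        for concept in set(concepts):
            docs_with_concept[concept][year] += 1
    return {
        concept: {
            year: by_year.get(year, 0) / total
            for year, total in docs_per_year.items()
        }
        for concept, by_year in docs_with_concept.items()
    }


# Toy corpus (invented for illustration only).
toy_docs = [
    (2016, {"proof", "dataset"}),
    (2016, {"proof"}),
    (2017, {"dataset"}),
    (2017, {"dataset", "gpus"}),
]
rf = relative_frequencies(toy_docs)
```

Years in which a concept does not occur receive a frequency of 0, so every concept ends up with a time series of equal length.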
Simple Linear Regression. For this basic trend detection method, we
calculate for a given concept $c_i$ the difference between the relative frequencies
$rf_{c_i,k}$ and $rf_{c_i,l}$ of two time periods $k$ and $l$ (e.g., the years 2007 and 2017): $d = rf_{c_i,k} - rf_{c_i,l}$.
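A small Python sketch of this difference-based ranking follows. It assumes per-concept dictionaries that map years to relative frequencies; the helper names and the toy numbers are hypothetical. Note that the sign of $d$ depends on which period is passed as $k$: in the sketch, $k$ is the later year, so a positive $d$ means increased usage.

```python
def frequency_difference(series, k, l):
    """d = rf_k - rf_l for one concept; missing years count as 0."""
    return series.get(k, 0.0) - series.get(l, 0.0)


def rank_by_difference(rf, k, l, top=15):
    """Return (largest d first, smallest d first), each cut to `top` entries."""
    scored = {c: frequency_difference(s, k, l) for c, s in rf.items()}
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top], ranked[::-1][:top]


# Toy frequencies (invented for illustration only).
toy_rf = {
    "dataset": {2007: 0.10, 2017: 0.50},
    "theorem": {2007: 0.60, 2017: 0.30},
    "data": {2007: 0.40, 2017: 0.45},
}
positive, negative = rank_by_difference(toy_rf, k=2017, l=2007, top=1)
```

Because only two years enter the calculation, all intermediate years are ignored, which is one reason why this method is sensitive to single-year fluctuations.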
Mann-Kendall τ. To obtain statistically significant trends in time series
data, the Mann-Kendall test [14,13] is commonly used. This test can be applied
as a non-parametric test for monotonic trends. The Mann-Kendall statistic
indicates whether a trend exists statistically and whether it is
positive or negative. More formally, the null hypothesis of Kendall’s τ is that
there is no trend ($H_0: \tau = 0$). The alternative hypothesis is that there is a trend
($H_1: \tau \neq 0$). Given the data points $y_1, \ldots, y_n$ of a concept in temporal order, let $G_i$ be the
number of data points after $y_i$ that are greater than $y_i$. Let $L_i$ be the number of
data points after $y_i$ that are smaller than $y_i$. Then, Kendall’s τ coefficient is
calculated as
\[ \tau = \frac{2S}{n(n-1)} \]
where $S$ is defined as
\[ S = \sum_{i=1}^{n-1} (G_i - L_i) \]
and corresponds to the sum of the differences between $G_i$ and $L_i$ along the
time series. Since we are dealing with a sufficiently large number of time slots $n$,
we can assume a normal distribution for the test statistic $z$ [13,15] and write
\[ z = \frac{\tau}{\sqrt{2(2n+5) / (9n(n-1))}} \]
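The three quantities S, τ, and z can be computed directly from their definitions. The following Python sketch is a plain transcription of the formulas in this section; it applies neither a correction for ties nor a continuity correction, and the function name `mann_kendall` is a hypothetical choice.

```python
import math


def mann_kendall(y):
    """Return (S, tau, z) for a time series y given in temporal order."""
    n = len(y)
    if n < 2:
        raise ValueError("at least two data points are required")
    s = 0
    for i in range(n - 1):
        later = y[i + 1:]
        greater = sum(1 for v in later if v > y[i])  # G_i
        smaller = sum(1 for v in later if v < y[i])  # L_i
        s += greater - smaller
    tau = 2 * s / (n * (n - 1))
    z = tau / math.sqrt(2 * (2 * n + 5) / (9 * n * (n - 1)))
    return s, tau, z


# A strictly increasing series of 11 points yields the maximal statistic.
s_up, tau_up, z_up = mann_kendall(list(range(11)))
# A constant series shows no trend at all.
s_flat, tau_flat, z_flat = mann_kendall([3, 3, 3, 3])
```

The double loop is quadratic in the number of time slots, which is unproblematic for yearly series but would be a poor choice for long series.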
Theil-Sen Trend Line. The Theil-Sen estimate [16] can be used to estimate
the slope of a trend. It can be considered a non-parametric alternative to the
parametric ordinary least squares regression line. A Theil-Sen line models how
the median value changes linearly with time [13]. Formally, let
\[ B_n = \left\{ \frac{y_j - y_i}{x_j - x_i} : x_i \neq x_j,\ 1 \le i < j \le n \right\} \]

The Theil-Sen estimator $\hat{\beta}_n$ is then defined as the median of all slopes in $B_n$,
i.e., $\hat{\beta}_n = \mathrm{med}(B_n)$, with med standing for the median.
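The estimator can be sketched in a few lines of Python using only the standard library. The function name `theil_sen_slope` is hypothetical; the code follows the definition of $B_n$ above and makes no claim about how the authors implemented it.

```python
from itertools import combinations
from statistics import median


def theil_sen_slope(x, y):
    """Median of the pairwise slopes (y_j - y_i) / (x_j - x_i) with x_i != x_j."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    slopes = [
        (yj - yi) / (xj - xi)
        for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)
        if xi != xj
    ]
    if not slopes:
        raise ValueError("need at least two points with distinct x values")
    return median(slopes)


# The single outlier at the end does not change the estimated slope.
robust_slope = theil_sen_slope([0, 1, 2, 3, 4], [0, 1, 2, 3, 100])
```

The toy call illustrates the robustness that motivates the estimator: an ordinary least squares fit would be pulled strongly towards the outlier, whereas the median of the slopes is not.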


3 Trend Detection of Scientific Concepts
We now describe our approach for extracting concepts from scientific papers and
identifying positive and negative trends.¹

3.1 Data Set and Extracting Concepts
We use the arXiv CS data set [11] as our database. This data set contains the
plaintexts of all papers hosted at arXiv.org in the field of computer science from
the early beginnings of arXiv.org until December 2017. As corpora covering the
contents of research papers are rare, and the usage of arXiv.org has become
increasingly common in recent years, we believe that this data set is a valid
basis for concept evolution analyses. In total, the data set covers about 90,000
papers, resulting in about 16 million plaintext sentences. Note that in this data
set, formulas have been replaced by placeholders for easier text processing.
Given the papers’ fulltexts, we are interested in the concepts mentioned in
these papers. For this paper, we use case-insensitive noun phrases as concept
representations. Thus, we extract noun phrases from the papers’ fulltexts.
Our approach uses an extended rule set of [18] (in total, eight rules) on the
part-of-speech tags assigned by the Stanford parser.
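To give an impression of rule-based noun phrase extraction on part-of-speech tags, the following Python sketch implements one simple pattern: a maximal run of adjectives and nouns that ends in a noun. This single pattern is an assumption chosen for illustration; it is not the eight-rule set derived from [18], and the tags are supplied by hand (Penn Treebank style) instead of being produced by the Stanford parser.

```python
def extract_noun_phrases(tagged_tokens):
    """Extract lower-cased noun phrases from a list of (token, POS tag) pairs."""
    phrases = []
    current = []

    def flush():
        # Drop trailing adjectives so that every phrase ends in a noun.
        while current and not current[-1][1].startswith("NN"):
            current.pop()
        if current:
            phrases.append(" ".join(token.lower() for token, _ in current))
        current.clear()

    for token, tag in tagged_tokens:
        if tag.startswith("NN") or tag.startswith("JJ"):
            current.append((token, tag))
        else:
            flush()
    flush()
    return phrases


# Hand-tagged toy sentence (invented for illustration only).
tagged = [
    ("We", "PRP"), ("train", "VBP"), ("deep", "JJ"), ("neural", "JJ"),
    ("networks", "NNS"), ("on", "IN"), ("GPUs", "NNS"), (".", "."),
]
noun_phrases = extract_noun_phrases(tagged)
```

Lower-casing the phrases mirrors the case-insensitive concept representation described above.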
Given the 15.5M sentences from the initial data set, we obtained 10.67M
unique noun phrases (76.7M non-unique noun phrases). When ordering
the extracted noun phrases by absolute frequency, we can observe that
domain-specific concepts, which are the focus of this paper, occur particularly
in the mid range, while functional words and phrases common for writing papers
(e.g., “number,” “section,” “figure”) appear at the top.² We use this observation
to filter out irrelevant concepts during trend detection.

3.2 Sparsity and Thresholds
The set of extracted noun phrases still contains many irrelevant, non-scientific
noun phrases. Processing all of them would result in large databases, unnecessary
trend calculations, and degraded query performance of the indices. Thus, we
analyzed the effectiveness of several filtering methods (following a procedure
similar to that of [15]): (1) each concept needs to appear in at least 100 documents
within the whole corpus; (2) each concept needs to appear in at least three
different years; (3) the combination of methods 1 and 2. Table 1 shows the
results. We can observe that threshold 1 (i.e., each concept must appear in
at least 100 documents) considerably decreases the number of concepts.
However, threshold 2 (i.e., the number of years in which each concept needs to
appear) also seems to be very effective. Ultimately, we followed [15] and used
the combination of (1) and (2).
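The combined filter can be expressed as a short Python sketch. It assumes a hypothetical input that maps each concept to its number of documents per year; the function and parameter names are illustrative only.

```python
def filter_concepts(doc_counts_by_year, min_docs=100, min_years=3):
    """Keep concepts that occur in >= min_docs documents and >= min_years years."""
    kept = {}
    for concept, by_year in doc_counts_by_year.items():
        total_docs = sum(by_year.values())
        active_years = sum(1 for count in by_year.values() if count > 0)
        if total_docs >= min_docs and active_years >= min_years:
            kept[concept] = by_year
    return kept


# Toy counts (invented for illustration only).
toy_counts = {
    "training data": {2015: 40, 2016: 50, 2017: 60},  # passes both thresholds
    "rare phrase": {2015: 1, 2016: 2, 2017: 2},        # too few documents
    "burst": {2017: 150},                              # too few years
}
kept_concepts = filter_concepts(toy_counts)
```

Applying both thresholds before the trend calculations keeps the number of time series that need to be tested small.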
¹ See https://github.com/michaelfaerber/scholarly-trends for our source code
and [17] for a demonstration system based on our trend detection approach.
² The data set of extracted noun phrases is available at
https://github.com/michaelfaerber/scholarly-trends.


Table 1: Frequencies of noun phrases when applying various thresholds.

Threshold                                   Frequency   Percent of All Noun Phrases
< 100 occurrences                           9,769,907   99.51%
≥ 100 occurrences                              47,871   0.49%
occur in < 3 years                          8,945,295   91.11%
occur in ≥ 3 years                            872,600   8.89%
≥ 100 occurrences and occur in ≥ 3 years       47,759   0.49%
≥ 100 occurrences and occur in < 3 years          112   0.00014%

Table 2: Top 15 positively trending noun phrases by simple linear regression (2007-2017).

Noun phrase      # Docs
experiments      26150
dataset          15111
training         13708
table            36883
methods          29840
performance      39505
data             37760
features         17831
accuracy         16638
datasets         10002
models           23789
images           12065
model            40009
training data     8444
method           38301

Table 3: Top 15 negatively trending noun phrases by simple linear regression (2007-2017).

Noun phrase      # Docs
proof            31854
theorem          31378
definition       34978
fact             53356
course           15154
lemma            22527
elements         26122
condition        20904
case             66931
whose            39785
us               52096
length           26846
notation         21878
construction     18790
sense            24674

3.3 Applying Trend Detection Methods

Given the set of filtered noun phrase series, we apply the trend detection
algorithms outlined in Sec. 2, namely the simple linear regression, the
Mann-Kendall test, and the Theil-Sen estimate. We thereby obtained the
following findings:
Simple Linear Regression. We list the top 15 positively and negatively
trending noun phrases using the simple linear regression in Tables 2 and 3.³ Very
general concepts (e.g., “experiments,” “dataset,” and “training”) show a strong
increase in usage over time in our data set. This might be surprising, but can
be partially explained by the fact that the considered concepts are taken from the
years 2007 and 2017; rather general concepts persist over such a long time span.
Among the negatively trending concepts, it becomes apparent that concepts concerning
formalism and theories were much more important in 2007 than in 2017. Overall,
we can state that the simple linear regression already yields some relevant
findings about trends of concepts, although many abstract concepts are also
found to be trending.

³ The full list is available online at
https://github.com/michaelfaerber/scholarly-trends.

Table 4: Top 15 positively trending noun phrases by Mann-Kendall test.

Noun phrase                   Mann-Kendall z   # Docs
training data                 4.20             8444
high accuracy                 4.20             2182
regularizer                   4.20             1257
pixels                        4.20             5817
supplementary material        4.20             1409
liu                           4.20             1028
ground truth                  4.20             4349
synthetic images              4.20              278
gpus                          4.20             1389
hours                         4.20             2277
cloud                         4.20             1555
gradient                      4.20             6963
higher accuracy               4.20             1206
millions                      4.20             2963
machine learning techniques   4.20              776

Table 5: Top 15 negatively trending noun phrases by Mann-Kendall test.

Noun phrase                   Mann-Kendall z   # Docs
block length                  -4.20             937
bits                          -4.20            9297
transmitted codeword          -4.20             409
course                        -4.20           15154
capacity                      -4.20            8716
spin glasses                  -4.20             109
shannon                       -4.20            1912
bit                           -4.20            7945
message                       -4.20            8146
codes                         -4.20            6015
real numbers                  -4.20            2940
cdma systems                  -4.20              94
alphabet size                 -4.20             570
intermediate nodes            -4.05             725
codeword                      -4.05            3347

Mann-Kendall test. Tables 4 and 5 list the top 15 positively and
negatively trending noun phrases using the Mann-Kendall test (see Sec. 2).
Out of all 47,759 indexed noun phrases, 19,525 are found to have
a statistically significant (positive or negative) trend over the years (using
$|z| > 3$ for the Mann-Kendall test statistic, as in [15]). This number might appear high. However, note that
we have applied a strong filter for obviously irrelevant concepts (see Sec. 3.2).
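As a rough plausibility check of the $|z| > 3$ threshold, the following worked calculation plugs $n = 11$ into the formulas of Sec. 2. The value $n = 11$ is an assumption (one data point per year for 2007 to 2017, the range stated for the simple linear regression tables). The continuity-corrected variant in the last line is a common form of the Mann-Kendall statistic; it is not stated in this paper and is included only for comparison.

```latex
\begin{align*}
\sqrt{\frac{2(2n+5)}{9n(n-1)}} &= \sqrt{\frac{54}{990}} \approx 0.2335 \\
|z| > 3 \;&\Longleftrightarrow\; |\tau| > 3 \cdot 0.2335 \approx 0.70
  \;\Longleftrightarrow\; |S| > 0.70 \cdot \frac{n(n-1)}{2} \approx 38.5 \\
z_{\max} &= \frac{1}{0.2335} \approx 4.28 \qquad (\tau = 1,\ S = 55) \\
z_{\mathrm{corr}} &= \frac{S - 1}{\sqrt{n(n-1)(2n+5)/18}}
  = \frac{54}{\sqrt{165}} \approx 4.20 \quad (S = 55), \qquad
  \frac{52}{\sqrt{165}} \approx 4.05 \quad (S = 53)
\end{align*}
```

Under these assumptions, a concept is significant if at least 39 of the 55 year pairs (net of discordant pairs) point in the same direction, and a strictly monotonic series reaches the largest possible score. This would be consistent with many top-ranked phrases sharing one identical z value.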
Considering all trending noun phrases, we can recognize that the
Mann-Kendall test appears to be a reasonable trend detection method for our
case. We obtained noteworthy findings concerning the trending concepts:

– Among the positively trending concepts are many machine
learning-associated concepts, such as “gradient,” “deep neural networks,”
“convolutional neural networks,” “convolutional layer,” and “gpus.” The
metrics “ROC” and “AUC” (capitalized for better readability) are also
trending.
– “One-shot learning” and “data science” are identified as positively trending
and reflect the general orientation of computer science research in recent
years.
– Negatively trending noun phrases are particularly from the area of formal
(i.e., theoretical) computer science, such as the area of information theory.
Representative, negatively trending concepts are “block length,” “bits,”
“shannon,” and “message,” but also “decision problem” and “Turing
machine.” This indicates that arXiv was formerly dominated by theoretical
computer science, while nowadays machine learning is the predominant field.
Ultimately, this means that our database is, to some extent, unbalanced.
However, we believe that it is acceptable, as it reflects the general orientation
of computer science research overall over the years.

136
BIR 2019 Workshop on Bibliometric-enhanced Information Retrieval

Table 6: Top 15 positive trending noun Table 7: Top 15 negative trending noun
phrases by Theil-Sen slope. phrases according by Theil-Sen slope.
Noun Phrase Theil-Sen slope Total # Docs. Noun Phrase Theil-Sen slope # Docs.
experiments 2.55 26150 theorem -2.55 31378
performance 2.50 39505 proof -2.32 31854
table 2.44 36883 definition -1.76 34978
dataset 2.24 15111 lemma -1.65 22527
data 2.12 37760 fact -1.40 53356
methods 2.04 29840 course -1.38 15154
training 1.95 13708 notion -1.33 18398
features 1.89 17831 case -1.24 66931
accuracy 1.70 16638 length -1.21 26846
parameters 1.61 33835 elements -1.18 26122
datasets 1.56 10002 us -1.18 52096
method 1.52 38301 following theorem -1.16 14283
experiment 1.45 14829 construction -1.16 18790
work 1.41 55240 sense -1.13 24674
images 1.41 12065 condition -1.09 20904

– Also, the concepts ”DBScan” and ”LDA” have been used with increasing
frequency (statistically proven) and have remained on a stable level in recent
years. This may appear surprising, as those concepts are believed to have
been established for a long time and therefore might be expected to decrease.
– ”Quantum computing” and ”PageRank” have not been identified as trending
but show a strong increase and then a plateau when being visualized over
time. These concepts have a very volatile time series.
– The programming language ”Scala” was on the rise and then became stable,
while ”Python” is still increasing in recent years.

Theil-Sen Estimate. Tables 6 and 7 list the top positively
and negatively trending noun phrases according to the Theil-Sen estimate (see
Sec. 2 for its definition). We can observe that using Theil-Sen leads to many
very generic concepts in the lists of trending concepts, such as “experiments”
and “dataset.” Thus, this trend detection method can be used to generate an
upper ontology instead of showing trends of specific scientific concepts.

4 Related Work

Trend Detection Based on Scientific Papers. Various papers presenting
approaches and demonstration systems deal with the evolution of research topics
over time [4,5,6,7,8,9,10]. Apart from the visualization frameworks for paper
collections (e.g., via maps or hierarchical views) [4,5], the approach-describing
papers differ from our paper as follows: (1) the authors cluster topics and, thus,
rather consider high-level concepts [5,6,9]; (2) they do not apply content-based
methods, but methods based on graphs and networks, such as the citation
information [10,9,8] and the author information [7]; (3) they use purely the
papers’ titles or abstracts but no papers’ full texts [5,7,8], which makes it hard
to cover also long-tail concepts.


Information Extraction from Scientific Papers. In the past, several
kinds of information extraction techniques have been applied to scientific
papers, ranging from noun phrase extraction over entity linking to relation
extraction. Noteworthy in this context are also the SemEval tasks based on
scientific papers as data sets (see SemEval 2010 Task 5: “Automatic Keyphrase
Extraction from Scientific Articles” [19] and SemEval 2017 Task 10: “ScienceIE –
Extracting Keyphrases and Relations from Scientific Publications” [20]). While
the extraction of words and bigrams has already been applied to papers [7,8],
no paper dedicated to the analysis of scientific phrases in the papers’ full texts
has been presented to our knowledge.
Time-Series Analysis and Trend Detection. Among the most frequently
used methods for trend detection are the Mann-Kendall test and Sen’s slope [13].
Related to our work is the analysis of Daniel et al. [15] concerning trending
multi-word expressions in the Google Books data set. However, the domain
of books differs from our domain-specific use case. Furthermore, multi-word
expressions cover not only noun phrases, but also proverbs, greetings, etc.

5 Conclusion

In this paper, we have presented an analysis concerning positively and
negatively trending scientific concepts. We identified statistically trending
concepts included in all computer science papers of arXiv.org based on several
trend detection methods. We thereby found that the Mann-Kendall test performs
well for this task, while the simple regression and Theil-Sen estimate have
deficits, such as detecting rather general, non-scientific concepts. Based on the
trending concepts, we not only found that arXiv.org has a strong orientation
towards machine learning and deep learning, but we also identified rather
surprising usage patterns.
For the future, we plan to consider various scientific disciplines based on
the new arXiv data set presented in [21]. Moreover, we plan to perform a deeper
linguistic analysis of the arXiv papers’ content. For instance, extracting, storing,
and testing scientific hypotheses [22] might be a worthy task.

References

1. Bornmann, L., Mutz, R.: Growth rates of modern science: A bibliometric
analysis based on the number of publications and cited references. Journal of
the Association for Information Science and Technology 66(11) (2015) 2215–2222
2. Fortunato, S., Bergstrom, C.T., Börner, K., Evans, J.A., Helbing, D., Milojević,
S., Petersen, A.M., Radicchi, F., Sinatra, R., Uzzi, B., et al.: Science of science.
Science 359(6379) (2018) eaao0185
3. Ware, M., Mabe, M.: The STM Report: An overview of scientific and scholarly
journal publishing. (2015)
4. Zhang, C., Li, Z., Zhang, J.: A survey on visualization for scientific literature
topics. J. Visualization 21(2) (2018) 321–335


5. Wang, X., Cheng, Q., Lu, W.: Analyzing evolution of research topics with
NEViewer: a new method based on dynamic co-word networks. Scientometrics
101(2) (2014) 1253–1271
6. Salatino, A.A., Osborne, F., Motta, E.: AUGUR: Forecasting the Emergence of
New Research Topics. In: Proceedings of the 18th ACM/IEEE on Joint Conference
on Digital Libraries. JCDL’18 (2018) 303–312
7. Bolelli, L., Ertekin, S., Giles, C.L.: Topic and Trend Detection in Text Collections
Using Latent Dirichlet Allocation. In: Proceedings of the 31st European
Conference on Information Retrieval. (2009) 776–780
8. Jo, Y., Lagoze, C., Giles, C.L.: Detecting Research Topics via the Correlation
between Graphs and Texts. In: Proceedings of the 13th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining. KDD’07
(2007) 370–379
9. Small, H., Boyack, K.W., Klavans, R.: Identifying emerging topics in science and
technology. Research Policy 43(8) (2014) 1450–1467
10. Popescul, A., Flake, G.W., Lawrence, S., Ungar, L.H., Giles, C.L.: Clustering and
Identifying Temporal Trends in Document Databases. In: Proceedings of IEEE
Advances in Digital Libraries 2000. ADL’00 (2000) 173–182
11. Färber, M., Thiemann, A., Jatowt, A.: A High-Quality Gold Standard for
Citation-based Tasks. In: Proceedings of the Eleventh International Conference
on Language Resources and Evaluation. LREC’18 (2018)
12. Gray, K.L.: Comparison of Trend Detection Methods. PhD thesis, University of
Montana, Department of Mathematical Sciences, Missoula, MT, USA (2007)
13. Interstate Technology and Regulatory Council: Groundwater Statistics and
Monitoring Compliance. Statistical Tools for the Project Life Cycle. (2013)
14. Gilbert, R.O.: Statistical Methods for Environmental Pollution Monitoring. John
Wiley & Sons (1987)
15. Daniel, T., Last, M.: Exploring Long-Term Temporal Trends in the Use of
Multiword Expressions. In: Proceedings of the 12th Workshop on Multiword
Expressions. MWE@ACL’16 (2016)
16. Sen, P.K.: Estimates of the regression coefficient based on Kendall’s tau. Journal
of the American Statistical Association 63(324) (1968) 1379–1389
17. Färber, M., Nishioka, C., Jatowt, A.: ScholarSight: Visualizing Temporal Trends of
Scientific Concepts. In: Proceedings of the 19th ACM/IEEE on Joint Conference
on Digital Libraries. JCDL’19 (2019)
18. Zhao, S., Li, C., Ma, S., Ma, T., Ma, D.: Combining POS Tagging, Lucene Search
and Similarity Metrics for Entity Linking. In: Proceedings of the 14th International
Conference on Web Information Systems Engineering. WISE’13 (2013) 503–509
19. Kim, S.N., Medelyan, O., Kan, M., Baldwin, T.: SemEval-2010 Task 5: Automatic
Keyphrase Extraction from Scientific Articles. In: Proceedings of the 5th
International Workshop on Semantic Evaluation. SemEval@ACL’10 (2010) 21–26
20. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval
2017 Task 10: ScienceIE – Extracting Keyphrases and Relations from Scientific
Publications. In: Proceedings of the 11th International Workshop on Semantic
Evaluation. SemEval@ACL’17 (2017) 546–555
21. Saier, T., Färber, M.: Bibliometric-Enhanced arXiv: A Data Set for Paper-Based
and Citation-Based Tasks. In: Proceedings of the 8th International Workshop on
Bibliometric-enhanced Information Retrieval. BIR’19 (2019)
22. Baker, N.C., Hemminger, B.M.: Mining connections between chemicals, proteins,
and diseases extracted from Medline annotations. Journal of Biomedical
Informatics 43(4) (2010) 510–519
