<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Revaluating Semantometrics from Computer Science Publications</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Christin Katharina Kreutz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Premtim Sahitaj</string-name>
          <email>sahitaj@uni-trier.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Schenkel</string-name>
          <email>schenkel@uni-trier.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Trier University</institution>
          ,
          <addr-line>54286 Trier</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The identification of important publications is the subject of many research projects. While the influence of citations in finding seminal papers has been analysed thoroughly, semantic features of citation networks are regarded with less vigour. In this paper, we revaluate the ideas of semantometrics presented by Herrmannova et al. [9,13] to learn patterns of features extracted from publication distances in their citation networks, aiming at distinguishing between seminal and survey papers in the area of computer science. For the evaluation, we present the SeminalSurveyDBLP dataset. By using different document content representations, incorporating semantic distance measures, and applying multiple machine learning algorithms for the classification, we achieved an accuracy of up to 0.8015 on our dataset. Earlier findings in this area suggest features extracted from references to be more suitable proxies, whereas we observed the contrasting importance of features describing citation information.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantometrics</kwd>
        <kwd>Classification</kwd>
        <kwd>Citation Network</kwd>
        <kwd>Natural Language Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        With the ever-growing amount of scientific publications, automatic methods
for finding influential or seminal works are indispensable. While the majority
of research tackles the identification of important works [
        <xref ref-type="bibr" rid="ref7 ref9">9,39,35,40,7,38</xref>
        ], the
influence of semantic features in this context has not been explored thoroughly.
Citation based classification or impact measures are dataset dependent [
        <xref ref-type="bibr" rid="ref20">30,20</xref>
        ].
They need to be handled with care due to self-citations [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ], varying citation
practices in different areas [
        <xref ref-type="bibr" rid="ref5">5,30,32</xref>
        ], diverging reasons for citing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the
nonexistence of citations of new papers [38] and uncited influences [
        <xref ref-type="bibr" rid="ref19 ref23 ref6">23,19,6</xref>
        ].
      </p>
      <p>
        Distinguishing between seminal publications which advance science and
popular survey papers might pose a problem as both types are typically cited often
[30] but reviews are over-represented amongst highly cited publications while
not contributing any new content [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Seminal papers are ones which are key to
a field while surveys review and compare multiple approaches and can be
comprehensible summaries of a domain. Influential members of both classes can be
distinguished from all other publications by observing their number of citations.
      </p>
      <p>
        Differentiating between seminal and review papers is not as simple. Therefore,
methods considering more factors than the number of citations and references
are desirable [
        <xref ref-type="bibr" rid="ref20">38,20</xref>
        ] as these values are no sufficient proxy for measuring
publication impact and scientific quality [
        <xref ref-type="bibr" rid="ref11">11,30</xref>
        ]. Preferably, an approach with the
potential to measure the contribution of a paper and how much it advanced its
field should be favoured.
      </p>
      <p>
        Herrmannova et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] assume the classification of a paper as seminal or
survey can be performed by observing semantometrics as a new metric for research
evaluation which uses differences in full texts of a citation network to determine
the contribution or value of a publication [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. They conducted their
experiments on a multidisciplinary dataset [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We want to assess the usefulness of
this approach and revaluate the provided ground truth on a dataset restricted
to a narrower area.
      </p>
      <p>
        Our contribution is two-fold: We introduce SeminalSurveyDBLP, a dataset
suitable for the task of classifying a publication with usage of its citations and
references as seminal or survey paper. Additionally, we analyse the approach
presented by Herrmannova et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] using different document representations as
well as numerous classification algorithms and evaluate the usage of single and
multiple features in the classification process on the new dataset in a different
and more homogeneous domain.
      </p>
      <p>
        The remaining content of this paper is organized as follows. Section 2 gives an
overview of the already established conceptual background and related research.
In Section 3, the SeminalSurveyDBLP dataset is presented. The succeeding
Section 4 introduces utilized document vector representations, distance measures
and classification algorithms to revaluate Herrmannova et al.’s approach [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] on
the new dataset. A detailed evaluation of our dataset on different feature modes
is given in Section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1 Background</title>
        <p>
          Extraction of mathematical descriptors from data is common in medical
image analysis [
          <xref ref-type="bibr" rid="ref15 ref8">8,15</xref>
          ] but for publication networks, it was initially introduced as
semantometrics in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] to assess research contribution.
        </p>
        <p>
          Herrmannova et al.’s approach [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] which uses these principles described in
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is the base of this revaluation. They were the first to work with citation
distance networks (CDN) that each centre around a publication P which is
connected to references X and citations Y to classify if P is a seminal or survey
paper. Semantic distances that describe the relationships between publications
were measured: The distances between titles and abstracts of X and Y are
contained in group A, distances between a publication and its references are
included in group B, and group C is composed of distances between P and its
ingoing citations. The semantic distances between entries of X can be found in
group D, symmetrically, distances between citing publications are stored in E.
        </p>
        <p>[Figure 1: Citation distance network centred on a publication P, connected to its references X (x0, x1, …, xn) and citations Y (y0, y1, …, yn); the semantic distances fall into groups A to E.]</p>
        <sec id="sec-2-1-2">
          <title>Related Work</title>
          <p>Relevant topics for our work besides semantometrics are text-based methods and
prediction of influence using citation networks.</p>
          <p>
            Several papers can be found in the field of language-based methods and
citation networks. Topical developments of documents with identification of
influential publications [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] or similarity of full texts of citations [39] are appropriate
proxies in determining publication impact. Prediction of citation counts using
text-similarity of extracted popular terms of publications [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] is also rooted in
this domain. Context-aware citation analysis on full texts and leading edge
impact assessment [
            <xref ref-type="bibr" rid="ref23">23</xref>
            ] and content similarity of abstracts of citing publications as
well as the cited papers [
            <xref ref-type="bibr" rid="ref26">37,40,26</xref>
            ] are able to identify important references of
publications.
          </p>
          <p>
            For prediction of influence based on the citation network of a publication, a
measure based on the desired audience or purpose of a paper [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ], the fluctuation
or stability of members in research teams (research endogamy) [
            <xref ref-type="bibr" rid="ref10 ref21">21,33,10</xref>
            ] as well
as research contribution of individual authors measured on established links
between communities [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] can be analysed.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 SeminalSurveyDBLP Dataset</title>
      <p>
        Half of the 1320 publications in our SeminalSurveyDBLP [31] dataset are seminal
while the other half of papers are surveys. All works are from the area of
computer science and adjacent fields as they are contained in dblp [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. For seminal
publications, entries published in conferences attributed as A* at the CORE
Conference Ranking [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] such as SIGIR, JCDL or SIGCOMM were collected as
publications often cited (and thus important) tend to appear in high-impact
venues [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. We assume papers published in a venue ranked as seminal by
CORE are seminal themselves, even if they have not yet accumulated large
numbers of citations; otherwise, they would not have been accepted at such a venue.
This might be a strong assumption, as not every paper from an A* conference
is seminal and seminal papers can also appear in other venues. Surveys were
extracted from ACM Computing Surveys, Synthesis Digital Library of Engineering
and Computer Science and IEEE Communications Surveys and Tutorials. These
venues are specialized in solely publishing reviews. Every paper used has at least
ten citations and references.
      </p>
      <p>For each of the papers, the citations and references were collected. Citation
information and abstracts from the AMiner dataset [36] were joined with dblp
data to make sure they were also from computer science or adjacent domains.
The join was based on matching DOIs of dblp papers with ones from AMiner or
paper title and author name matches where DOIs were not present. Full texts are
not included in the AMiner dataset. Citations and references not contained in
dblp were omitted. For every paper, its year of release is also included. Considered
publications for P , X and Y needed to have a length of at least ten terms in their
combined title and abstract. The dataset is engineered so that there are similar
numbers of citations and references for publications of the opposing classes. The
total number of unique publications contained in the dataset is 121,084. Table 1
shows statistics regarding the length of abstracts and number of citations for each
type of paper for an unstemmed version of the dataset. As the increased amount
of references is assumed to be a feature of survey papers compared to seminal
publications, the average and total number of references is higher and thus our
dataset is unbalanced in this aspect. Figure 2 shows the distribution of numbers
of references and citations for all papers of groups seminal and survey from
the dataset. Numbers of citations are distributed rather homogeneously between
the two classes, but for references, differences in the distributions can be seen.
While there are fewer publications with few references for surveys, a gap in
the number of references from 40 to 50 can be seen for seminal papers.</p>
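      <p>
        The DOI-based join with title/author fallback described above can be sketched as follows. This is a minimal illustration on toy dictionary records; the field names and the helper match_key are our own, not taken from the dataset construction code.
      </p>

```python
# Hypothetical sketch of the AMiner-dblp join: match on DOI when present,
# otherwise fall back to normalized title plus first author name.
def match_key(record):
    if record.get("doi"):
        return ("doi", record["doi"].lower())
    return ("ta", record["title"].strip().lower(), record["authors"][0].lower())

# toy stand-ins for the two bibliographic sources
dblp = [{"doi": "10.1/x", "title": "A Survey", "authors": ["A. Author"]},
        {"doi": None, "title": "Fast Index", "authors": ["B. Writer"]}]
aminer = [{"doi": "10.1/X", "title": "A survey", "authors": ["A. Author"], "abstract": "..."},
          {"doi": None, "title": "Fast Index ", "authors": ["B. Writer"], "abstract": "..."}]

index = {match_key(r): r for r in aminer}              # AMiner lookup table
joined = [(d, index.get(match_key(d))) for d in dblp]  # dblp record plus its match
```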
      <p>
        Of the 660 seminal papers, 24 have received a best paper award. Only
incorporating publications which received an award would dramatically decrease the
size of the dataset, as a similar distribution of references and citations for works
of both classes was a prerequisite in its construction.
      </p>
      <p>
        Herrmannova et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed the usage of citation distance networks to
extract patterns from text which can be represented by distance features for making
assumptions whether publications are seminal or survey. First, document
representations of P , its references X and its citations Y need to be generated.
In a next step, distances between publications for every group A to E can be
calculated. From each of these sets of distances, 12 features are then computed:
minimum, maximum, range, mean, sum of distances in a group, standard
deviation, variance, 25th percentile, 50th percentile, 75th percentile, skewness, and
kurtosis. Those 12 × 5 = 60 features are named by concatenating the feature with
the group it originates from, like minA or rangeE. On these features,
classification algorithms are able to predict the class seminal or survey a publication P
should be associated with.
      </p>
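      <p>
        The 12 statistics per group can be sketched as follows, assuming NumPy and SciPy. The helper name cdn_features and the toy distance values are illustrative; feature names follow the paper's naming convention where shown (e.g. minA, avgD, 50pD), while the abbreviations for skewness and kurtosis are our own.
      </p>

```python
import numpy as np
from scipy.stats import skew, kurtosis

def cdn_features(distances, group):
    """Compute the 12 statistical features over one distance group (A-E)."""
    d = np.asarray(distances, dtype=float)
    q25, q50, q75 = np.percentile(d, [25, 50, 75])
    stats = {
        "min": d.min(), "max": d.max(), "range": d.max() - d.min(),
        "avg": d.mean(), "sum": d.sum(), "std": d.std(), "var": d.var(),
        "25p": q25, "50p": q50, "75p": q75,
        "skew": skew(d), "kurt": kurtosis(d),
    }
    # concatenate feature name and group, e.g. "minA", "rangeE"
    return {name + group: value for name, value in stats.items()}

# toy distances for two of the five groups; the full setting yields 12 x 5 = 60 features
features = {}
for group, dists in {"A": [0.2, 0.5, 0.9], "E": [0.1, 0.4]}.items():
    features.update(cdn_features(dists, group))
```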
      <p>
        We extracted these features using different configurations, on which we
will conduct our evaluation. For document vector representations (V), we used
tf-idf to be able to compare our results with [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] directly as they also used this
text representation and doc2vec [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] as possible improvement. The tf-idf values
are computed on the 121,084 publications in the stemmed (S) or unstemmed
(U) SeminalSurveyDBLP dataset, abstracts of all citations and references were
included in the calculation of term frequencies. Weights for doc2vec (d2v) were
generated using the English unstemmed Wikipedia corpus from 20th January
2019. We refrained from using doc2vec on a stemmed corpus as this preprocessing
is no prerequisite for achieving good results [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The model was trained to consist
of 300 dimensions with usage of distributed memory as learning algorithm.
      </p>
      <p>
        As distance measures, cosine distance was applied as described in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ];
additionally, Jaccard distance was used as a second measure.
      </p>
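      <p>
        The two distance measures can be sketched as follows. The Jaccard variant on the sets of non-zero vector components is one plausible reading; the paper does not spell out the exact formulation.
      </p>

```python
import numpy as np

def cosine_distance(u, v):
    """1 minus the cosine similarity of two dense vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def jaccard_distance(u, v):
    """1 minus the Jaccard index over the sets of non-zero components."""
    a = {i for i, w in enumerate(u) if w != 0}
    b = {i for i, w in enumerate(v) if w != 0}
    return 1.0 - len(a & b) / len(a | b)
```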
      <p>
        Classification algorithms (C) used are logistic regression (LR), random forests
(RF), naïve Bayes (NB), support-vector machines (SVM), gradient boosting
(GB), k-nearest neighbours (KNN) and stochastic gradient descent (SGD). In
[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], SVM, LR, NB and decision trees were applied. We wanted to include those
classifiers except for decision trees, which we omitted as we incorporated random
forests.
      </p>
      <p>
        Table 2. Number of significant features per document vector representation (V) and distance measure, and their overlap with the 33 significant features of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]:
tf-idf, cosine, U: 27 features, 48.48% overlap;
tf-idf, cosine, S: 31 features, 54.55% overlap;
tf-idf, Jaccard, U: 30 features, 54.55% overlap;
tf-idf, Jaccard, S: 32 features, 54.55% overlap;
d2v, cosine, U: 50 features, 78.78% overlap;
d2v, Jaccard, U: 27 features, 39.39% overlap.
      </p>
      <p>
        To find influential features, for every combination of document vector
representation and distance measure, independent two-sided t-tests with p=0.1 were
conducted. Table 2 shows the number of significant features for the different
variants as well as the percentage of overlap when compared with the 33
significant features computed in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Usage of doc2vec resulted in the highest number
of significant features when combined with cosine distance. Here, the overlap
in significant features from SeminalSurveyDBLP and TrueImpactDataset [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is
also the highest. In general, the intersection of significant features between the
datasets is modest, which indicates differences between them.
      </p>
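      <p>
        The per-feature significance test can be sketched as follows, with synthetic feature samples standing in for the seminal and survey groups.
      </p>

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# synthetic per-feature samples: (seminal values, survey values)
feature_samples = {
    "sumE": (rng.normal(0.9, 0.1, 100), rng.normal(0.4, 0.1, 100)),  # separable
    "minA": (rng.normal(0.5, 0.1, 100), rng.normal(0.5, 0.1, 100)),  # pure noise
}
# independent two-sided t-test per feature; keep those with p < 0.1
significant = [
    name for name, (seminal, survey) in feature_samples.items()
    if ttest_ind(seminal, survey).pvalue < 0.1
]
```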
    </sec>
    <sec id="sec-4">
      <title>5 Evaluation</title>
      <p>
        Python 3.7 and scikit-learn [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] implementations of classifiers were used in the
evaluation process.
      </p>
      <p>Five feature sets were used in the classification process: the single significant
features, all significant features, all features, the 33 significant features of the
TrueImpactDataset, and the features which were significant in the TrueImpactDataset
as well as in the SeminalSurveyDBLP dataset. All accuracies (A) and F1 scores
(F1) are rounded to four decimal places. Values have been calculated using
ten-fold cross validation.</p>
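      <p>
        The evaluation protocol can be sketched as follows, assuming scikit-learn; the feature matrix is synthetic and merely stands in for the 60 CDN features.
      </p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# synthetic two-class feature matrix: 60 "seminal" and 60 "survey" samples
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, (60, 5)), rng.normal(1.5, 1.0, (60, 5))])
y = np.array([0] * 60 + [1] * 60)  # 0 = seminal, 1 = survey

# mean accuracy and F1 over ten-fold cross validation
clf = GradientBoostingClassifier()
acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
f1 = cross_val_score(clf, X, y, cv=10, scoring="f1").mean()
```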
      <sec id="sec-4-1">
        <title>5.1 Single Publications</title>
        <p>
          In a first step, classification solely on the vector representations of publications
P is tested, without consideration of citations and references. This approach is
highly successful in determining the class of P. While using the doc2vec
representation of the TrueImpactDataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], an accuracy of 0.6109 and F1 score of
0.5903 can be reached using logistic regression. Restricting the
TrueImpactDataset to publications from the area of Computer Science and Informatics,
to establish comparability with SeminalSurveyDBLP, results in 37 publications
with abstracts available. Classifying on this part of the dataset in doc2vec
document representation results in an accuracy of 0.7838 and F1 score of 0.7895
when applying gradient boosting. Meaningful tf-idf vectors could not be created
for the TrueImpactDataset as the abstracts of citations and references are not
contained in it.
        </p>
        <p>Using unstemmed tf-idf representations of the SeminalSurveyDBLP dataset
with gradient boosting results in an accuracy of 0.9288 with corresponding F1
score of 0.9258. For this task, tf-idf is a better proxy than doc2vec as highly
descriptive terms such as survey or review are encoded in single features in the
feature vector. Table 3 shows all results in detail.</p>
        <p>
          Remarkably, the dataset proposed by us is much more suitable for
classification of publications as survey or seminal than the TrueImpactDataset. This
might be owed to the creation process of the SeminalSurveyDBLP dataset, as
the papers chosen for class survey originate from journals
specialized on surveys. It is unsurprising that they oftentimes contain this or similar
keywords in their title or abstract. As there is no easy way other than
resorting to such means in order to create a sufficiently large database automatically,
this property of the dataset is nearly ineluctable. Conducting user studies to
find seminal and survey publications leads to the same problems which occurred
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]: multiple submitted publication titles could not be matched to the real
papers, the different research areas are not evenly represented in the data, and
human bias or misjudgement cannot be eliminated. Another possible
explanation for the good performance of our dataset might be the focus on one area: all
publications are from the wider field of computer science. Publications contained
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] originate from all different disciplines, where there might be a multiplicity
of ambiguous descriptions for surveys. This assumption is supported by the
considerably good performance of the TrueImpactDataset restricted to publications
solely attributed to come from computer science.
        </p>
        <p>
          Table 4. Best single feature per classifier, given as feature with accuracy/F1, for each document representation and distance measure.
Cosine distance:
LR: tf-idf U: minE .6152/.6482; tf-idf S: minE .6053/.6552; d2v U: 50pD .6697/.6813.
RF: tf-idf U: sumE .6561/.647; tf-idf S: sumE .6742/.6702; d2v U: sumE .6379/.6368.
NB: tf-idf U: rangeE .6083/.6551; tf-idf S: rangeE .5939/.6642; d2v U: 50pD .6705/.701.
SVM: tf-idf U: sumC .722/.6959; tf-idf S: sumC .7205/.6938; d2v U: sumE .7167/.6934.
GB: tf-idf U: sumE .7371/.7274; tf-idf S: sumE .7379/.7293; d2v U: sumE .7318/.7226.
KNN: tf-idf U: sumE .7159/.7017; tf-idf S: sumE .7258/.7127; d2v U: sumE .7045/.6986.
SGD: tf-idf U: sumE .6386/.7054; tf-idf S: sumE .6379/.7114; d2v U: avgD .6636/.6641.
Jaccard distance:
LR: tf-idf U: rangeE .6045/.5515; tf-idf S: rangeE .6174/.5781; d2v U: 25pD .6015/.6086.
RF: tf-idf U: sumE .6803/.6754; tf-idf S: sumE .675/.6667; d2v U: sumE .6205/.6225.
NB: tf-idf U: skewE .6/.581; tf-idf S: rangeE .603/.5313; d2v U: 25pD .5894/.6323.
SVM: tf-idf U: sumC .7311/.7019; tf-idf S: sumC .7288/.6992; d2v U: sumE .7091/.6863.
GB: tf-idf U: sumE .7432/.7362; tf-idf S: sumE .7409/.7328; d2v U: sumE .7121/.7139.
KNN: tf-idf U: sumE .7121/.6984; tf-idf S: sumE .7288/.7163; d2v U: sumE .6758/.6698.
SGD: tf-idf U: sumE .6402/.7151; tf-idf S: sumE .6371/.7023; d2v U: 25pD .5902/.4967.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2 Single Features</title>
        <p>
          In an additional experiment, the whole citation distance network of publications
is considered for the classification task based on a single feature. Each of the 60
features derived from the CDN of a publication is used on its own as input for
the machine learning algorithms. Classifying on these sole features led to an
accuracy of up to 0.7432 with an F1 score of 0.7362 when using gradient boosting
with tf-idf on unstemmed documents and Jaccard as distance measure, which is
+0.0535 compared to the top value from [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. For tf-idf text representations, the most descriptive
features were mostly the sum and range of groups C and E. For
doc2vec text representations sumE and percentiles from group D were found
to be good predictors. Features from group D can be interesting as they are
already present at the submission of the publication when a paper has yet to
gain citations. For all combinations of document representations and distance
measures, classifying on sumE led to the best results.
        </p>
        <p>
          Herrmannova et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] found features of groups B, C and D to operate well
for this task, while we observed contrasting behaviour. The difference in outcome
might be explained by the citation practice of high-impact
publications. In computer science, they behave differently from those in other areas,
as they cite from a narrow community [32]. While our dataset is rooted in this
one area, the TrueImpactDataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] spans multiple disciplines.
        </p>
        <p>Generally, features extracted from doc2vec performed worse than the tf-idf
ones. While gradient boosting is a good classifier for this task, logistic regression,
naïve Bayes and stochastic gradient descent did in general not perform well.
Table 4 shows the results of the single feature classification for every machine
learning algorithm and document representation in detail.</p>
        <p>Feature values are distributed very similarly for cosine and Jaccard distance
with the same document representation. Figure 3 shows the exemplary
probability distributions for sumE, the best feature for unstemmed tf-idf vectors combined
with Jaccard distance.</p>
      </sec>
      <sec id="sec-4-3">
        <title>5.3 All Features</title>
        <p>In the last experiments, the whole citation distance network of publications is
again used for classification based on multiple features. Here, of the 60 features
derived from the CDN of a publication, multiple features are used in combination
as input for the machine learning algorithms.</p>
        <p>Using cosine distance, tf-idf vectors perform worse than doc2vec document
representations when considering all significant features for the classification
task. When using Jaccard distance, tf-idf scores a higher accuracy than doc2vec.
On our dataset, Jaccard distance performs better with document representations
of stemmed articles, while no clear statement can be made for cosine distance.
Overall, gradient boosting and random forests are the best machine learning
algorithms in this context. The best accuracy of 0.8015 with corresponding F1
score of 0.8053 can be achieved using gradient boosting and cosine distance on
doc2vec text representation. This is +0.0583 in accuracy when compared to the
single feature variant.</p>
        <p>When using all features to classify seminal and survey publications, a similar
phenomenon with text representation and distance measure can be observed.
Here, Jaccard distance yields better results on vectors of stemmed text. Again,
the best algorithms are gradient boosting and random forests. With an accuracy
of 0.7955 and F1 score of 0.8, gradient boosting and cosine distance on doc2vec
vectors generate the best results, which are nearly as high as those achieved
when using only significant features in the classification.</p>
        <p>Table 5. Best classifier (C) with accuracy (A) and F1 per document representation (V) and distance measure:
All significant features, cosine: tf-idf U: RF .7583/.754; tf-idf S: GB .7568/.7533; d2v U: GB .8015/.8053.
All significant features, Jaccard: tf-idf U: GB .7644/.7624; tf-idf S: RF .7727/.7696; d2v U: RF .7447/.7502.
All features, cosine: tf-idf U: GB .7652/.763; tf-idf S: GB .7773/.7759; d2v U: GB .7955/.8.
All features, Jaccard: tf-idf U: GB .7652/.7626; tf-idf S: GB .7719/.7701; d2v U: GB .7341/.7398.</p>
        <p>In Table 5, results of comparisons of the different document representations
and distance measures can be found for the classification on all significant
features and the prediction based on all available features.</p>
        <p>
          When using only the 33 features in the classification process, which were
found to be significant in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] or using only the features which were significant
in the TrueImpactDataset [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] as well as in SeminalSurveyDBLP, the same
pattern occurred for each combination of document representation and distance
measure. While cosine distance is better suited in combination with doc2vec,
when using tf-idf Jaccard distance produces better results. Gradient boosting
performed best for the task of classification based on the 33 significant features.
The highest accuracy of 0.7856 with an F1 score of 0.791 was achieved when
using cosine distance and doc2vec vectors. Using only the intersection of
significant features in both datasets resulted in an accuracy of 0.7871 with F1 score of
0.7929, which is marginally better. It is reasonable to observe a higher accuracy
in the classification task when observing fewer but simultaneously more relevant
features for a dataset. Focusing on a subset of meaningful features can lead to
better generalization of the resulting model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and thus higher accuracy after
cross-validation compared to usage of all, partially irrelevant features.
        </p>
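          <p>
            The effect discussed above, classifying on all features versus a significant subset, can be sketched as follows on synthetic data where only some columns carry signal.
          </p>

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
# 10 informative columns plus 50 pure-noise columns, 60 samples per class
signal = np.vstack([rng.normal(0.0, 1.0, (60, 10)), rng.normal(1.0, 1.0, (60, 10))])
noise = rng.normal(0.0, 1.0, (120, 50))
X_all = np.hstack([signal, noise])   # stands in for all 60 features
X_sig = signal                       # stands in for the significant subset
y = np.array([0] * 60 + [1] * 60)

acc_all = cross_val_score(GradientBoostingClassifier(), X_all, y, cv=10).mean()
acc_sig = cross_val_score(GradientBoostingClassifier(), X_sig, y, cv=10).mean()
```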
        <p>
          Table 6 holds detailed values for the publication vectors and distance
measures for all features significant for Herrmannova et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and for the features
which were significant in their dataset and in the SeminalSurveyDBLP dataset.
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>5.4 Discussion</title>
        <p>
          Herrmannova et al.’s best-performing algorithm was naïve Bayes; for our dataset,
gradient boosting and random forests achieved the best results [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. While they
only evaluated tf-idf in combination with cosine distance, we showed the
usefulness of Jaccard distance in the context of single feature classification. In
multi feature scenarios, classification based on doc2vec document
representation achieved the highest scores.
        </p>
        <p>
          Our best accuracy of 0.9288 was scored when classifying on sole tf-idf
document vectors of publications. This outcome suggests our proposed dataset is
too artificially engineered, due to its creation process, for solving the
classification task on document vectors representing the publications alone. It could be
argued that this flaw diminishes when moving away from sole document vectors:
when using features extracted from a citation distance network, an accuracy of 0.8015 was
achieved with doc2vec and cosine distance on the combination of all significant features.
Single feature prediction led to an accuracy which was 0.056 worse than the
multi feature case, so the combination of (significant) features results in better
performance in the classification task at hand.
        </p>
        <p>
          Table 6. Best classifier with accuracy/F1 when classifying on all significant features (ASF) of [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and on the features significant both for us and in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]:
Their ASF, cosine: tf-idf U: GB .7561/.7519; tf-idf S: GB .7583/.7563; d2v U: GB .7856/.791.
Their ASF, Jaccard: tf-idf U: GB .7652/.763; tf-idf S: GB .7652/.7623; d2v U: GB .7402/.7461.
Shared significant features, cosine: tf-idf U: GB .7409/.7353; tf-idf S: GB .7485/.745; d2v U: GB .7871/.7929.
Shared significant features, Jaccard: tf-idf U: GB .7515/.7496; tf-idf S: GB .7606/.7588; d2v U: RF .7424/.7444.
        </p>
        <p>
          For Herrmannova et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], features in a CDN from groups B and D which
represent references as well as ones from group C which represent citations were
relevant in the classification. We observed features from groups E and C which
represent citations to be performing well for the SeminalSurveyDBLP dataset.
This suggests a higher influence of citations compared to references in
determining if a publication is seminal or survey. Multiple reasons could lead to this:
Most of the referenced papers of a publication are not read by the authors [34].
Another aspect might be that references should not be weighted the same, as
they do not contribute equally to a paper [
          <xref ref-type="bibr" rid="ref23">40,23</xref>
          ]. Different referencing practices
in computer science [32] could also contribute to this finding. In the case of
citations and the publication, the different influences might cancel each other out
as a seminal paper is probably referenced as its approach is used while surveys
might rather summarize a topic.
        </p>
        <p>The results of the single feature prediction using doc2vec are promising for
cases, in which a publication was not yet able to accumulate lots of citations. In
several cases, features from group D, which are already fully known at the time
of the release, were the most descriptive ones.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>We revaluated the identification of seminal and survey papers based on
semantometrics derived from our proposed SeminalSurveyDBLP dataset. We used tf-idf
and doc2vec as document vector representations, cosine and Jaccard distance, as
well as a variety of machine learning algorithms. Using multiple features
derived from semantic distances in the citation distance network of a publication
is highly useful for the classification into seminal and survey papers (accuracy 0.8015, F1
score 0.8053). Single-feature classification worked well on features from group E
regardless of the underlying document vector representation and distance
measure. Gradient boosting was repeatedly amongst the best performing machine
learning algorithms.</p>
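<p>The two distance measures named above can be sketched as follows; the vectors are invented toy examples, with cosine applied to dense doc2vec-style vectors and Jaccard to binarised term-presence vectors.</p>

```python
# Toy illustration of the two distance measures used above: cosine on
# dense doc2vec-style vectors, Jaccard on binarised term-presence vectors.
# The vectors are invented examples.
import numpy as np
from scipy.spatial.distance import cosine, jaccard

v1 = np.array([0.2, 0.7, 0.1, 0.0])
v2 = np.array([0.3, 0.6, 0.0, 0.1])
cos_dist = cosine(v1, v2)          # 1 - cosine similarity

b1 = (v1 > 0).astype(bool)         # which terms occur at all
b2 = (v2 > 0).astype(bool)
jac_dist = jaccard(b1, b2)         # 1 - |intersection| / |union|
print(round(cos_dist, 3), round(jac_dist, 3))
```

<p>Cosine uses the magnitudes of the vector entries, whereas Jaccard only considers which entries are non-zero, which is why the two measures can rank the same publication pairs differently.</p>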
      <p>
        Contrasting Herrmannova et al.’s findings, which suggest that features of groups B,
C, and D are good predictors in the classification task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], our experiments point
to features from groups E and C as best suited. This opposing observation
could be explained by the nature of the datasets used. While Herrmannova et
al.’s dataset is multidisciplinary, SeminalSurveyDBLP is focused on papers from
computer science. This area is known to behave differently from other domains,
as highly cited publications limit their references to a narrow community [32].
      </p>
      <p>
        A worthwhile extension of the evaluated approach could be the usage of
LDA [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] document vectors combined with a suitable (weak) metric such as earth
mover’s distance or the incorporation of more statistical features such as entropy
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Other distance metrics or text representations such as GloVe [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ] could also
contribute to better results. Automatic feature engineering with deep feature
synthesis [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] could produce more descriptive features which in turn might lead
to higher accuracy.
      </p>
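<p>As one example of the suggested statistical features, the entropy of a publication's distance distribution could be computed as sketched below; the sample distances and the choice of four equal-width bins are assumptions made for illustration only.</p>

```python
# Hedged sketch of an entropy feature over a publication's distance
# distribution; the distances and the four equal-width bins are assumed
# for illustration only.
import numpy as np
from scipy.stats import entropy

distances = np.array([0.12, 0.35, 0.36, 0.40, 0.41, 0.77, 0.80, 0.81])
hist, _ = np.histogram(distances, bins=4, range=(0.0, 1.0))
p = hist / hist.sum()              # empirical bin probabilities
h = entropy(p, base=2)             # Shannon entropy in bits
print(round(h, 3))
```

<p>A low entropy would indicate distances concentrated in few bins, a high entropy a more uniform spread; either could serve as a scalar feature alongside the existing distance statistics.</p>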
      <p>
        Another direction for further efforts could be hyperparameter tuning via grid
search or the incorporation of more sophisticated machine learning algorithms
such as GPT-2 [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] as a classifier.
      </p>
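<p>The proposed grid search could be sketched with scikit-learn [24], which the experiments already rely on; the parameter grid and the synthetic data below are illustrative assumptions, not the paper's setup.</p>

```python
# Illustrative grid search over a gradient boosting classifier with
# scikit-learn; the grid and the synthetic data are assumptions, not the
# paper's configuration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=120) > 0).astype(int)

grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

<p>GridSearchCV exhaustively cross-validates every combination in the grid and exposes the best one via best_params_, so the same pattern would transfer to the CDN feature matrices used here.</p>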
      <p>Lastly, a thorough automatic evaluation of our dataset, or even the creation of
a manually evaluated dataset with even more publications and full texts spanning
multiple research areas, would be desirable. As the SeminalSurveyDBLP dataset
contains information on years of publications, it can be used to analyse whether the
classification performance changes for papers which have had different periods
of time to accumulate citations. A new dataset which does not concentrate on
providing similar distributions of citations and references but instead purely
holds publications which received a best paper award as well as surveys could
describe another interesting bibliographic perspective to analyse.</p>
      <p>30. P. O. Seglen: Why the impact factor of journals should not be used for evaluating research. In: British Medical Journal 314(7079): 498-502 (1997).</p>
      <p>31. https://doi.org/10.5281/zenodo.3258164</p>
      <p>32. X. Shi, J. Leskovec, and D. A. McFarland: Citing for high impact. JCDL 2010: 49-58.</p>
      <p>33. T. H. P. Silva, M. M. Moro, A. P. C. da Silva, W. Meira Jr., and A. H. F. Laender: Community-based endogamy as an influence indicator. JCDL 2014: 67-76.</p>
      <p>34. M. V. Simkin and V. P. Roychowdhury: Read Before You Cite! In: Complex Systems 14(3) (2003).</p>
      <p>35. M. V. Simkin and V. P. Roychowdhury: Copied citations create renowned papers? In: Annals of Improbable Research 11(1): 24-27 (2005).</p>
      <p>36. J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su: ArnetMiner: Extraction and Mining of Academic Social Networks. KDD 2008: 990-998.</p>
      <p>37. M. Valenzuela, V. Ha, and O. Etzioni: Identifying meaningful citations. Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence (2015).</p>
      <p>38. A. D. Wade, K. Wang, Y. Sun, and A. Gulli: WSDM Cup 2016: Entity Ranking Challenge. WSDM 2016: 593-594.</p>
      <p>39. R. Whalen, Y. Huang, A. Sawant, B. Uzzi, and N. Contractor: Natural Language Processing, Article Content &amp; Bibliometrics: Predicting High Impact Science. ASCW'15 Workshop at Web Science 2015: 6-8.</p>
      <p>40. X. Zhu, P. D. Turney, D. Lemire, and A. Vellino: Measuring academic influence: Not all citations are equal. In: JASIST 66(2): 408-427 (2015).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. D. W. Aksnes:
          <article-title>Characteristics of highly cited papers</article-title>
          .
          <source>In: Research Evaluation</source>
          <volume>12</volume>
          (
          <issue>3</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>170</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          :
          <article-title>Latent Dirichlet Allocation</article-title>
          .
          <source>In: Journal of Machine Learning Research</source>
          <volume>3</volume>
          :
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Blum</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Langley</surname>
          </string-name>
          :
          <article-title>Selection of Relevant Features and Examples in Machine Learning</article-title>
          .
          <source>In: Artif. Intell</source>
          .
          <volume>97</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>245</fpage>
          -
          <lpage>271</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. http://www.core.edu.au/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>B.</given-names>
            <surname>Cronin</surname>
          </string-name>
          and
          <string-name>
            <surname>L. I.</surname>
          </string-name>
          <article-title>Meho: Using the h-index to rank influential information scientists</article-title>
          .
          <source>In: JASIST</source>
          <volume>57</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1275</fpage>
          -
          <lpage>1278</lpage>
          (
          <year>2006</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. E. Garfield: Can Citation Indexing Be Automated?
          <source>In: Essays of an Information Scientist</source>
          <volume>1</volume>
          :
          <fpage>84</fpage>
          -
          <lpage>90</lpage>
          (
          <year>1964</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>S.</given-names>
            <surname>Gerrish</surname>
          </string-name>
          and
          <string-name>
            <surname>D. M.</surname>
          </string-name>
          <article-title>Blei: A Language-based Approach to Measuring Scholarly Impact</article-title>
          .
          <source>ICML</source>
          <year>2010</year>
          :
          <fpage>375</fpage>
          -
          <lpage>382</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Gillies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Kinahan</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Hricak</surname>
          </string-name>
          :
          <article-title>Radiomics: Images Are More than Pictures, They Are Data</article-title>
          .
          <source>In: Radiology</source>
          <volume>278</volume>
          (
          <issue>2</issue>
          ):
          <fpage>563</fpage>
          -
          <lpage>577</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Herrmannova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Patton</surname>
          </string-name>
          :
          <article-title>Analyzing Citation-Distance Networks for Evaluating Publication Impact</article-title>
          .
          <source>LREC</source>
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>D.</given-names>
            <surname>Herrmannova</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          :
          <article-title>Semantometrics in Coauthorship Networks: Fulltextbased Approach for Analysing Patterns of Research Collaboration</article-title>
          .
          <source>In: D-Lib Mag.</source>
          <volume>21</volume>
          (
          <issue>11</issue>
          /12) (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>D.</given-names>
            <surname>Herrmannova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Patton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          , and
          <string-name>
            <surname>C. G.</surname>
          </string-name>
          <article-title>Stahl: Citations and readership are poor indicators of research excellence: Introducing TrueImpactDataset, a New Dataset for Validating Research Evaluation Metrics</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Scholarly Web Mining</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>J. M. Kanter</surname>
            and
            <given-names>K.</given-names>
          </string-name>
          <article-title>Veeramachaneni: Deep feature synthesis: Towards automating data science endeavors</article-title>
          .
          <source>DSAA</source>
          <year>2015</year>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Herrmannova: Towards Semantometrics: A New Semantic Similarity Based Measure for Assessing a Research Publication's Contribution</article-title>
          .
          <source>In: D-Lib Magazine</source>
          <volume>20</volume>
          (
          <issue>11</issue>
          /12) (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. F.-T. Krell:
          <article-title>The poverty of citation databases: data mining is crucial for fair metrical evaluation of research performance</article-title>
          .
          <source>In: BioScience</source>
          <volume>59</volume>
          (
          <issue>1</issue>
          ):
          <fpage>6</fpage>
          -
          <lpage>7</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. V.
          <string-name>
            <surname>Kumar</surname>
          </string-name>
          et al.:
          <article-title>Radiomics: the process and the challenges</article-title>
          .
          <source>In: Magnetic Resonance Imaging</source>
          <volume>30</volume>
          (
          <issue>9</issue>
          ):
          <fpage>1234</fpage>
          -
          <lpage>1248</lpage>
          (
          <year>2012</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          :
          <article-title>Distributed Representations of Sentences and Documents</article-title>
          .
          <source>ICML</source>
          <year>2014</year>
          :
          <fpage>1188</fpage>
          -
          <lpage>1196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>M.</given-names>
            <surname>Ley</surname>
          </string-name>
          :
          <article-title>DBLP - Some Lessons Learned</article-title>
          .
          <source>In: PVLDB</source>
          <volume>2</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1493</fpage>
          -
          <lpage>1500</lpage>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>A.</given-names>
            <surname>Livne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Adar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Teevan</surname>
          </string-name>
          , and
          <string-name>
            <surname>S.</surname>
          </string-name>
          <article-title>Dumais: Predicting citation counts using text and graph mining</article-title>
          .
          <source>iConference 2013 Workshop on Computational Scientometrics.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>M. H. MacRoberts</surname>
            and
            <given-names>B. R.</given-names>
          </string-name>
          <article-title>MacRoberts: Problems of citation analysis: A study of uncited and seldom-cited influences</article-title>
          .
          <source>In: JASIST</source>
          <volume>61</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>H. F. Moed</surname>
          </string-name>
          <article-title>: The impact-factors debate: the ISI's uses and limits</article-title>
          .
          <source>In: Nature</source>
          <volume>415</volume>
          :
          <fpage>731</fpage>
          -
          <lpage>732</lpage>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>S. L.</given-names>
            <surname>Montolio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dominguez-Sal</surname>
          </string-name>
          , and J.
          <string-name>
            <surname>-L.</surname>
          </string-name>
          Larriba-Pey:
          <article-title>Research endogamy as an indicator of conference quality</article-title>
          .
          <source>In: SIGMOD Record</source>
          <volume>42</volume>
          (
          <issue>2</issue>
          ):
          <fpage>11</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>R. M. Patton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Herrmannova</surname>
            ,
            <given-names>C. G.</given-names>
          </string-name>
          <string-name>
            <surname>Stahl</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          <string-name>
            <surname>Wells</surname>
            , and
            <given-names>T. E.</given-names>
          </string-name>
          <string-name>
            <surname>Potok</surname>
          </string-name>
          <article-title>: Audience Based View of Publication Impact</article-title>
          .
          <source>WOSP@JCDL</source>
          <year>2017</year>
          :
          <fpage>64</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>R. M. Patton</surname>
            ,
            <given-names>C. G.</given-names>
          </string-name>
          <string-name>
            <surname>Stahl</surname>
            , and
            <given-names>J. C.</given-names>
          </string-name>
          <article-title>Wells: Measuring Scientific Impact Beyond Citation Counts</article-title>
          . In:
          <string-name>
            <surname>D-Lib</surname>
            <given-names>Magazine</given-names>
          </string-name>
          22(
          <issue>9</issue>
          /10) (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Pedregosa</surname>
          </string-name>
          et al.:
          <article-title>Scikit-learn: Machine Learning in Python</article-title>
          .
          <source>JMLR</source>
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>J. Pennington</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Socher</surname>
            , and
            <given-names>C. D.</given-names>
          </string-name>
          <article-title>Manning: GloVe: Global Vectors for Word Representation</article-title>
          .
          <source>EMNLP</source>
          <year>2014</year>
          :
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>D.</given-names>
            <surname>Pride</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Knoth</surname>
          </string-name>
          :
          <article-title>Incidental or Influential? - Challenges in Automatically Detecting Citation Importance Using Publication Full Texts</article-title>
          .
          <source>TPDL</source>
          <year>2017</year>
          :
          <fpage>572</fpage>
          -
          <lpage>578</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          :
          <article-title>Language Models are Unsupervised Multitask Learners</article-title>
          . (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <given-names>L. M. A.</given-names>
            <surname>Rocha and M. M.</surname>
          </string-name>
          <article-title>Moro: Research Contribution as a Measure of Influence</article-title>
          .
          <source>In: Proceedings of the 2016 International Conference on Management of Data:</source>
          <fpage>2259</fpage>
          -
          <lpage>2260</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29. M.
          <article-title>Schreiber: The influence of self-citation corrections on Egghe's g index</article-title>
          .
          <source>In: Scientometrics</source>
          <volume>76</volume>
          (
          <issue>1</issue>
          ):
          <fpage>187</fpage>
          -
          <lpage>200</lpage>
          (
          <year>2008</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>