<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Bouche, T., Teschke, O., Wojciechowski, K.: Time lag in mathematical references.
Eur. Math. Soc. Newsl.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-642-31374-5</article-id>
      <title-group>
        <article-title>Leveraging Mathematical Subject Information to Enhance Bibliometric Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Koutraki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Olaf Teschke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harald Sack</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabian Muller</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Bannister</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FIZ Karlsruhe</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Institute for Information Infrastructure</institution>
          ,
          <addr-line>Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
      <volume>86</volume>
      <issue>54</issue>
      <fpage>12</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>The field of mathematics is known to be especially challenging from a bibliometric point of view. Its bibliographic metrics are especially sensitive to distortions and are heavily influenced by the subject and its popularity. Therefore, quantitative methods are prone to misrepresentations, and need to take subject information into account. In this paper we investigate how the mathematical bibliography of the abstracting and reviewing service Zentralblatt MATH (zbMATH) could further benefit from the inclusion of mathematical subject information MSC2010. Furthermore, the mappings of MSC2010 to Linked Open Data resources have been upgraded and extended to also benefit from semantic information provided by DBpedia.</p>
      </abstract>
      <kwd-group>
        <kwd>scientometrics</kwd>
        <kwd>bibliometrics</kwd>
        <kwd>linked data</kwd>
        <kwd>mathematics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The field of mathematics is known to be especially challenging from a
bibliometric point of view. The application of general bibliometric methods has led to
generally unsatisfactory outcomes, resulting in a broad rejection of such
statistical measures in the mathematical community [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Several specifics of
mathematical publications come into effect here. First of all, since mathematics is a
relatively small area in terms of number of documents and references, metrics
are especially sensitive to distortions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This leads, e.g., to the situation that a
journal's impact factor is currently uncorrelated with its scientific quality [
        <xref ref-type="bibr" rid="ref3">3,18</xref>
        ].
A second factor is the unusual longevity of mathematical research [5]. With a
citation half-life well beyond ten years, standard measures fail to
count the most significant impact quantities.
      </p>
      <p>A third effect is the very diverse nature of mathematics. Being the language
of modern exact science, mathematical research is interspersed with basically
all scientific subjects, which heavily influences the publication behavior in all
possible aspects such as availability, peer review policies, publication delay, or
coauthor networks [13]. Consequently, quantitative measures vary vastly even
when restricted to mathematical research alone. Publication and citation
frequencies in particular are heavily influenced by the subject. A known effect is
the perceived topical bias in generalist math journals [6]: several mathematical
areas that contribute largely to the overall publication (and
citation) numbers, such as (mathematical) computer science, statistics, mathematical
economics, or mathematical physics, contribute only marginally to generalist
top-tier mathematical journals. As shown in [10], this effect prevails for all generalist
math journals independent of specifics like region, editorial board, or publisher.</p>
      <p>Therefore, quantitative methods (bibliometric analysis, identification of trends
or hot research topics, etc.) are prone to misrepresentations, and need to take
subject information into account. An adequate starting point would be to employ
the Mathematics Subject Classification (MSC2010).</p>
      <p>The Mathematics Subject Classification (MSC)<sup>3</sup> is a classification scheme
introduced in 1970 and maintained by Mathematical Reviews and zbMATH. The
MSC has been revised every decade to adapt to the development of mathematics.
Its current version MSC2010 was published in 2009. Traditionally a hierarchical
system, its suitability to reflect the underlying connections between
mathematical subjects is limited, though the recent versions include some attempts,
such as cross-references, to improve on this. For bibliometric studies, it would be highly
desirable to derive further information on the similarity of MSC classes.</p>
      <p>As MSC2010 has already been represented as Linked Data and mapped to
the DBpedia<sup>4</sup> knowledge base as well as to the ACM Computing Classification
System<sup>5</sup>, the mapped information sources can be deployed as complementary
information for further bibliometric and scientometric analysis.</p>
      <p>The statistical analysis in this paper is based on the MSC assignments to
mathematical publications in the zbMATH database. zbMATH (formerly
Zentralblatt MATH)<sup>6</sup> is the world's most comprehensive and longest-running
abstracting and reviewing service in pure and applied mathematics. Produced by
FIZ Karlsruhe – Leibniz Institute for Information Infrastructure (FIZ
Karlsruhe), it is edited by the European Mathematical Society (EMS), FIZ
Karlsruhe, and the Heidelberg Academy of Sciences and Humanities, and distributed
by Springer.</p>
      <p>Earlier work [16] applied NLP methods to the zbMATH corpus, and obtained
the following overlap of top-level MSC classes (cf. Fig. 1). Darker colors indicate
a higher similarity of subjects, hence the overall picture suggests that the
classification is far from being relatively homogeneous, but rather contains many
intrinsic relations which can be further studied by similarity analysis.</p>
      <p>In this paper we make the following contributions:</p>
      <p>(i) The already existing MSC2010 mapping to DBpedia has been corrected,
upgraded to the most recent version, and enriched with additional subject
mappings via the SKOS vocabulary [11].</p>
      <p>(ii) Statistical and semantic measures have been applied to compute the
similarity of the MSC categories.</p>
      <p>(iii) Inconsistencies and other issues have been detected in MSC2010 that should
be addressed in the subsequent version MSC2020.</p>
      <p><sup>3</sup> http://msc2010.org/</p>
      <p><sup>4</sup> http://dbpedia.org/</p>
      <p><sup>5</sup> http://www.acm.org/about/class/</p>
      <p><sup>6</sup> https://zbmath.org/</p>
      <p>The paper is structured as follows: Sec. 2 presents the underlying zbMATH
data sources as well as a brief outline of related work. In Sec. 3, the statistical
analysis of the MSC2010 subject classification based on the bibliographic data
of zbMATH is presented including the linking of MSC2010 to DBpedia as well
as the semantic similarity computation based on DBpedia. Sec. 4 discusses the
achieved results and Sec. 5 concludes the paper with a brief summary and an
outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The zbMATH database contains more than 3.5 million bibliographic entries with
reviews or abstracts, currently drawn from more than 3,000 journals and serials
and 170,000 books. Almost 10 million matched references provide the links of the
citation network. Reviews are written by more than 10,000 international experts,
and the entries are classified according to the MSC scheme (MSC2010). The
coverage starts in the 18th century and is complete from 1868 to the present through the
integration of the "Jahrbuch über die Fortschritte der Mathematik" database.
zbMATH is a subscription service but also allows non-subscribers to run queries
and to access the freely available zbMATH author profile pages.</p>
      <p>The Mathematics Subject Classification MSC2010 is organized as a
three-level classification tree with 63 first-level nodes, over 400 second-level nodes, and
more than 5,000 leaf nodes. MSC2010 is used by zbMATH to provide subject
information for more than 3 million research articles, chapters, and proceedings
papers, which are already indexed using this schema. MSC is designed for
indexing resources at the granularity of articles or conference proceedings papers, i.e.,
it exposes article topics in a general way but does not capture the specific theorems,
functions, or sequences that are proven or discussed in a paper [14]. Documents
from 1970–2009 classified by earlier MSC versions have been mapped to the
current MSC2010 by conversion tables. Fig. 2 (left) shows the distribution of the
zbMATH papers over the different MSC categories, while Fig. 2 (right) shows the
distribution of the authors of the papers over the MSC categories. Fig. 3 illustrates
the highly diverse citation frequency (i.e., average citations to a paper indexed in
zbMATH in a given subject) for the top-level MSC categories, which differs by a
factor greater than 10. This reinforces the necessity to take subject information
into account for bibliometric studies in mathematics.</p>
      <p>The MSC taxonomy has become part of the Web of Data: it is represented via the
SKOS (Simple Knowledge Organization System, [12]) vocabulary and RDF
(Resource Description Framework, [9]) and mapped to identical concepts in DBpedia
as well as in the ACM Computing Classification System via owl:sameAs links [8]. In
[7], Hu et al. describe a similar effort where data and metadata from the
Semantic Web Journal are exposed as linked data, using a SPARQL endpoint. In
that work, instead of the MSC classification, the authors use and extend the
Bibliographic ontology<sup>7</sup>.</p>
      <p>
        One of the subtasks in this work is to reveal the relations between the
MSC2010 categories. For this purpose, we use the owl:sameAs links of the MSC
categories to the corresponding ones in DBpedia and compute the semantic
similarity of the latter as an additional factor in the general analysis of the
MSC2010 taxonomy. Several methods have been proposed to compute semantic
similarity in ontologies, such as [17] and [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this work we decided to use the
approach proposed in [15], since it focuses on computing the semantic similarity
among DBpedia resources.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Statistical and Semantic Analysis</title>
      <p>For this work, the existing mappings between MSC2010 categories and DBpedia
entities by Lange et al. [8] had to be corrected and synchronized with the
current version of DBpedia<sup>8</sup>. In the course of this process, new mappings between
MSC2010 categories and DBpedia entities have also been created (cf. Fig. 4). The
following adjustments have been made:</p>
      <p>- The original mappings contain links to many so-called redirect pages, i.e.,
URIs (Uniform Resource Identifiers) that do not directly identify a
DBpedia entity. DBpedia entities correspond to Wikipedia pages. Thus, DBpedia
redirect URIs correspond to Wikipedia redirect pages, which serve the
purpose of linking a Wikipedia page to alternative spellings, misspellings, or
synonyms of the title of the underlying subject. DBpedia redirect URIs had
to be replaced by the DBpedia entity they redirect to.</p>
      <p>
        - Since DBpedia is based on Wikipedia snapshots, which are published only a
few times per year, entities represented in DBpedia are subject to change [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
For example, several names of Wikipedia categories or YAGO categories for
mathematical subjects corresponding to MSC2010 categories have changed
and had to be substituted by their successors whenever possible.
      </p>
      <p>- MSC2010 categories are arranged as a taxonomy. A superordinate category
points to its subordinate categories via skos:narrower, and a subordinate category
points back via skos:broader. Since DBpedia entities linked via owl:sameAs can be
considered identical to their corresponding MSC2010 categories, new
skos:narrower and skos:broader relations can be created between the
superordinate or subordinate MSC2010 categories and these DBpedia
entities. Likewise, the MSC2010 categories linked by skos:seeAlso to other
MSC2010 categories can also be linked to their corresponding DBpedia
entities via the same property. Fig. 4 depicts the newly created mappings with
thick arrows and property names in bold face.</p>
      <p>[Fig. 4: diagram of the MSC2010 to DBpedia mapping. Nodes: ConceptScheme, Concept, DBpediaEntity. Properties: skos:topConceptOf, skos:hasTopConcept, skos:inScheme, skos:seeAlso, owl:sameAs, skos:narrower, skos:broader.]</p>
      <p><sup>7</sup> http://bibliontology.com/</p>
      <p><sup>8</sup> DBpedia version 2016-04, http://wiki.dbpedia.org/dbpedia-version-2016-04</p>
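      <p>The adjustments described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure actually used to build the published mapping: the dictionaries, the prefixed identifiers, and the helper names resolve_redirect and propagate_links are hypothetical, and the direction in which new skos:narrower and skos:broader links are attached is only one possible reading of the description.</p>
      <preformat><![CDATA[
```python
# Hypothetical toy data; none of these identifiers is taken from the real mapping.
REDIRECTS = {"dbr:Old_name": "dbr:Current_name"}  # redirect URI -> target URI
SAME_AS = {"msc:Child": "dbr:Old_name", "msc:Top": "dbr:Top_entity"}
BROADER = {"msc:Child": "msc:Middle", "msc:Middle": "msc:Top"}  # child -> parent
SEE_ALSO = {"msc:Other": "msc:Child"}  # category -> related category


def resolve_redirect(uri, redirects):
    """Follow redirect links until a URI that is not a redirect is reached."""
    seen = set()
    while uri in redirects and uri not in seen:  # 'seen' guards against cycles
        seen.add(uri)
        uri = redirects[uri]
    return uri


def propagate_links(same_as, broader, see_also, redirects):
    """Return (subject, property, object) triples from MSC categories to DBpedia."""
    resolved = {msc: resolve_redirect(dbr, redirects) for msc, dbr in same_as.items()}
    triples = {(msc, "owl:sameAs", dbr) for msc, dbr in resolved.items()}
    for child, parent in broader.items():
        if child in resolved:
            # The entity of a subordinate category is narrower than its parent.
            triples.add((parent, "skos:narrower", resolved[child]))
        if parent in resolved:
            # The entity of a superordinate category is broader than its child.
            triples.add((child, "skos:broader", resolved[parent]))
    for source, target in see_also.items():
        if target in resolved:
            triples.add((source, "skos:seeAlso", resolved[target]))
    return triples


TRIPLES = propagate_links(SAME_AS, BROADER, SEE_ALSO, REDIRECTS)
```
]]></preformat>
      <p>In this toy example the redirect dbr:Old_name is replaced by dbr:Current_name before any link is created, and the category msc:Middle, which has no owl:sameAs link of its own, still obtains one skos:narrower and one skos:broader link to DBpedia through its neighbours in the taxonomy.</p>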
      <p>In the original MSC2010 DBpedia mapping, 970 MSC2010 subjects are linked
to 2,690 DBpedia entities via owl:sameAs. 5,245 MSC2010 subjects remained
without a mapping to DBpedia. Via the new mappings mentioned above, 1,525
MSC2010 subjects could be mapped to DBpedia entities via skos:narrower, an
additional 234 MSC2010 subjects via skos:broader, as well as 283 MSC2010
subjects via skos:seeAlso<sup>9</sup>.</p>
      <p>To better understand the similarity of the MSC2010 categories based on the
number of co-occurring papers, the available mappings of MSC2010 to DBpedia
have been used to compute the semantic similarity between the mapped
DBpedia entities via the semantic similarity measure proposed by [15]. This
measure takes into account the similarity of the properties of the two
resources and satisfies the fundamental
axioms for similarity measures such as "equal self-similarity", "symmetry", and
"minimality". For each pair of MSC2010 categories, the semantic similarity of the
corresponding DBpedia entities linked via owl:sameAs has been computed
and compared to the similarity of the MSC2010 categories based on the papers
they are assigned to.</p>
      <p>To compute the similarity of the MSC2010 categories we use the Jaccard
similarity measure (see Equation 1). In our setting, for each MSC2010 category
pair (cat<sub>a</sub>, cat<sub>b</sub>), the similarity is the number of papers assigned to
both categories cat<sub>a</sub> and cat<sub>b</sub>, normalized by the total number of papers assigned
to cat<sub>a</sub> or to cat<sub>b</sub>. In the formula, each category is identified with the set of
papers assigned to it.</p>
      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{Jaccard}(cat_a, cat_b) := \frac{|cat_a \cap cat_b|}{|cat_a \cup cat_b|}</tex-math>
        </disp-formula>
We compute the Jaccard similarity between categories w.r.t. the zbMATH
collection of documents from the mathematical domain. Each paper from zbMATH
is assigned to one or more MSC2010 categories.</p>
      <p><sup>9</sup> The MSC2010 mapping is available at http://bit.ly/MSC2010mapping-ntriples</p>
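      <p>The Jaccard computation for category pairs can be illustrated with a small Python sketch. The paper identifiers, the assignments, and the helper names papers_by_category and jaccard are hypothetical and serve only to make the definition concrete; the real computation runs over the zbMATH collection.</p>
      <preformat><![CDATA[
```python
from itertools import combinations

# Hypothetical toy data: paper id -> set of MSC2010 codes assigned to it.
PAPERS = {
    "p1": {"11Axx", "11Bxx"},
    "p2": {"11Axx", "11Bxx"},
    "p3": {"11Axx"},
    "p4": {"11Bxx", "05Cxx"},
    "p5": {"05Cxx"},
}


def papers_by_category(assignments):
    """Invert paper -> categories into category -> set of paper ids."""
    index = {}
    for paper_id, categories in assignments.items():
        for category in categories:
            index.setdefault(category, set()).add(paper_id)
    return index


def jaccard(papers_a, papers_b):
    """Size of the intersection over size of the union; 0.0 for two empty sets."""
    union = papers_a | papers_b
    if not union:
        return 0.0
    return len(papers_a & papers_b) / len(union)


INDEX = papers_by_category(PAPERS)
SCORES = {
    (a, b): jaccard(INDEX[a], INDEX[b])
    for a, b in combinations(sorted(INDEX), 2)
}
```
]]></preformat>
      <p>Because the measure is symmetric, each unordered pair is scored only once; in the toy data the pair (11Axx, 11Bxx) shares two of four papers and therefore receives the score 0.5.</p>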
      <p>Furthermore, apart from the Jaccard coefficient presented in Equation 1, we
compute the asymmetric Jaccard between two categories as in Equation 2.</p>
      <p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathrm{asymmetric\ Jaccard}(cat_a \rightarrow cat_b) := \frac{|cat_a \cap cat_b|}{|cat_a|}</tex-math>
        </disp-formula>
      </p>
      <p>In other words, the number of papers assigned to both categories cat<sub>a</sub> and
cat<sub>b</sub> is normalized by the number of papers assigned to cat<sub>a</sub>. The asymmetric
Jaccard is a useful indicator for the discovery of subclass relationships between
the categories based on the papers assigned to them.</p>
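      <p>The directional character of the asymmetric Jaccard can be seen in a short Python sketch. The paper sets are hypothetical, and the helper looks_like_subclass together with its threshold of 0.9 is an illustrative assumption; the text does not state a cut-off for subclass candidates.</p>
      <preformat><![CDATA[
```python
def asymmetric_jaccard(papers_a, papers_b):
    """Share of the papers in A that are also in B; 0.0 if A is empty."""
    if not papers_a:
        return 0.0
    return len(papers_a & papers_b) / len(papers_a)


def looks_like_subclass(papers_a, papers_b, threshold=0.9):
    """Heuristic with an assumed threshold: A behaves like a subclass of B
    when nearly all papers of A are also assigned to B."""
    return asymmetric_jaccard(papers_a, papers_b) >= threshold


# Hypothetical toy data: sets of paper ids assigned to two categories.
NARROW = {"p1", "p2"}
WIDE = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"}

FORWARD = asymmetric_jaccard(NARROW, WIDE)   # 2 shared papers out of 2
BACKWARD = asymmetric_jaccard(WIDE, NARROW)  # 2 shared papers out of 8
```
]]></preformat>
      <p>A high value in one direction combined with a low value in the other is the pattern that hints at a subclass relationship, whereas two high values indicate that the categories largely coincide.</p>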
      <p>Since both measures, Jaccard and semantic similarity, are not directly
comparable due to their different scaling behavior, we measure their correlation and
see how the different measures capture the similarities between the different
categories. We measure the correlation via Pearson's rank correlation.</p>
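      <p>As an illustration of the correlation step, the following Python sketch computes the plain Pearson coefficient r for two lists of scores. It is a sketch under assumptions: the score lists are hypothetical, the function name pearson_r is ours, and whether the scores were converted to ranks before correlating is not stated in the text, so no rank transformation is applied here.</p>
      <preformat><![CDATA[
```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same four category pairs under both measures.
JACCARD_SCORES = [0.10, 0.20, 0.30, 0.40]
SEMANTIC_SCORES = [0.55, 0.60, 0.70, 0.65]

R = pearson_r(JACCARD_SCORES, SEMANTIC_SCORES)
```
]]></preformat>
      <p>The coefficient is invariant under positive linear rescaling of either list, which is why it can compare two measures whose raw values live on different scales.</p>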
      <p>Table 1 shows the achieved correlation results for symmetric Jaccard as well
as for asymmetric Jaccard with the semantic similarities, based on DBpedia
entities. For the experiment, the top 10,000 similar MSC2010 categories for
symmetric Jaccard as well as for asymmetric Jaccard have been taken into account.
Of these, only 937 for symmetric and 672 for asymmetric (cat<sub>a</sub> → cat<sub>b</sub>)
could be mapped to DBpedia via owl:sameAs. The results are discussed in the
subsequent section.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>similarity measure</th><th>r</th></tr>
          </thead>
          <tbody>
            <tr><td>semantic – Jaccard(cat<sub>a</sub>, cat<sub>b</sub>)</td><td>0.34</td></tr>
            <tr><td>semantic – asymmetric Jaccard(cat<sub>a</sub> → cat<sub>b</sub>)</td><td>-0.04</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>In this section, we discuss our findings related to (i) the statistical and semantic
similarity measures applied to the MSC taxonomy and (ii) the issues in
MSC2010 that can be addressed in the next version, MSC2020.</p>
      <sec id="sec-3-1">
        <title>Similarity measures comparison</title>
        <p>Table 1 presents the correlation between the two similarity measures, the
semantic and the Jaccard similarity. In the first row, we show the correlation between
the semantic similarity and the symmetric Jaccard coefficient. We find a moderate
correlation between the two measures with a score of r = 0.34. As for the
last row of the table, the correlation between the semantic similarity and the
asymmetric Jaccard is weak (r = -0.04). The explanation for this lies in the different nature
of the two measures. In contrast to the semantic similarity, the asymmetric Jaccard is
suitable for revealing subsumption relationships. A
remedy would be to compute the asymmetric Jaccard in both directions for each pair
and to consider the pair similar only if both values are high, which is
what the symmetric Jaccard is designed for.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Issues in MSC2010</title>
        <p>Structural issues. One of the contributions of this work is to discover
and suggest parts of the MSC2010 taxonomy that should be changed or improved
in the next revision of the MSC, MSC2020. A first finding of
the semantic analysis of the corpus concerns subtrees in the taxonomy in
which the first- or second-level category bears exactly the same name as the
child category on the next level (second or third): here the additional level of
hierarchy generally does not contribute any distinguishing value. Furthermore, in the
majority of the cases the third level contains only one more category, labeled
"None of the above, but in this section". It is worth mentioning that
the vast majority of the zbMATH papers related to those first-level categories
are directly associated with the second-level category with the same
label. Only very few of them are associated with the second-level category
labeled "None of the above, but in this section", and none are directly associated
with the first-level category.</p>
        <p>To this end, a suggestion for the next MSC version would be to revise those
branches in the taxonomy and possibly merge them into one category per
subtree. Table 2 presents a non-exhaustive list of categories that fall under the
described exceptional case.</p>
        <p>Wrong owl:sameAs links. Another interesting observation concerns the
owl:sameAs links that exist between MSC2010 and DBpedia. Many of the
existing owl:sameAs relations link the MSC categories to redirect resources
of DBpedia instead of linking them to the correct resources themselves.
Moreover, there are even erroneous links between the two schemas; for example, the
category 65A05 with label Tables is linked (among others) to dbr:Tablets<sup>10</sup>.
Therefore, together with the new revision MSC2020, another effort can be
made to improve and extend the existing owl:sameAs links.</p>
        <p><sup>10</sup> http://dbpedia.org/resource/Tablets</p>
        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Supercategory</th><th>Subcategory</th><th>Label</th></tr>
            </thead>
            <tbody>
              <tr><td>13Gxx</td><td>13G05</td><td>Integral domains</td></tr>
              <tr><td>14Txx</td><td>14T05</td><td>Tropical geometry</td></tr>
              <tr><td>22Cxx</td><td>22C05</td><td>Compact groups</td></tr>
              <tr><td>45Bxx</td><td>45B05</td><td>Fredholm integral equations</td></tr>
              <tr><td>45Dxx</td><td>45D05</td><td>Volterra integral equations</td></tr>
              <tr><td>45Pxx</td><td>45P05</td><td>Integral operators</td></tr>
              <tr><td>45Qxx</td><td>45Q05</td><td>Inverse problems</td></tr>
              <tr><td>62Qxx</td><td>62Q05</td><td>Statistical tables</td></tr>
              <tr><td>65Axx</td><td>65A05</td><td>Tables</td></tr>
              <tr><td>70Cxx</td><td>70C20</td><td>Statics</td></tr>
              <tr><td>83Axx</td><td>83A05</td><td>Special relativity</td></tr>
              <tr><td>83Fxx</td><td>83F05</td><td>Cosmology</td></tr>
              <tr><td>85-XX</td><td>85Axx</td><td>Astronomy and astrophysics</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>Conclusion and Outlook on Future Work</p>
        <p>In this paper we have shown that beyond the three-level hierarchical structure of
the MSC2010 taxonomy, intrinsic similarities can be measured in several
alternative ways. These indicate options to enhance the taxonomy structure towards
a more detailed ontology. The updated and partially corrected DBpedia
linking provides additional information. A first useful result is the detection of a
systematic flaw in the current MSC2010 scheme concerning the "None of the
above, but in this section" leaves. A more detailed future analysis will aim to
suggest more adaptations. Trend mining of publications since 2010 will indicate
new research areas, which currently are not covered sufficiently, and a further
discussion of similarities is likely to provide structural insights supporting the
ongoing task of the MSC2020 revision.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adler</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ewing</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Citation statistics. A report from the International Mathematical Union (IMU) in cooperation with the International Council of Industrial and Applied Mathematics (ICIAM) and the Institute of Mathematical Statistics (IMS)</article-title>
          .
          <source>Stat. Sci</source>
          .
          <volume>24</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>14</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alsubait</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Measuring similarity in ontologies: A new family of measures</article-title>
          .
          <source>In: EKAW</source>
          . pp.
          <fpage>13</fpage>
          –
          <lpage>25</lpage>
          . Lecture Notes in Computer Science (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Arnold</surname>
            ,
            <given-names>D.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fowler</surname>
            ,
            <given-names>K.K.</given-names>
          </string-name>
          :
          <article-title>Nefarious numbers</article-title>
          .
          <source>Notices Am. Math. Soc</source>
          .
          <volume>58</volume>
          (
          <issue>3</issue>
          ),
          <fpage>434</fpage>
          –
          <lpage>437</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Becker</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>DBpedia - a crystallization point for the web of data</article-title>
          .
          <source>Web Semant</source>
          .
          <volume>7</volume>
          (
          <issue>3</issue>
          ),
          <fpage>154</fpage>
          –
          <lpage>165</lpage>
          (Sep
          <year>2009</year>
          ), http://dx.doi.org/10.1016/j.websem.2009.07.002
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>