<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Delineating Fields Using Mathematical Jargon</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jevin D. West</string-name>
          <email>jevinw@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Portenoy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information School, University of Washington</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>63</fpage>
      <lpage>71</lpage>
      <abstract>
        <p>Tracing ideas through the scientific literature is useful for understanding the origin of ideas and for generating new ones. Machines can be trained to do this at large scale, feeding search engines and recommendation algorithms. Citations and text are the features commonly used for these tasks. In this paper, we focus on a largely ignored facet of scholarly papers: the equations. Mathematical language varies from field to field, but original formulae are maintained over generations (e.g., Shannon's entropy equation). Here we extract a common set of mathematical symbols from more than 250,000 LaTeX source files in the arXiv repository. We compare the symbol distributions across different fields and calculate the jargon distance between fields. We find a greater difference between the experimental and theoretical disciplines than within these groups. This provides a first step toward using equations as a bridge between disciplines that may not cite each other, or may speak different natural languages, but use a similar mathematical language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There has been considerable effort put into building and designing new recommendation
algorithms to help scholars find relevant papers. Most of these methods depend
on citations [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], full text [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or usage data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One feature that has been largely
ignored is the set of equations in a paper and the mathematical language surrounding those
equations. These formal languages can tie together papers and ideas across fields and
time periods. Shannon’s famous entropy equation (also used in this paper) is an
example of this kind of trace [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Unlike natural languages, formal languages such
as mathematics are exempt from plagiarism rules: the norm is for an author to
copy an equation verbatim from the original source. This provides a unique opportunity
for tracing ideas back to their origins and for tracking them forward in time.
      </p>
      <p>
        There have been attempts at utilizing equations as a search facet. For
example, Springer’s LaTeX Search tool (http://latexsearch.com) allows authors to search formulae (in
LaTeX format) from more than 8 million documents in Springer journals and
articles. This is used both for finding similar formulae and for
linking formulae to existing documents. But what if a researcher wants to find
not just individual manuscripts with the same equations, but fields of study and
groups of papers using similar language? What kind of formalism can be used
to map jargon differences across the quantitative sciences?
      </p>
      <p>
        In this paper, we measure the communication efficiency, or “jargon distance”,
between fields using mathematical symbols. The jargon metric is derived from
a recent paper by Vilhena et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which measures the distance between
disciplines using n-grams extracted from millions of papers in the JSTOR corpus.
In our paper, we find that the metric separates fields that differ both in
content and in mathematical notation (a judgment based on inspection of sample papers
in the respective fields; more rigorous inspection is needed).
      </p>
      <p>
        Our ultimate research goal is to find ways to utilize equations and formal
notation in scholarly recommendation. The more proximate goal of this paper
is to validate that equation jargon can delineate the relationships among fields
at the scale of hundreds of thousands of papers, in ways similar to citation- and
text-based clustering. We show that the proposed jargon metric
does a reasonable job of identifying fields with similar attributes and
language. The next steps will involve full-corpus scaling, extracting
mathematical grammar, and incorporating the metric into citation-based recommendation
algorithms [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        Data. Research papers were downloaded from arXiv.org (via the bulk data download,
https://arxiv.org/help/bulk_data), an open-access e-print
service in the fields of physics, mathematics, computer science, quantitative
biology, quantitative finance, and statistics. For this study, we used a sample (N =
266,906) of papers published between 2000 and 2009, analyzing papers that had field designations
in their filenames. We compiled a list of the LaTeX representations of 103
symbols commonly used in mathematical formulae (modified from
https://www.sharelatex.com/learn/List_of_Greek_letters_and_math_symbols).
These symbols included commonly used Greek letters (e.g. “\alpha” [α], “\omega” [ω]), arrows (e.g.
“\rightarrow” [→]), binary operation/relation symbols (e.g. “=”, “&gt;”, “\geq”
[≥], “\times” [×]), and some other symbols (e.g. “\forall” [∀]). The list of
symbols is not exhaustive; for example, it does not include certain structural
elements of equations such as sums, and we intend to expand it in future
analysis. A shortcoming of this approach is that it does not take into account the ability of authors
to redefine commands and refer to symbols by the new command names,
but it gives a rough idea of the usage of these symbols over the corpus. Using
the LaTeX source for the arXiv papers, we counted the occurrences of each of
these symbols. We used the field designations provided by the arXiv; however, the
fields could be determined by other methods, including citation clustering [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
co-citations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and topic modeling [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We intend to combine the jargon metric
with these different clustering methods in future studies.
      </p>
      <p>
        To measure the communication barrier between fields, we adapt a metric
developed by Vilhena et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In that study, the authors develop a model of
communication to topographically map the JSTOR corpus. Instead of using n-grams
from full text as in the Vilhena study, we use the language of mathematics
(symbolic notation and Greek letters) extracted from LaTeX source files.
      </p>
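      <p>As a minimal sketch of the counting step described above, a per-file symbol tally can be done with one regular-expression pass over the LaTeX source. The six-entry symbol list here is a hypothetical subset of the 103 symbols, and the guard against trailing letters is our assumption about how to avoid matching longer command names:</p>

```python
import re
from collections import Counter

# Hypothetical subset of the 103-symbol list described in the text.
SYMBOLS = [r"\alpha", r"\omega", r"\rightarrow", r"\geq", r"\times", r"\forall"]

# Match each command only when it is not followed by another letter,
# so "\alpha" does not also match inside a longer command name.
PATTERNS = {s: re.compile(re.escape(s) + r"(?![A-Za-z])") for s in SYMBOLS}

def count_symbols(tex_source):
    """Count occurrences of each listed symbol in one LaTeX source file."""
    counts = Counter()
    for symbol, pattern in PATTERNS.items():
        counts[symbol] = len(pattern.findall(tex_source))
    return counts

tex = r"Let $\alpha \geq 0$ and note that $\alpha \rightarrow \omega$."
counts = count_symbols(tex)
print(counts[r"\alpha"], counts[r"\geq"])  # 2 1
```

      <p>Summing these per-file counters over all papers in a field yields the field-level symbol frequency distribution used below.</p>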
      <p>The jargon distance (Eij) between fields i and j is determined by calculating
the ratio between two quantities: (1) the entropy H of a random variable Xi with a
probability distribution pi based on the frequency of mathematical symbols within
field i, and (2) the cross entropy Q between the distributions of fields i and j
(the cross entropy is the entropy of Xi plus the Kullback-Leibler divergence
between pi and pj [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]):</p>
      <p>Eij = H(Xi) / Q(pi||pj) = [ Σ_{x∈X} pi(x) log2 pi(x) ] / [ Σ_{x∈X} pi(x) log2 pj(x) ]   (1)</p>
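      <p>Equation (1) can be sketched in a few lines. The small smoothing mass added to the denominator distribution is our assumption (it keeps the cross entropy finite when field j never uses a symbol that field i uses), and the symbol counts are hypothetical:</p>

```python
import math
from collections import Counter

def jargon_distance(counts_i, counts_j, eps=1e-9):
    """Eij = H(Xi) / Q(pi||pj): entropy of field i's symbol distribution
    over the cross entropy of that distribution against field j's.
    eps is a smoothing mass (an assumption of this sketch)."""
    symbols = set(counts_i) | set(counts_j)
    total_i = sum(counts_i.values())
    total_j = sum(counts_j.values()) + eps * len(symbols)
    h = q = 0.0
    for s in symbols:
        p_i = counts_i[s] / total_i
        if p_i == 0:
            continue  # terms with pi(x) = 0 contribute nothing
        p_j = (counts_j[s] + eps) / total_j
        h -= p_i * math.log2(p_i)
        q -= p_i * math.log2(p_j)
    return h / q

a = Counter({r"\alpha": 50, r"\geq": 50})
b = Counter({r"\alpha": 50, r"\times": 50})
print(round(jargon_distance(a, a), 6))  # 1.0: identical distributions decode perfectly
print(jargon_distance(a, b) < 1.0)      # True: divergent notation lowers efficiency
```

      <p>Because the cross entropy is always at least the entropy, Eij lies in (0, 1], with 1 meaning no communication barrier.</p>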
      <p>
        This calculation derives from a general communication heuristic in which
a mathematician communicates with another mathematician via a channel [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The mathematician writing the formulae in field i has a codebook Pi that maps
mathematical concepts to codewords, which the mathematician in field j, with
codebook Pj, has to decode. Note that the metric is not symmetric: field j may
decode concepts in field i more efficiently than field i decodes concepts in field j. The model
assumes that the codebooks are optimized based on the frequency of different
terms. This power-law assumption holds well for English words [
        <xref ref-type="bibr" rid="ref15 ref16">15,16</xref>
        ]. It also
seems to hold for mathematical terms; we find a Zipfian distribution in our
sample (see Figure 1).
      </p>
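      <p>A quick way to check the Zipfian claim without plotting is to fit the slope of log frequency against log rank; a slope near -1 is the classic Zipf signature. This sketch uses synthetic frequencies, not the arXiv sample:</p>

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# An exact Zipf law f(r) = C / r has slope -1 on log-log axes.
print(round(zipf_slope([1000 / r for r in range(1, 101)]), 6))  # -1.0
```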
      <p>This simple metric of communication efficiency has advantages. First, it is based on
a model of communication, which has firm theoretical foundations and additional
tools to build upon. Second, it is easy to calculate and can be run over an entire corpus
in a relatively short amount of time (we calculated distances for 200k papers in less
than 5 minutes on a micro EC2 instance on Amazon Web Services).</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We computed the jargon distance between each pair of fields. Using the resulting distance
matrix, we applied standard hierarchical clustering methods
in order to infer which fields were most like each other. We visualized the
groupings using dendrograms. The dendrogram in Figure 2 was produced using
a hierarchical clustering algorithm implemented in SciPy’s linkage function. We
used UPGMA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], an agglomerative method that uses average linkage as
its criterion. The clustering is done on the adjacency matrix of fields, where Eij
is the distance. Although the distance metric is not symmetric, for the clustering
we symmetrize the matrix by taking the average of the distances Eij and Eji.
      </p>
      <p>
        We find a separation between the theoretical and the experimental sciences. Nuclear
Experiment is most similar to High Energy Physics - Experiment (see Figure 2).
Computer Science is an outlier to the rest of the fields. Mathematics
shares a closer branch with Mathematical Physics than with any discipline
other than Quantum Physics. Quantitative Biology sits on its own, but within
the branch that includes Condensed Matter, Physics, and Nonlinear Sciences, among
others. We need to further investigate whether these similarities are real.
      </p>
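      <p>The symmetrization and UPGMA steps described above can be sketched with SciPy's linkage function. The four fields and the distance matrix are hypothetical stand-ins for the real jargon-distance matrix:</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical jargon-distance matrix: E[i][j] is the distance from field i to field j.
fields = ["hep-ex", "nucl-ex", "math", "cs"]
E = np.array([
    [0.00, 0.10, 0.60, 0.70],
    [0.12, 0.00, 0.55, 0.65],
    [0.58, 0.52, 0.00, 0.40],
    [0.72, 0.68, 0.42, 0.00],
])

# Symmetrize as in the text: average Eij and Eji, and zero the diagonal.
S = (E + E.T) / 2.0
np.fill_diagonal(S, 0.0)

# UPGMA = agglomerative clustering with average linkage on the condensed matrix.
Z = linkage(squareform(S), method="average")

# Cutting the tree into two groups separates the experimental pair from the rest.
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(fields, labels)))
```

      <p>The linkage matrix Z can then be passed to scipy.cluster.hierarchy.dendrogram to draw a tree like the one in Figure 2.</p>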
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
        Scientific papers contain many features that are used in search engines and
recommendation algorithms: authors, titles, full text, citations, figures, etc. One
feature largely ignored within the digital library community (at least relative
to the other features) is equations. In this paper, we apply a model of
communication efficiency between groups of papers first proposed by Vilhena et al.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We find that field relationships (i.e., how fields are grouped hierarchically)
are recapitulated when using the jargon distance metric (Figure 2). The results
are surprising given the simplicity of the data extraction and distance
calculation: we use only isolated symbols, and the frequencies of those symbols, to infer
the groupings of fields. The resultant groups seem to assemble in logical ways
(e.g., the experimental sciences group together, while the more theoretical fields
assemble in another area of the tree). Nevertheless, we see these results as only
preliminary evidence for using mathematical symbols as a way of clustering papers and
topics.
      </p>
      <p>
        We need to further investigate the true differences between fields. We plan to use
citation-based clustering as another means of field designation. We also plan
to talk to scholars in the various fields to assess the validity of the clusters.
In addition, we will extend our analysis to the full arXiv corpus. For corpora
with no LaTeX available, we plan to use computer vision techniques from the
viziometrics.org project for automatically extracting equations from PDFs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
We also plan to compare the jargon method to other well-known methods such as
cosine similarity [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], LDA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and word2vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The primary difference we see
from these methods is the communication theory underlying the jargon metric,
but analytic work is needed to make this argument. In addition, we
plan to expand beyond isolated symbols and analyze mathematical grammar.
      </p>
      <p>Assuming the methods hold, our ultimate goal for this research project is
to integrate our methods into existing recommendation engines at the scale of
micro-fields. It is at these finer scales that the method could bridge seemingly
disparate, emerging fields that use similar mathematical language.</p>
      <p>We also plan to extend this analysis to Science of Science questions,
investigating the birth and death of ideas and the sociology surrounding those ideas.
We see equations as an effective way of tracing ideas both forwards and
backwards in time. The relative stability of equations and mathematical language
provides a unique opportunity for tracking the movement of ideas across time
and across disciplines.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>We would like to thank three anonymous reviewers for their helpful feedback.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research 3</source>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.:</given-names>
          </string-name>
          <article-title>word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method</article-title>
          .
          <source>arXiv preprint arXiv:1402.3722</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kantor</surname>
            ,
            <given-names>P.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapira</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Recommender systems handbook</article-title>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibler</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>On information and sufficiency</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          (
          <year>1951</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lü</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Recommender systems</article-title>
          . p.
          <volume>97</volume>
          (
          <year>Feb 2012</year>
          ), http://arxiv.org/abs/1202.1112
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Viziometrix: A platform for analyzing the visual information in big scholarly data</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on World Wide Web. ACM</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <source>The mathematical theory of communication</source>
          , vol.
          <volume>27</volume>
          (
          <year>1948</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Small</surname>
            ,
            <given-names>H.:</given-names>
          </string-name>
          <article-title>Co-Citation in Scientific Literature: A new measure of the relationship between two documents</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <fpage>265</fpage>
          -
          <lpage>269</lpage>
          (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sokal</surname>
            ,
            <given-names>R.R.:</given-names>
          </string-name>
          <article-title>A statistical method for evaluating systematic relationships</article-title>
          .
          <source>Univ Kans Sci Bull</source>
          <volume>38</volume>
          ,
          <fpage>1409</fpage>
          -
          <lpage>1438</lpage>
          (
          <year>1958</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Steinbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>A comparison of document clustering techniques</article-title>
          .
          <source>In: KDD workshop on text mining</source>
          . vol.
          <volume>400</volume>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>526</lpage>
          . Boston (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vilhena</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosvall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergstrom</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Finding cultural holes: How structure and culture diverge in networks of scholarly communication</article-title>
          .
          <source>Sociological Science</source>
          <volume>1</volume>
          (
          <issue>June</issue>
          ),
          <fpage>221</fpage>
          -
          <lpage>238</lpage>
          (
          <year>2014</year>
          ), http://www.sociologicalscience.com/articles-vol1-15-221/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.:</given-names>
          </string-name>
          <article-title>Collaborative topic modeling for recommending scientific articles</article-title>
          .
          <source>In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>448</fpage>
          -
          <lpage>456</lpage>
          . ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wesley-Smith</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dandrea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An experimental platform for scholarly article recommendation</article-title>
          .
          <source>In: Proc. of the 2nd Workshop on Bibliometric-enhanced Information Retrieval (BIR2015)</source>
          . pp.
          <fpage>30</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wesley-Smith</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergstrom</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A recommendation system based on hierarchical clustering of an article-level citation network</article-title>
          .
          <source>IEEE Transactions on Big Data (in press)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>The psycho-biology of language</article-title>
          . Houghton Mifflin (
          <year>1935</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>Human behavior and the principle of least effort</article-title>
          . Addison-Wesley Press (
          <year>1949</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>