<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Delineating Fields Using Mathematical Jargon</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jevin D. West</string-name>
          <email>jevinw@uw.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Portenoy</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information School, University of Washington</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>63</fpage>
      <lpage>71</lpage>
      <abstract>
        <p>Tracing ideas through the scientific literature is useful for understanding the origin of ideas and for generating new ones. Machines can be trained to do this at large scale, feeding search engines and recommendation algorithms. Citations and text are the features commonly used for these tasks. In this paper, we focus on a largely ignored facet of scholarly papers: the equations. Mathematical language varies from field to field, but original formulae are maintained over generations (e.g., Shannon's entropy equation). Here we extract a common set of mathematical symbols from more than 250,000 LaTeX source files in the arXiv repository. We compare the symbol distributions across different fields and calculate the jargon distance between fields. We find a greater difference between the experimental and theoretical disciplines than within these groups. This provides a first step toward using equations as a bridge between disciplines that may not cite each other, or may speak different natural languages, but use a similar mathematical language.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There has been considerable effort put into building and designing new recommendation
algorithms to help scholars find relevant papers. Most of these methods depend
on citations [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], full text [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or usage data [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. One feature that has been largely
ignored is the set of equations in a paper and the mathematical language surrounding those
equations. These formal languages can tie together papers and ideas across fields and
time periods. Shannon’s famous entropy equation (also used in this paper) is an
example of this kind of trace [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Unlike natural languages, formal languages such
as mathematics are exempt from plagiarism rules: the norm is for an author to
copy an equation verbatim from the original source. This provides a unique opportunity
for tracing ideas back to their origins and for tracking them forward in time.
      </p>
      <p>
        There have been attempts at utilizing equations as a search facet. For
example, Springer’s LaTeX Search tool (http://latexsearch.com) allows authors to search formulae (in
LaTeX format) from more than 8 million documents in Springer journals and
articles. This is used both for finding similar formulae and for
linking formulae to existing documents. But what if a researcher wants to find
not just individual manuscripts with the same equations, but fields of study and
groups of papers using similar language? What kind of formalism can be used
to map jargon differences across the quantitative sciences?
      </p>
      <p>
        In this paper, we measure the communication efficiency, or “jargon distance”,
between fields using mathematical symbols. The jargon metric is derived from
a recent paper by Vilhena et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], which measures the distance between
disciplines using n-grams extracted from millions of papers in the JSTOR corpus.
In our paper, we find that the metric separates fields that differ both in
content and in mathematical notation (a judgment based on inspection of sample papers
in the respective fields; more rigorous inspection is needed).
      </p>
      <p>
        Our ultimate research goal is to find ways to utilize equations and formal
notation in scholarly recommendation. The more proximate goal of this paper
is to validate that equation jargon can delineate the relationships among fields
at the scale of hundreds of thousands of papers, in ways similar to citation- and
text-based clustering. We show that the proposed jargon metric
does a reasonable job of identifying fields with similar attributes and
language. The next steps will involve full-corpus scaling, extracting
mathematical grammar, and incorporating the metric into citation-based recommendation
algorithms [
        <xref ref-type="bibr" rid="ref13 ref14">13,14</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        Data. Research papers were downloaded from arXiv.org (via the bulk data download,
https://arxiv.org/help/bulk_data), an open-access e-print
service in the fields of physics, mathematics, computer science, quantitative
biology, quantitative finance, and statistics. For this study, we used a sample (N =
266,906) of papers published between 2000 and 2009, analyzing papers that had field designations
in their filenames. We compiled a list of the LaTeX representations of 103
symbols commonly used in mathematical formulae (modified from
https://www.sharelatex.com/learn/List_of_Greek_letters_and_math_symbols).
These symbols included commonly used Greek letters (e.g. “\alpha” [α], “\omega” [ω]), arrows (e.g.
“\rightarrow” [→]), binary operation/relation symbols (e.g. “=”, “&gt;”, “\geq”
[≥], “\times” [×]), and some other symbols (e.g. “\forall” [∀]). The list of
symbols is not exhaustive; for example, it does not include certain structural
elements of equations such as sums, and we intend to expand it in future
analysis. A shortcoming of this approach is that it does not take into account the ability of authors
to redefine commands and refer to symbols by the new command names,
but it gives a rough idea of the usage of these symbols over the corpus. Using
the LaTeX source for the arXiv papers, we counted the occurrences of each of
these symbols. We used the field designations provided by the arXiv; however, the
fields could be determined by other methods, including citation clustering [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ],
co-citations [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and topic modeling [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. We intend to combine the jargon metric
with these different clustering methods in future studies.
      </p>
      <p>
        To measure the communication barrier between fields, we adapt a metric
developed by Vilhena et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In that study, the authors develop a model of
communication to topographically map the JSTOR corpus. Instead of using n-grams
from full text as in the Vilhena study, we use the language of mathematics
(symbolic notation and Greek letters) extracted from LaTeX source files.
      </p>
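      <p>As a minimal sketch of the counting step described above, a per-file symbol tally can be done with one regular-expression pass over the LaTeX source. The six-entry symbol list here is a hypothetical subset of the 103 symbols, and the guard against trailing letters is our assumption about how to avoid matching longer command names:</p>

```python
import re
from collections import Counter

# Hypothetical subset of the 103-symbol list described in the text.
SYMBOLS = [r"\alpha", r"\omega", r"\rightarrow", r"\geq", r"\times", r"\forall"]

# Match each command only when it is not followed by another letter,
# so "\alpha" does not also match inside a longer command name.
PATTERNS = {s: re.compile(re.escape(s) + r"(?![A-Za-z])") for s in SYMBOLS}

def count_symbols(tex_source):
    """Count occurrences of each listed symbol in one LaTeX source file."""
    counts = Counter()
    for symbol, pattern in PATTERNS.items():
        counts[symbol] = len(pattern.findall(tex_source))
    return counts

tex = r"Let $\alpha \geq 0$ and note that $\alpha \rightarrow \omega$."
counts = count_symbols(tex)
print(counts[r"\alpha"], counts[r"\geq"])  # 2 1
```

      <p>Summing these per-file counters over all papers in a field yields the field-level symbol frequency distribution used below.</p>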
      <p>The jargon distance (Eij) between fields i and j is determined by calculating
the ratio between two quantities: (1) the entropy H of a random variable Xi with a
probability distribution pi based on the frequency of mathematical symbols within
field i, and (2) the cross entropy Q between the distributions of fields i and j
(the cross entropy is the entropy of Xi plus the Kullback-Leibler divergence
between pi and pj [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]):</p>
      <p>Eij = H(Xi) / Q(pi||pj) = [ Σ_{x∈X} pi(x) log2 pi(x) ] / [ Σ_{x∈X} pi(x) log2 pj(x) ]   (1)</p>
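      <p>Equation (1) can be sketched in a few lines. The small smoothing mass added to the denominator distribution is our assumption (it keeps the cross entropy finite when field j never uses a symbol that field i uses), and the symbol counts are hypothetical:</p>

```python
import math
from collections import Counter

def jargon_distance(counts_i, counts_j, eps=1e-9):
    """Eij = H(Xi) / Q(pi||pj): entropy of field i's symbol distribution
    over the cross entropy of that distribution against field j's.
    eps is a smoothing mass (an assumption of this sketch)."""
    symbols = set(counts_i) | set(counts_j)
    total_i = sum(counts_i.values())
    total_j = sum(counts_j.values()) + eps * len(symbols)
    h = q = 0.0
    for s in symbols:
        p_i = counts_i[s] / total_i
        if p_i == 0:
            continue  # terms with pi(x) = 0 contribute nothing
        p_j = (counts_j[s] + eps) / total_j
        h -= p_i * math.log2(p_i)
        q -= p_i * math.log2(p_j)
    return h / q

a = Counter({r"\alpha": 50, r"\geq": 50})
b = Counter({r"\alpha": 50, r"\times": 50})
print(round(jargon_distance(a, a), 6))  # 1.0: identical distributions decode perfectly
print(jargon_distance(a, b) < 1.0)      # True: divergent notation lowers efficiency
```

      <p>Because the cross entropy is always at least the entropy, Eij lies in (0, 1], with 1 meaning no communication barrier.</p>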
      <p>
        This calculation derives from a general communication heuristic in which
a mathematician communicates with another mathematician via a channel [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
The mathematician writing the formulae in field i has a codebook Pi that maps
mathematical concepts to codewords, which the mathematician in field j, with
codebook Pj, has to decode. Note that the metric is not symmetric: field j may
decode concepts in field i more efficiently than field i decodes concepts in field j. The model
assumes that the codebooks are optimized based on the frequency of different
terms. This power-law assumption holds well for English words [
        <xref ref-type="bibr" rid="ref15 ref16">15,16</xref>
        ]. It also
seems to hold for mathematical terms; we find a Zipfian distribution in our
sample (see Figure 1).
      </p>
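      <p>A quick way to check the Zipfian claim without plotting is to fit the slope of log frequency against log rank; a slope near -1 is the classic Zipf signature. This sketch uses synthetic frequencies, not the arXiv sample:</p>

```python
import math

def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# An exact Zipf law f(r) = C / r has slope -1 on log-log axes.
print(round(zipf_slope([1000 / r for r in range(1, 101)]), 6))  # -1.0
```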
      <p>This simple metric of communication efficiency has advantages. First, it is based on
a model of communication, which has firm theoretical foundations and additional
tools to build upon. Second, it is easy to calculate and can be run over an entire corpus
in a relatively short amount of time (we calculated distances for 200k papers in less
than 5 minutes on a micro EC2 instance on Amazon Web Services).</p>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>
        We computed the jargon distance between each pair of fields. Using the resulting distance
matrix, we applied standard hierarchical clustering methods
in order to infer which fields were most like each other. We visualized the
groupings using dendrograms. The dendrogram in Figure 2 was produced using
a hierarchical clustering algorithm implemented in SciPy’s linkage function. We
used UPGMA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], an agglomerative method that uses average linkage as
its criterion. The clustering is done on the adjacency matrix of fields, where Eij
is the distance. Although the distance metric is not symmetric, for the clustering
we symmetrize the matrix by taking the average of the distances Eij and Eji.
      </p>
      <p>
        We find a separation between the theoretical and the experimental sciences. Nuclear
Experiment is most similar to High Energy Physics - Experiment (see Figure 2).
Computer Science is an outlier to the rest of the fields. Mathematics
shares a closer branch with Mathematical Physics than with any discipline
other than Quantum Physics. Quantitative Biology sits on its own, but within
the branch that includes Condensed Matter, Physics, and Nonlinear Sciences, among
others. We need to further investigate whether these similarities are real.
      </p>
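      <p>The symmetrization and UPGMA steps described above can be sketched with SciPy's linkage function. The four fields and the distance matrix are hypothetical stand-ins for the real jargon-distance matrix:</p>

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical jargon-distance matrix: E[i][j] is the distance from field i to field j.
fields = ["hep-ex", "nucl-ex", "math", "cs"]
E = np.array([
    [0.00, 0.10, 0.60, 0.70],
    [0.12, 0.00, 0.55, 0.65],
    [0.58, 0.52, 0.00, 0.40],
    [0.72, 0.68, 0.42, 0.00],
])

# Symmetrize as in the text: average Eij and Eji, and zero the diagonal.
S = (E + E.T) / 2.0
np.fill_diagonal(S, 0.0)

# UPGMA = agglomerative clustering with average linkage on the condensed matrix.
Z = linkage(squareform(S), method="average")

# Cutting the tree into two groups separates the experimental pair from the rest.
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(fields, labels)))
```

      <p>The linkage matrix Z can then be passed to scipy.cluster.hierarchy.dendrogram to draw a tree like the one in Figure 2.</p>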
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>
        Scientific papers contain many features that are used in search engines and
recommendation algorithms: authors, titles, full text, citations, figures, etc. One
feature largely ignored within the digital library community (at least relative
to the other features) is equations. In this paper, we apply a model of
communication efficiency between groups of papers first proposed by Vilhena et al.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. We find that field relationships (i.e., how fields are grouped hierarchically)
are recapitulated when using the jargon distance metric (Figure 2). The results
are surprising given the simplicity of the data extraction and distance
calculation: we use only isolated symbols, and the frequencies of those symbols, to infer
the groupings of fields. The resultant groups seem to assemble in logical ways
(e.g., the experimental sciences group together, while the more theoretical fields
assemble in another area of the tree). Nevertheless, we see these results as only
preliminary evidence for using mathematical symbols as a way of clustering papers and
topics.
      </p>
      <p>
        We need to further investigate the true differences between fields. We plan to use
citation-based clustering as another means of field designation. We also plan
to talk to scholars in the various fields to assess the validity of the clusters.
In addition, we will extend our analysis to the full arXiv corpus. For corpora
with no LaTeX available, we plan to use computer vision techniques from the
viziometrics.org project for automatically extracting equations from PDFs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
We also plan to compare the jargon method to other well-known methods such as
cosine similarity [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], LDA [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and word2vec [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The primary difference we see
from these methods is the communication theory underlying the jargon metric,
but analytic work is needed to make this argument. In addition, we
plan to expand beyond isolated symbols and analyze mathematical grammar.
      </p>
      <p>Assuming the methods hold, our ultimate goal for this research project is
to integrate our methods into existing recommendation engines at the scale of
micro-fields. It is at these finer scales that the method could bridge seemingly
disparate, emerging fields that use similar mathematical language.</p>
      <p>We also plan to extend this analysis to Science of Science questions,
investigating the birth and death of ideas and the sociology surrounding those ideas.
We see equations as an effective way of tracing ideas both forwards and
backwards in time. The relative stability of equations and mathematical language
provides a unique opportunity for tracking the movement of ideas across time
and across disciplines.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>We would like to thank three anonymous reviewers for their helpful feedback.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M.I.</given-names>
          </string-name>
          :
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research 3</source>
          ,
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Levy</surname>
            ,
            <given-names>O.:</given-names>
          </string-name>
          <article-title>word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method</article-title>
          .
          <source>arXiv preprint arXiv:1402.3722</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kantor</surname>
            ,
            <given-names>P.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rokach</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ricci</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shapira</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Recommender systems handbook</article-title>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kullback</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leibler</surname>
            ,
            <given-names>R.A.</given-names>
          </string-name>
          :
          <article-title>On information and sufficiency</article-title>
          .
          <source>The Annals of Mathematical Statistics</source>
          <volume>22</volume>
          (
          <issue>1</issue>
          ),
          <fpage>79</fpage>
          -
          <lpage>86</lpage>
          (
          <year>1951</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Lü</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Medo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Recommender systems</article-title>
          . p.
          <volume>97</volume>
          (
          <year>Feb 2012</year>
          ), http://arxiv.org/abs/1202.1112
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Howe</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Viziometrix: A platform for analyzing the visual information in big scholarly data</article-title>
          .
          <source>In: Proceedings of the 25th International Conference on World Wide Web. ACM</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Shannon</surname>
            ,
            <given-names>C.E.</given-names>
          </string-name>
          :
          <source>The mathematical theory of communication</source>
          , vol.
          <volume>27</volume>
          (
          <year>1948</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Small</surname>
            ,
            <given-names>H.:</given-names>
          </string-name>
          <article-title>Co-Citation in Scientific Literature: A new measure of the relationship between two documents</article-title>
          .
          <source>Journal of the American Society for Information Science</source>
          <volume>24</volume>
          (
          <issue>4</issue>
          ),
          <fpage>265</fpage>
          -
          <lpage>269</lpage>
          (
          <year>1973</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sokal</surname>
            ,
            <given-names>R.R.:</given-names>
          </string-name>
          <article-title>A statistical method for evaluating systematic relationships</article-title>
          .
          <source>Univ Kans Sci Bull</source>
          <volume>38</volume>
          ,
          <fpage>1409</fpage>
          -
          <lpage>1438</lpage>
          (
          <year>1958</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Steinbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karypis</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , et al.:
          <article-title>A comparison of document clustering techniques</article-title>
          .
          <source>In: KDD workshop on text mining</source>
          . vol.
          <volume>400</volume>
          , pp.
          <fpage>525</fpage>
          -
          <lpage>526</lpage>
          . Boston (
          <year>2000</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Vilhena</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosvall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Evans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergstrom</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Finding cultural holes: How structure and culture diverge in networks of scholarly communication</article-title>
          .
          <source>Sociological Science</source>
          <volume>1</volume>
          (
          <issue>June</issue>
          ),
          <fpage>221</fpage>
          -
          <lpage>238</lpage>
          (
          <year>2014</year>
          ), http://www.sociologicalscience.com/articles-vol1-15-221/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D.M.:</given-names>
          </string-name>
          <article-title>Collaborative topic modeling for recommending scientific articles</article-title>
          .
          <source>In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining</source>
          . pp.
          <fpage>448</fpage>
          -
          <lpage>456</lpage>
          . ACM (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wesley-Smith</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dandrea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>An experimental platform for scholarly article recommendation</article-title>
          .
          <source>In: Proc. of the 2nd Workshop on Bibliometric-enhanced Information Retrieval (BIR2015)</source>
          . pp.
          <fpage>30</fpage>
          -
          <lpage>39</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>West</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wesley-Smith</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergstrom</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A recommendation system based on hierarchical clustering of an article-level citation network</article-title>
          .
          <source>IEEE Transactions on Big Data (in press)</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>The psycho-biology of language</article-title>
          . Houghton Mifflin (
          <year>1935</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Zipf</surname>
            ,
            <given-names>G.K.</given-names>
          </string-name>
          :
          <article-title>Human behavior and the principle of least effort</article-title>
          . Addison-Wesley Press (
          <year>1949</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>