<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mathematical Formula Representation via Tree Embeddings</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Zichao</forename><surname>Wang</surname></persName>
							<email>jzwang@rice.edu</email>
							<affiliation key="aff0">
								<orgName type="institution">Rice University</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andrew</forename><surname>Lan</surname></persName>
							<email>andrewlan@cs.umass.edu</email>
							<affiliation key="aff1">
								<orgName type="institution">University of Massachusetts Amherst</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Richard</forename><surname>Baraniuk</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Rice University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Mathematical Formula Representation via Tree Embeddings</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">200E38048304853E4C0D6C89C76B79CF</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T00:17+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We propose a new framework for learning formula representations using tree embeddings to facilitate search and similar-content retrieval in textbooks containing mathematical (and possibly other types of) formulae. By representing each symbolic formula (such as a math equation) as an operator tree, we can explicitly capture its inherent structural and semantic properties. Our framework consists of a tree encoder that encodes a formula's operator tree into a vector and a tree decoder that generates a formula in operator tree format from a vector. To improve the quality of formula tree generation, we develop a novel tree beam search algorithm that is of independent scientific interest. We validate our framework on a formula reconstruction task and a similar formula retrieval task on a new real-world dataset of over 770k formulae collected online. Our experimental results show that our framework significantly outperforms various baselines.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recent years have seen an increasing proliferation of mathematical language such as equations and formulae. Table <ref type="table" target="#tab_0">1</ref> shows a few examples of formulae, not only from mathematics but also from other scientific subjects, that often appear in science, technology, engineering, and math (STEM) textbooks.</p><p>A number of practical tasks have recently gained traction because of the ubiquitous presence of mathematical language. For example, one common task is similar formula retrieval, i.e., finding relevant formulae similar to a query formula (e.g., <ref type="bibr" target="#b4">[5]</ref>). This task arises in a wide range of scenarios, such as when students look for relevant math content in a textbook when doing algebra homework. Another common task is automatic formula generation, which arises in scenarios such as formula auto-completion and math summary and headline generation <ref type="bibr" target="#b23">[24]</ref>. Both tasks are labor-intensive and time-consuming when performed manually; an automatic method that tackles them would be of great benefit. In this work, we focus on mathematical language processing (MLP), which involves the formula representation problem, i.e., processing a formula into an appropriate format for downstream tasks such as similar formula retrieval and formula generation.   Existing research mostly focuses on either the retrieval or the generation task but rarely both. In terms of formula retrieval, an emerging line of research explores the idea of symbolic tree representation. Indeed, mathematical formulae are inherently hierarchical, and tree structures are appropriate for organizing the math symbols in a formula. Compared to representing a formula simply as a sequence of math symbols, the symbolic tree representation has the advantage of encoding both the semantics and the inherent hierarchical structure of a formula. 
Similar to works in natural language processing (NLP) that leverage inherent language structure (e.g., <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b15">16,</ref><ref type="bibr" target="#b21">22,</ref><ref type="bibr" target="#b11">12,</ref><ref type="bibr" target="#b17">18]</ref>), a number of recent works in formula retrieval exploit a formula's unique structural properties, leading to improved results <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b27">28,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b26">27]</ref> compared to other formula representations, e.g., <ref type="bibr" target="#b5">[6]</ref>. However, none of the aforementioned works is capable of generating a formula.</p><p>In terms of formula generation, existing research usually combines processing mathematical and natural language. For example, <ref type="bibr" target="#b22">[23]</ref> trains a topic model on scientific documents and learns the keywords (topics) of a formula. <ref type="bibr" target="#b23">[24]</ref> generates a headline from mathematical questions. However, these works treat formulae as sequences of math symbols and thus do not leverage a formula's inherent tree structure. Some other works focus on solving math problems, i.e., generating a solution for an input formula <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b14">15]</ref>. However, these works are fully supervised, i.e., they rely on large, labeled datasets that are difficult to collect. To overcome the data issue, these works design methods to artificially generate formulae. However, such synthetic data is simple and does not cover many complicated formulae that occur in practice (e.g., matrices), limiting the generalization capability of models trained on such data.</p><p>Contributions. We propose FORTE, a novel unsupervised framework for Mathematical FOrmula Representation learning via Tree Embeddings. 
Our framework fully exploits the tree structure of math formulae to enable both effective formula encoding and formula generation. FORTE consists of two key components.</p><p>First, a tree encoder encodes a formula tree into an embedding that can benefit various downstream tasks, e.g., formula retrieval. Second, a tree decoder generates a formula tree from an embedding. We also propose a novel tree beam search algorithm that extends the beam search used in sequence-to-sequence models (e.g., <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b0">1]</ref>) to improve generation quality for tree-structured data. To evaluate our framework, we have collected a dataset of over 770k formulae, to our knowledge the largest to date, from professional, real-world sources such as Wikipedia and arXiv articles. On a formula autoencoding task and a formula retrieval task, we show that our framework (sometimes significantly) outperforms existing methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">The FORTE Framework</head><p>We now present our FORTE framework. We first describe the tree representation of a formula (the operator tree), which forms the foundation of our framework. We then set up the MLP problem and introduce the FORTE components: the tree encoder, the tree decoder, and tree beam search. Figures <ref type="figure" target="#fig_1">1</ref> and <ref type="figure" target="#fig_3">2</ref> together provide a high-level overview of our framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Formulae As Operator Trees</head><p>Every formula X is inherently tree structured <ref type="bibr" target="#b25">[26,</ref><ref type="bibr" target="#b4">5]</ref> and can be represented as a symbolic operator tree (OT):</p><formula xml:id="formula_0">X = (U, &lt;), u ∈ U, U ⊆ V<label>(1)</label></formula><p>where u is a math symbol (a node in the OT), <ref type="foot" target="#foot_0">1</ref> U is the set of math symbols in the operator tree X, and V is the "vocabulary", i.e., the set of all unique math symbols in the dataset. &lt; represents the partial binary parent-child relation over the nodes u ∈ U <ref type="bibr" target="#b7">[8]</ref>. Procedure (a) in Fig. <ref type="figure" target="#fig_1">1</ref> illustrates the conversion from a formula to its OT. Intuitively, the OT organizes the math symbols in a formula, such as operators, variables, and numerical values, as nodes in an explicit, hierarchical tree structure. We choose the OT because of its intuitive interpretation and rich semantics. We emphasize that our FORTE framework is agnostic to the underlying tree representation; other tree representations such as the symbol layout tree <ref type="bibr" target="#b4">[5]</ref> can also be used.</p><p>Math Symbol Vocabulary. The size of the math symbol vocabulary V may be unbounded (e.g., every element in the real number set R, which is uncountably infinite, could be an element of V); however, most symbols rarely appear. We thus propose the following truncation method in order to work with a finite vocabulary in practice. First, we partition the vocabulary V into five disjoint sub-vocabularies according to symbol type: numeric V_num (numbers, decimals), functional V_fun (multiplication, subtraction, etc.), variable V_var, textual V_txt, and other V_o. We do so because different types of math symbols carry different semantic meanings. 
Then, we retain only the K most frequent symbols in each sub-vocabulary and convert all others to an "unknown" symbol specific to that type. This setup yields a finite vocabulary in practice while preserving the type-level semantics of infrequent symbols.</p></div>
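As an illustrative sketch (not code from the paper), the per-type vocabulary truncation described above could be implemented as follows; the type labels, the "UNK_" token convention, and the `top_k` parameter are our own naming assumptions:

```python
from collections import Counter

def build_vocabulary(symbols_with_types, top_k):
    """Keep the top_k most frequent symbols per type (sketch of Sec. 2.1).

    symbols_with_types: iterable of (symbol, type) pairs, where type is
    one of "num", "fun", "var", "txt", "o".
    """
    counts = {}  # type -> Counter over symbols of that type
    for sym, typ in symbols_with_types:
        counts.setdefault(typ, Counter())[sym] += 1

    # Retain only the top_k most frequent symbols in each sub-vocabulary.
    return {typ: {s for s, _ in ctr.most_common(top_k)}
            for typ, ctr in counts.items()}

def lookup(sym, typ, vocab):
    # Infrequent symbols collapse to an "unknown" token that still carries
    # the symbol's type, preserving type-level semantics.
    return sym if sym in vocab.get(typ, set()) else "UNK_" + typ
```

A symbol outside the retained set maps to a type-specific unknown token, so, e.g., a rare variable and a rare function symbol remain distinguishable.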
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">The MLP Problem Formulation</head><p>We set up the MLP problem as an unsupervised "autoencoding" task (see, e.g., Ch. 14 in <ref type="bibr" target="#b6">[7]</ref>), motivated by the downstream tasks that we envision our framework will perform. Specifically, our framework aims to reconstruct the input formula in its OT representation through an encoder-decoder bottleneck model design. This problem setup allows us to use the latent embedding from the encoder output for many downstream tasks, e.g., formula retrieval, and the generated formula from the decoder output for generation-related tasks.</p><p>Concretely, the training objective of our framework is</p><formula xml:id="formula_1">L(θ, φ) = − (1/N) Σ_{i=1}^{N} P_x(X_out^(i)) log f_d(f_e(X_in^(i); θ); φ),<label>(2)</label></formula><p>where N is the number of formulae, f_e is the encoder function with parameters θ, f_d is the decoder function with parameters φ, and P_x is the empirical data distribution.</p><formula xml:id="formula_2">X_in^(i) and X_out^(i)</formula><p>are the input and output representations of the i-th formula tree in our new dataset, which we will introduce in Sec. 3. We drop the data point index i in the remainder of the paper for simplicity of exposition.</p></div>
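Under a teacher-forced reading of the objective in Eq. (2), the loss reduces to an average negative log-likelihood of the target tree's nodes. A minimal sketch, assuming the decoder's predicted probabilities for the ground-truth target nodes are already available (the data layout here is our own assumption):

```python
import math

def autoencoding_loss(batch_probs):
    """Negative log-likelihood sketch of Eq. (2), averaged over a batch.

    batch_probs: for each formula i, a list of the decoder's predicted
    probabilities for the ground-truth target nodes of X_out^(i),
    obtained under teacher forcing.
    """
    total = 0.0
    for probs in batch_probs:
        # Log-likelihood of one formula tree is the sum of per-node log-probs.
        total += sum(math.log(p) for p in probs)
    return -total / len(batch_probs)  # average over the N formulae
```

A perfect decoder (probability 1 on every target node) yields zero loss; lower-probability predictions increase it.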
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Formula Tree Encoder</head><p>Our tree encoder takes a formula tree as input and outputs an embedding of this formula. The key idea is to properly encode all information underlying the formula tree. To this end, we use two methods: tree traversal, which extracts content (node) information, and tree positional encoding, which extracts structural (relative positions of nodes) information. Figure <ref type="figure" target="#fig_1">1</ref> provides an overview of our formula tree encoder.</p><p>Formula Tree Traversal and Node Embedding. To obtain the content in a formula tree, we employ tree traversal, which visits each node in the tree in a particular order and extracts its content. Process (b) in Fig. <ref type="figure" target="#fig_1">1</ref> illustrates this process. In this work, we consider traversal using depth-first search (DFS), although other traversal orders can also be used. This step returns a DFS-ordered list of nodes {u_t}_{t=1}^{T}, where t is the position of node u_t in the DFS order and T is the number of nodes in the formula tree. Each node is then represented as a trainable embedding x_t ∈ R^M with dimension M.</p><p>Tree Positional Embedding. To extract the structure of a formula tree, we propose a two-step method that first retains and then embeds the relative positions of nodes in the tree. First, we recursively encode the position of node u_t as q_t ∈ R^{ℓ_t}, where ℓ_t is the depth of node u_t in the tree. q_t is composed of the node's parent's position appended with the node's relative position among its siblings; q_t = [0] for the root node. The formula tree in Fig. <ref type="figure" target="#fig_1">1</ref> illustrates this step. For example, the position [0, 1, 1] of the numeric node "4" is composed of [0, 1], which is its parent's position, and [1], because it is the second child of its parent. 
Second, we propose a binary tree positional embedding to convert the encoded position q_t of each node to a fixed-dimensional vector p_t ∈ R^D, where</p><formula xml:id="formula_3">p_t[⌈log_2(C)⌉ j : ⌈log_2(C)⌉ (j+1)] = bin(q_t[j]), ∀ j = 0, . . . , ℓ_t.<label>(3)</label></formula><p>Here, C is the maximum degree (i.e., the maximum number of children of any node) over all trees in the dataset, ⌈•⌉ is the ceiling function, bin(•) is the binarization operator (e.g., bin(5) = 101), and p_t[a : b] selects indices a through b − 1 of the vector p_t. The resulting dimension of the tree positional embedding p_t is D = L⌈log_2(C)⌉, where L is the maximum depth over all trees in the dataset.</p><p>Formula Tree Embedding. To transform the formula tree into its embedding, we utilize an embedding function f_e : R^{(M+D)×T} → R^K, where K is the dimension of the formula tree embedding and T is the total number of nodes in the formula tree. Because M is not necessarily the same as D, we concatenate (rather than add) the node and tree positional embeddings. Concretely, the formula tree embedding is computed as</p><formula xml:id="formula_4">h = f_e({x̃_t}_{t=1}^{T}; θ), x̃_t = [x_t; p_t].<label>(4)</label></formula><p>There are many options for instantiating the embedding function f_e because, thanks to our node and tree positional embedding methods, the tree content and structure are fully preserved in the encoded input sequence. In this work, we use a gated recurrent unit network (GRU) <ref type="bibr" target="#b3">[4]</ref> for f_e, but one can freely choose other appropriate models.</p><p>Relation to Prior Work. Our tree encoder design differs from existing approaches in two regards. First, compared to <ref type="bibr" target="#b19">[20,</ref><ref type="bibr" target="#b2">3]</ref>, which perform tree traversal during training and thus only allow a single data point per iteration, our encoder performs traversal before training, which enables mini-batch processing during training. 
As a result, our approach removes this computationally expensive traversal step from the training process and significantly speeds up training. Second, compared to <ref type="bibr" target="#b16">[17]</ref>, which uses a one-hot-style tree positional embedding, our encoder employs a binary tree positional embedding that reduces the space complexity from O(LC) to O(L⌈log_2(C)⌉). This reduction is especially significant for trees with a large maximum degree.</p></div>
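The two encoder steps above, the recursive position encoding and the binary positional embedding of Eq. (3), can be sketched as follows; the nested-tuple tree format and the function names are our own assumptions:

```python
import math

def node_positions(tree):
    """DFS over a (label, [children]) tree; returns (label, q) pairs where
    q is the recursive position of Sec. 2.3: the parent's position with the
    node's child index appended, and [0] for the root."""
    out = []
    def dfs(node, q):
        label, children = node
        out.append((label, q))
        for i, child in enumerate(children):
            dfs(child, q + [i])
    dfs(tree, [0])
    return out

def binary_positional_embedding(q, max_degree, max_depth):
    """Eq. (3) sketch: write each coordinate q[j] in ceil(log2(C)) bits
    into the j-th slot of a D = L * ceil(log2(C)) dimensional vector."""
    width = math.ceil(math.log2(max_degree))
    p = [0] * (max_depth * width)
    for j, coord in enumerate(q):
        bits = format(coord, "b").zfill(width)[-width:]  # fixed-width binary
        for k, b in enumerate(bits):
            p[j * width + k] = int(b)
    return p
```

For the paper's running example, the node at position [0, 1, 1] with C = 4 and L = 3 maps to the 6-dimensional vector [0, 0, 0, 1, 0, 1].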
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Formula Tree Decoder</head><p>The decoder takes a formula embedding vector, i.e., the output from our tree encoder, as input and generates a formula tree as output. We face two main challenges when designing the decoder. First, the terminating condition of tree generation is unclear, because the formula tree can have multiple leaf nodes, each of which terminates the generation of a single branch but not necessarily of the entire tree. Second, the order of generation, i.e., which node to generate next, is unclear, because there are at least two directions at each node: its siblings (horizontal) and its children (vertical). We now present our decoder, which tackles these two challenges by modifying the decoder target relative to the encoder input (recall that in a usual autoencoding task, the input and target are exactly the same) and by generating via tree traversal. We also present our novel tree beam search generation algorithm, which improves formula tree generation quality at test time.</p><p>Modified Decoder Target Tree. To address the challenge of the unclear generation termination condition, we propose to slightly modify the decoder target relative to the input formula tree by attaching an additional special "end" node to each node as its last child. Doing so informs the decoder when a node has no more children to generate and provides a clear generation termination signal for each node. Figure <ref type="figure" target="#fig_3">2a</ref> compares the encoder's input and the decoder's target for the same formula tree.</p><p>Generation Via Tree Traversal. To address the challenge of the unclear generation order, we propose to traverse the tree during generation in the same order as the encoder. A data structure is required to track the traversal process. Therefore, for an encoder using DFS traversal, we propose to use a stack that maintains the nodes whose generation is unfinished, in DFS order. 
In addition to the node itself, the stack also maintains the number of children and the position of each node in the stack.</p><p>During generation, we use the current node's embedding x̃_t together with the next node's position p_{t+1} to generate the next node. We do so because a node can have multiple children; therefore, we use the next node's position to inform the model which node to generate next. This is a major difference from typical sequence generation models, e.g., Transformers, in which the input position is the input token's own position. Concretely, the next node is generated as</p><formula xml:id="formula_5">u_{t+1} = argmax_{u ∈ V} softmax(f_d({x̃_s}_{s=1}^{t}; φ)),<label>(5)</label></formula><p>where s = 1, . . . , t and x̃_s = [x_s; p_{s+1}; h], i.e., we also concatenate the formula tree embedding h to the decoder's input at each generation step. The next node's position p_{s+1} is computed using the number of children and the position of the current input node; see Section 2.3 for details. When a node's generation finishes, as signaled by the generation of an "end" node, this node is removed from the stack. The "end" node itself is never added to the stack. The entire tree generation is finished when the stack is empty, i.e., when there are no more nodes to expand. We use a special "start" token to mark the beginning of the generation process. Figure <ref type="figure" target="#fig_3">2b</ref> illustrates the generation process.</p><p>Tree Beam Search for Tree Generation. The above generation process is a greedy algorithm, which may be sub-optimal. To improve tree generation quality, we propose tree beam search (TBS). The idea is to maintain a stack for each beam, which records the node generation order. During the generation process, the decoder generates the top B most probable next nodes from the current generated tree of each beam, resulting in a total of B^2 candidate trees. We then select the B most probable candidate trees to expand further. 
This process continues until B trees finish generation or until a preset maximum number of steps has been reached. TBS extends beam search in NLP (e.g., in <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b0">1]</ref>), which concerns only sequential data and cannot generate tree-structured data. Similar to beam search in NLP, by expanding the search space of candidate trees by a factor of B, TBS enables more flexible and higher-quality generation during decoding than greedy generation.</p><p>Relation to Prior Work. To our knowledge, this is the first beam search algorithm for generating tree-structured data. Our work differs from <ref type="bibr" target="#b28">[29]</ref>, which proposed a tree beam search algorithm for recommender systems that treats the entire dataset as a tree, where each node is a data point (user, item); in our work, we treat each data point (formula) as a tree. Our work also differs from <ref type="bibr" target="#b16">[17]</ref>, which specifies the number of children each node must have; this significantly constrains the generation process, unnecessarily enlarges the node vocabulary, and leads to worse generation quality. In our work, there is no constraint on the number of children each node may have, resulting in flexible and varied generated formula trees.</p></div>
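The greedy variant of the stack-based decoding loop might look like the sketch below; `next_node_fn` is a hypothetical stand-in for the trained decoder of Eq. (5), "E" marks the special "end" node, and TBS would maintain B copies of this stack, one per beam:

```python
def generate_tree(next_node_fn, max_steps=100):
    """Greedy sketch of FORTE's stack-based decoding (Sec. 2.4).

    next_node_fn(generated, position) returns the next node label given
    the (label, position) pairs generated so far and the next position to
    fill; it returns "E" when a node has no more children. Returns the
    generated tree as nested (label, [children]) pairs."""
    root_pos = [0]
    root_label = next_node_fn([], root_pos)
    root = (root_label, [])
    stack = [(root, root_pos)]               # unfinished nodes, DFS order
    generated = [(root_label, root_pos)]
    for _ in range(max_steps):
        if not stack:
            break                            # empty stack: tree is complete
        node, pos = stack[-1]
        next_pos = pos + [len(node[1])]      # position of node's next child
        label = next_node_fn(generated, next_pos)
        if label == "E":
            stack.pop()                      # "end" node: this node is done;
            continue                         # "E" itself is never pushed
        child = (label, [])
        node[1].append(child)
        generated.append((label, next_pos))
        stack.append((child, next_pos))      # descend, preserving DFS order
    return root
```

Feeding a scripted DFS node sequence with end markers, e.g. "+", "a", "E", "b", "E", "E", reconstructs the tree with root "+" and children "a" and "b".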
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experiments</head><p>We conduct two experiments to validate FORTE. In the first experiment, we demonstrate the advantage of FORTE over other tree- and sequence-based formula generation methods on the formula reconstruction task. In the second experiment, we demonstrate an application of FORTE in the context of the formula retrieval task and show its advantage over existing formula retrieval systems. In all experiments, our framework uses a 2-layer bidirectional GRU for the encoder and a 2-layer unidirectional GRU for the decoder.</p><p>Dataset. We collected a large real-world dataset of more than 770k formulae from a subset of articles on Wikipedia and arXiv. We extracted formulae from these articles and processed them into OT representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Formula Reconstruction</head><p>In this experiment, we test FORTE's ability to reconstruct a formula. Because some baselines only work on binary trees <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b16">17]</ref>, we select a subset of 170k formulae whose operator trees are binary.</p><p>Baselines. We consider the following baselines: seq2seqRNN, which implements the same encoder and decoder as our framework but processes formulae as sequences of math symbols; tree2treeRNN <ref type="bibr" target="#b2">[3]</ref>, an RNN-based method capable of encoding and decoding only binary trees; and treeTransformer <ref type="bibr" target="#b16">[17]</ref>, a Transformer-based method that shows success only on binary trees. The latter two baselines were originally developed and evaluated on a task (program translation) very different from MLP. We also include four variants of our framework to evaluate the utility of (1) the binary versus the one-hot tree positional embedding and (2) TBS versus greedy search for tree generation. We construct training, validation, and test sets by splitting the 170k dataset 80%-10%-10%. We train each model 5 times for 50 epochs each and record the model with the best performance on the validation set. We then perform formula reconstruction on the test set using beam size B = 10 for the applicable methods and report the average values of the following metrics on the test set.</p><p>Evaluation Metrics. We use two groups of metrics. The first group measures the reconstruction accuracy, i.e., the percentage of generated formulae that exactly match the ground truth. We compute both ACC top-1, using only the most probable generated formula, and ACC top-5, using the five most probable generated formulae. The second group measures how much the generated formula tree differs from the ground-truth formula tree. 
We use tree edit distance (TED), which measures the distance between two trees as the minimum number of operations, including changing nodes and node connections, needed to convert one tree into the other. See <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b12">13]</ref> for an overview of the TED algorithm. We compute both TED-overall, which considers both node and connection edits, and TED-structural, which considers only connection edits.</p><p>Results. Table <ref type="table" target="#tab_2">2</ref> presents the formula reconstruction results comparing our framework with the baselines. Compared to the two tree2tree baselines, which struggle at this task, the encoder and decoder designs in FORTE enable near-perfect formula reconstruction. Compared to the seq2seqRNN baseline, FORTE shows that processing formulae as OTs improves over processing them as sequences. Moreover, the results from the four FORTE variants clearly demonstrate the benefits of the binary tree positional embedding and TBS, which improve all four metrics compared to the one-hot tree positional embedding and greedy search, respectively. We repeat this experiment on the full dataset comparing only seq2seqRNN and FORTE, since the tree-based baselines cannot process non-binary trees. FORTE achieves 85.87% versus seq2seqRNN's 84.30% on top-1 ACC and 90.30% versus seq2seqRNN's 88.52% on top-5 ACC. These results further show the benefit of representing formulae as OTs rather than as sequences.</p></div>
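The ACC top-k metrics above are simple to state precisely; a sketch, assuming each method returns its generated candidates ordered from most to least probable (e.g., the B beams from tree beam search):

```python
def reconstruction_accuracy(candidates, ground_truths, top_k=1):
    """ACC top-k sketch: the fraction of formulae whose ground-truth tree
    appears among the top_k most probable generated candidates.

    candidates: per formula, a list of generated trees ordered from most
    to least probable; ground_truths: the matching reference trees."""
    hits = sum(gt in cands[:top_k]
               for cands, gt in zip(candidates, ground_truths))
    return hits / len(ground_truths)
```

With top_k=1 this is exact-match accuracy of the single most probable generation; with top_k=5 it credits a hit anywhere in the five most probable generations.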
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Formula Retrieval</head><p>In this qualitative experiment, we evaluate FORTE's capabilities in a formula retrieval application. Given an input formula (query), a retrieval method aims to return the top relevant formulae (retrievals) from a collection of formulae. We use the entire 770k formula dataset to train our framework and then use the trained encoder to obtain an embedding for each formula. For each query, we compute the cosine similarity between its embedding and the embedding of each formula in the dataset. Finally, we choose the formulae with the highest similarity scores as the retrievals. We use the queries from the NTCIR-12 formula retrieval task <ref type="bibr" target="#b24">[25]</ref>. Table <ref type="table" target="#tab_0">1</ref> shows a few examples of the queries.</p><p>Model and Baselines. We consider three state-of-the-art baselines designed specifically for the formula retrieval task including Tangent-CFT <ref type="bibr" target="#b10">[11]</ref>, which is one of the few data-driven formula retrieval systems to date, and Tangent-S <ref type="bibr" target="#b4">[5]</ref> and Approach0 <ref type="bibr" target="#b27">[28]</ref>, both of which are based on symbolic sub-tree matching and are data independent. We train Tangent-CFT on the same dataset as FORTE.</p><p>Evaluation Metrics. Because it is difficult to algorithmically judge the relevance of a retrieval to a query, we perform a human evaluation for this task as follows. First, for each method and each query, we choose the top 25 retrieved formulae and mix them into a single pool of retrievals. Second, for each query, three human evaluators independently provide a ternary rating for each retrieval in the pool, i.e., whether the retrieval is relevant, half-relevant, or irrelevant to the query. The above evaluation procedure is consistent with <ref type="bibr" target="#b24">[25]</ref>, including the number of evaluators involved. 
We then use the mean average precision (MAP) <ref type="bibr" target="#b20">[21]</ref> and bpref <ref type="bibr" target="#b1">[2]</ref> as the evaluation metrics. Compared to other retrieval evaluation metrics, both MAP and bpref are easy to interpret and appropriate for evaluating multiple queries and for comparing multiple retrieval systems.</p><p>Results. Table <ref type="table" target="#tab_3">3</ref> presents the quantitative evaluation results, averaged over the three evaluators' scores. Both metrics indicate that our framework achieves superior performance to the other data-driven baseline, Tangent-CFT, and to one of the data-independent baselines, Tangent-S. Our method falls slightly behind Approach0. This may be because Approach0 performs explicit symbolic matching directly on subtrees, whereas we rely on more abstract vector representations of formulae, which may lose information during similarity computation. This result is consistent with the existing literature <ref type="bibr" target="#b10">[11]</ref>. We then propose a new model, FORTE-App, which combines the strengths of FORTE and Approach0. The last row of Table <ref type="table" target="#tab_3">3</ref> shows that the combined method achieves new state-of-the-art performance on the formula retrieval experiment. In particular, the bpref score implies that, on average, in FORTE's retrieved formulae, relevant formulae (as judged by the evaluators) rank higher than irrelevant ones more often than in those retrieved by the baselines.</p></div>
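The retrieval procedure, embed each formula, score by cosine similarity to the query embedding, and return the top matches, can be sketched as follows; the corpus layout is an assumption, and in practice the embeddings would come from the trained FORTE encoder:

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_emb, corpus, top_n=25):
    """Rank corpus formulae by cosine similarity to the query embedding
    (sketch of the retrieval setup in Sec. 3.2). corpus maps a formula
    identifier to its encoder embedding."""
    scored = sorted(corpus.items(),
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [fid for fid, _ in scored[:top_n]]
```

The top 25 retrievals per query produced this way would then be pooled and rated by the human evaluators as described above.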
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Conclusions and Future Work</head><p>In this work, we propose FORTE, a novel, unsupervised mathematical language processing framework that leverages tree embeddings. By encoding formulae as operator trees, we can explicitly capture the inherent structure and semantics of a formula. We propose an encoder and a decoder capable of embedding and generating formula trees, respectively, and a novel tree beam search algorithm to improve generation quality at test time. We evaluate our framework on formula reconstruction and formula retrieval tasks and demonstrate its superior performance compared to baselines in both experiments. There are many avenues for future work. One direction is to combine our framework's dedicated capability to encode and generate formulae with state-of-the-art NLP methods to enable cross-modality applications that involve both mathematical and natural language. For example, our framework can serve as a drop-in replacement for the formula processing part of a number of existing works to potentially improve performance, e.g., in <ref type="bibr" target="#b22">[23]</ref> for joint text and math retrieval, in <ref type="bibr" target="#b23">[24]</ref> for math headline generation, in <ref type="bibr" target="#b9">[10]</ref> for grading students' math homework solutions, and in <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b8">9]</ref> for neural math reasoning. We also plan to conduct a more thorough human evaluation for relevant formula retrieval.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>N</head><label></label><figDesc>= 0.5 − log_2(frequency of this item / frequency of most common item) (physics); ^238_92 U + ^64_28 Ni → ^302_120 Ubn* → fission only (chemistry); ax^2 + bx + c = 0 (algebra)</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Fig. 1 :</head><label>1</label><figDesc>Fig. 1: Illustration of FORTE's encoding process of a formula.</figDesc><graphic coords="2,134.77,209.26,345.84,59.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>(a) Input and output formula tree. (b) The decoding process.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Fig. 2 :</head><label>2</label><figDesc>Fig. 2: (2a) Illustration of FORTE's input and output operator tree of the same formula. The "E" nodes represent the special "end" node attached as the last child to every node. (2b) Illustration of FORTE's decoding process at a particular time step. First, the position of the next node to be generated is computed (dark blue). Next, the next node (light blue) is generated by the decoder using the already generated nodes and positions and the newly computed position. Finally, the partial tree and the stack are updated.</figDesc><graphic coords="3,151.62,222.09,315.19,82.54" type="bitmap" /></figure>
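The decoding loop described in Fig. 2b can be sketched as follows. This is a minimal illustration of the position/stack bookkeeping only, not FORTE's actual model: `next_node` is a hypothetical stub that replays a fixed pre-order decision stream in place of the trained decoder.

```python
END = "E"  # special "end" node closing each node's child list (Fig. 2a)

def decode_tree(next_node):
    """Rebuild an operator tree from a stream of node-label decisions.

    The stack holds nodes whose child lists are still open; the position of
    the next node is always "next child of the stack top".
    """
    root = {"label": next_node(), "children": []}
    stack = [root]                     # nodes with open child lists
    while stack:
        parent = stack[-1]             # position of next node (dark blue)
        label = next_node()            # generate the next node (light blue)
        if label == END:
            stack.pop()                # close this node's child list
        else:
            child = {"label": label, "children": []}
            parent["children"].append(child)   # update the partial tree
            stack.append(child)                # and the stack
    return root

def replay(stream):
    """Stub decoder: emits a fixed stream of labels, one per call."""
    it = iter(stream)
    return lambda: next(it)

# Example: the pre-order stream for the formula a + b with end markers.
tree = decode_tree(replay(["+", "a", "E", "b", "E", "E"]))
```

Under this bookkeeping, the stream `["+", "a", "E", "b", "E", "E"]` reconstructs a root "+" with children "a" and "b"; decoding terminates exactly when the root's child list is closed.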
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>A few examples of mathematical language in our context.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2 :</head><label>2</label><figDesc>Formula reconstruction results. FORTE outperforms all other methods.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3 :</head><label>3</label><figDesc>Formula retrieval results.</figDesc><table><row><cell>Methods</cell><cell>map</cell><cell>bpref</cell></row><row><cell>Approach0</cell><cell>0.486</cell><cell>0.507</cell></row><row><cell>Tangent-S</cell><cell>0.461</cell><cell>0.472</cell></row><row><cell>TangentCFT</cell><cell>0.462</cell><cell>0.464</cell></row><row><cell>FORTE</cell><cell>0.475</cell><cell>0.485</cell></row><row><cell>FORTE-App</cell><cell>0.509</cell><cell>0.513</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">We will refer to "math symbol" and "node" interchangeably depending on context.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<title level="m">Neural Machine Translation by Jointly Learning to Align and Translate</title>
				<imprint>
			<date type="published" when="2014-09">Sep 2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Retrieval evaluation with incomplete information</title>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</title>
				<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="25" to="32" />
		</imprint>
	</monogr>
	<note>SIGIR &apos;04</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Tree-to-tree neural networks for program translation</title>
		<author>
			<persName><forename type="first">X</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Neural Info. Process. Syst.</title>
				<meeting>Intl. Conf. Neural Info. Process. Syst.</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="2552" to="2562" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Learning phrase representations using RNN encoder-decoder for statistical machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Van Merriënboer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Gulcehre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Bahdanau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Bougares</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schwenk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Conf. Empirical Methods Natural Lang. Process</title>
				<meeting>Conf. Empirical Methods Natural Lang. Process.</meeting>
		<imprint>
			<date type="published" when="2014-10">Oct 2014</date>
			<biblScope unit="page" from="1724" to="1734" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Layout and semantics: Combining representations for mathematical formula search</title>
		<author>
			<persName><forename type="first">K</forename><surname>Davila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1165" to="1168" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language?</title>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tang</surname></persName>
		</author>
		<idno>arXiv e-prints</idno>
		<imprint>
			<date type="published" when="2017-07">Jul 2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<title level="m" type="main">Deep Learning</title>
		<author>
			<persName><forename type="first">I</forename><surname>Goodfellow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bengio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2016">2016</date>
			<publisher>MIT Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Set theory -an introduction to independence proofs</title>
		<author>
			<persName><forename type="first">K</forename><surname>Kunen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Studies in logic and the foundations of mathematics</title>
				<meeting><address><addrLine>North-Holland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1983">1983</date>
			<biblScope unit="volume">102</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Deep learning for symbolic mathematics</title>
		<author>
			<persName><forename type="first">G</forename><surname>Lample</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Charton</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=S1eZYeHFDS" />
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Learn. Representations</title>
				<meeting>Intl. Conf. Learn. Representations</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Mathematical language processing: Automatic grading and feedback for open response mathematical questions</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">S</forename><surname>Lan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Vats</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">E</forename><surname>Waters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">G</forename><surname>Baraniuk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. ACM Conf. Learn. @ Scale</title>
				<meeting>ACM Conf. Learn. @ Scale</meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="167" to="176" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">TangentCFT: An embedding model for mathematical formulas</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rohatgi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</title>
				<meeting>Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="11" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Tree-structured attention with hierarchical accumulation</title>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">P</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Joty</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=HJxK5pEYvr" />
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Learn. Representations</title>
				<meeting>Intl. Conf. Learn. Representations</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Efficient computation of the tree edit distance</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pawlik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Augsten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Trans. Database Syst</title>
		<imprint>
			<biblScope unit="volume">40</biblScope>
			<biblScope unit="issue">1</biblScope>
			<date type="published" when="2015-03">Mar 2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Tree edit distance: Robust and memory-efficient</title>
		<author>
			<persName><forename type="first">M</forename><surname>Pawlik</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Augsten</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Info. Syst</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="page" from="157" to="173" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Analysing mathematical reasoning abilities of neural models</title>
		<author>
			<persName><forename type="first">D</forename><surname>Saxton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Grefenstette</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Kohli</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=H1gR5iR5FX" />
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Learn. Representations</title>
				<meeting>Intl. Conf. Learn. Representations</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Ordered neurons: Integrating tree structures into recurrent neural networks</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Tan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sordoni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Courville</surname></persName>
		</author>
		<ptr target="https://openreview.net/forum?id=B1l6qiR5F7" />
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Learn. Representations</title>
				<meeting>Intl. Conf. Learn. Representations</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Novel positional encodings to enable tree-based transformers</title>
		<author>
			<persName><forename type="first">V</forename><surname>Shiv</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Quirk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Neural Info. Process. Syst</title>
				<meeting>Intl. Conf. Neural Info. Process. Syst.</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="12081" to="12091" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Transformer based multi-grained attention network for aspect-based sentiment analysis</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Han</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Cheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="page" from="211152" to="211163" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Sequence to sequence learning with neural networks</title>
		<author>
			<persName><forename type="first">I</forename><surname>Sutskever</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Neural Info. Process. Syst.</title>
				<meeting>Intl. Conf. Neural Info. Process. Syst.</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="3104" to="3112" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">S</forename><surname>Tai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Socher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">D</forename><surname>Manning</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2015-02">Feb 2015</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv e-prints</note>
</biblStruct>

<biblStruct xml:id="b20">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Harman</surname></persName>
		</author>
		<title level="m">TREC: Experiment and evaluation in information retrieval</title>
				<imprint>
			<publisher>MIT press Cambridge</publisher>
			<date type="published" when="2005">2005</date>
			<biblScope unit="volume">63</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Tree transformer: Integrating tree structures into self-attention</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">Y</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><forename type="middle">N</forename><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Conf. Empirical Methods Natural Lang. Process. and Intl. Joint Conf. Natural Lang. Process.</title>
				<meeting>Conf. Empirical Methods Natural Lang. Process. and Intl. Joint Conf. Natural Lang. Process.</meeting>
		<imprint>
			<date type="published" when="2019-11">Nov 2019</date>
			<biblScope unit="page" from="1061" to="1070" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">TopicEq: A Joint Topic and Mathematical Equation Model for Scientific Texts</title>
		<author>
			<persName><forename type="first">M</forename><surname>Yasunaga</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lafferty</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. AAAI Conf. Artificial Intell.</title>
				<meeting>AAAI Conf. Artificial Intell.</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Automatic generation of headlines for online math questions</title>
		<author>
			<persName><forename type="first">K</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. AAAI Conf. Artificial Intell.</title>
				<meeting>AAAI Conf. Artificial Intell.</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="9490" to="9497" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">NTCIR-12 MathIR task overview</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Aizawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kohlhase</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Ounis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Topic</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Davila</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. NTCIR Conf. Eval. Info. Access</title>
				<meeting>NTCIR Conf. Eval. Info. Access</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Recognition and retrieval of mathematical expressions</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Blostein</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Intl. J. Document Anal. Recognit</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="331" to="357" />
			<date type="published" when="2012-12">Dec 2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Accelerating substructure similarity search for formula retrieval</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rohatgi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. European Conf. Info. Retrieval</title>
				<meeting>European Conf. Info. Retrieval</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="714" to="727" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Structural similarity search for formulas using leaf-root paths in operator subtrees</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. European Conf. Info. Retrieval</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Azzopardi</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">B</forename><surname>Stein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Fuhr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Mayr</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Hauff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hiemstra</surname></persName>
		</editor>
		<meeting>European Conf. Info. Retrieval</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="116" to="129" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Learning optimal tree models under beam search</title>
		<author>
			<persName><forename type="first">J</forename><surname>Zhuo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Dai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Gai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. Intl. Conf. Mach. Learn</title>
				<meeting>Intl. Conf. Mach. Learn</meeting>
		<imprint>
			<date type="published" when="2020-07">Jul 2020</date>
			<biblScope unit="volume">119</biblScope>
			<biblScope unit="page" from="13" to="18" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
