<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Mathematical Formula Representation via Tree Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zichao Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Lan</string-name>
          <email>andrewlan@cs.umass.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard Baraniuk</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Rice University</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Massachusetts Amherst</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We propose a new framework for learning formula representations using tree embeddings to facilitate search and similar content retrieval in textbooks containing mathematical (and possibly other types of) formulae. By representing each symbolic formula (such as a math equation) as an operator tree, we can explicitly capture its inherent structural and semantic properties. Our framework consists of a tree encoder that encodes a formula's operator tree into a vector and a tree decoder that generates a formula from a vector in operator tree format. To improve the quality of formula tree generation, we develop a novel tree beam search algorithm that is of independent scientific interest. We validate our framework on a formula reconstruction task and a similar formula retrieval task on a new real-world dataset of over 770k formulae collected online. Our experimental results show that our framework significantly outperforms various baselines.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Recent years have seen an increasing proliferation of mathematical language such
as equations and formulae. Table 1 shows a few examples of formulae, not only
in mathematics but also in other scientific subjects, that often appear in science,
technology, engineering, and math (STEM) textbooks.</p>
      <p>
        A number of practical tasks have recently gained traction because of the
ubiquitous presence of mathematical language. For example, one common task is
similar formula retrieval, i.e., finding relevant formulae similar to a query formula
(e.g., [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). This task arises in a wide range of scenarios, such as when students look
for relevant math content in a textbook when doing algebra homework. Another
common task is automatic formula generation, which arises in scenarios such as
formula auto-completion and math summary and headline generation [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]. Both
these tasks are potentially labor-intensive and time-consuming; an automatic
method that tackles these tasks would be of great benefit. In this work, we
focus on mathematical language processing (MLP), which involves the formula
representation problem, i.e., processing a formula into an appropriate format for
downstream tasks such as similar formula retrieval and formula generation.
      </p>
      <p>[Table 1 examples: a logarithmic frequency-of-item formula N = ⌊0.5 log₂(·)⌋ (physics), a nuclear reaction equation (chemistry), and the quadratic equation ax² + bx + c = 0 (algebra).]</p>
      <p>
        Existing research mostly focuses on either the retrieval or the generation task
but rarely both. In terms of formula retrieval, an emerging line of research
explores the idea of symbolic tree representation. Indeed, mathematical formulae
are inherently hierarchical and tree structures are appropriate for organizing the
math symbols in a formula. Compared to representing a formula simply as a
sequence of math symbols, the symbolic tree representation has the advantage of
encoding both the semantics and the inherent hierarchical structure of a formula.
Similar to works in natural language processing (NLP) that leverage inherent
language structure (e.g., [
        <xref ref-type="bibr" rid="ref12 ref16 ref18 ref20 ref22">20,16,22,12,18</xref>
        ]), a number of recent works in formula
retrieval exploit a formula's unique structural properties, leading to improved
results [
        <xref ref-type="bibr" rid="ref11 ref27 ref28 ref5">5,28,11,27</xref>
        ] compared to other formula representations, e.g., [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However,
none of the aforementioned works is capable of generating a formula.
      </p>
      <p>
        In terms of formula generation, existing research usually combines processing
mathematical and natural language. For example, [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] trains a topic model on
scientific documents and learns the keywords (topics) of a formula. [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] generates
a headline from mathematical questions. However, these works treat formulae
as sequences of math symbols and thus do not leverage a formula's inherent tree
structure. Some other works focus on solving math problems, i.e., generating a
solution for an input formula [
        <xref ref-type="bibr" rid="ref15 ref9">9,15</xref>
        ]. However, these works are fully supervised,
i.e., they rely on large, labeled datasets that are difficult to collect. To overcome
the data issue, these works design methods to artificially generate formulae.
However, such synthetic data is simple and does not cover many complicated
formulae in practice (e.g., matrices), limiting the generalization capability of
models trained on such data.
      </p>
      <p>Contributions. We propose FORTE, a novel unsupervised framework for
Mathematical FOrmula Representation learning via Tree Embeddings. Our
framework fully exploits the tree structure of math formulae to enable both effective
formula encoding and formula generation. FORTE consists of two key components.</p>
      <p>[Fig. 2 caption: (a) Input and output formula tree. (b) The decoding process.]</p>
      <p>
        First, a tree encoder encodes a formula tree into an embedding that can benefit
various downstream tasks, e.g., formula retrieval. Second, a tree decoder
generates a formula tree from an embedding. We also propose a novel tree beam
search algorithm that extends the beam search in sequence-to-sequence
models (e.g., [
        <xref ref-type="bibr" rid="ref1 ref19">19,1</xref>
        ]) to improve the generation quality for tree-structured data. To
evaluate our framework, we have collected a dataset of over 770k formulae, the
largest to date to our knowledge, from professional, real-world sources such as
Wikipedia and arXiv articles. On a formula autoencoding task and a formula
retrieval task, we show that our framework (sometimes significantly) outperforms
existing methods.
      </p>
    </sec>
    <sec id="sec-2">
      <title>The FORTE Framework</title>
      <p>We now present our FORTE framework. We first describe the tree representation
of a formula (operator tree), which forms the foundation of our framework. We
then set up the MLP problem and introduce the various FORTE components,
including the tree encoder, tree decoder, and tree beam search. Figures 1 and 2
together provide a high-level overview of our framework.</p>
      <sec id="sec-2-1">
        <title>Formulae As Operator Trees</title>
        <p>
          Every formula X is inherently tree structured [
          <xref ref-type="bibr" rid="ref26 ref5">26,5</xref>
          ] and can be represented as
a symbolic operator tree (OT):
        </p>
        <p>X = (U, ≺),   u ∈ U,   U ⊆ V,   (1)</p>
        <p>
          where u is a math symbol (a node in the OT),1 U is the set of math symbols in the
operator tree X, and V is the "vocabulary", i.e., all unique math symbols in the
data set. ≺ represents the partial binary parent-child relation ∀u ∈ U [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Procedure
(a) in Fig. 1 illustrates the conversion from a formula to its OT. Intuitively, the
OT organizes the math symbols in a formula, such as operators, variables, and
numerical values, as nodes in an explicit, hierarchical tree structure. We choose
OT because of its intuitive interpretation and rich semantics. We emphasize that
our FORTE framework is agnostic to the underlying tree representation; other
tree representations such as symbol layout tree [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] can also be used.
Math Symbol Vocabulary. The size of the math symbol vocabulary V may be
unbounded (e.g., every element of the real number set R, which is uncountably
infinite, could be an element of V); however, most symbols rarely appear. We
thus propose the following truncation method in order to work with a finite
vocabulary in practice. First, we partition the vocabulary V into five disjoint
sub-vocabularies according to symbol type: numeric Vnum (numbers,
decimals), functional Vfun (multiplication, subtraction, etc.), variable Vvar,
textual Vtxt, and others Vo. We do so because different types of math symbols carry
different semantic meanings. Then, we retain only the most frequent K symbols
in each sub-vocabulary and convert all others to an "unknown" symbol specific to
each type. This setup preserves the type-level semantics of symbols that occur
infrequently.
        </p>
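To make the truncation procedure above concrete, here is a minimal sketch (not the authors' code; the `build_vocab` helper and the `(symbol, type)` input format are illustrative assumptions): partition symbols by type, keep the K most frequent per type, and map the rest to a type-specific "unknown" token.

```python
from collections import Counter

def build_vocab(symbols_with_types, k):
    """symbols_with_types: iterable of (symbol, type) pairs, e.g. ("4", "num").
    Returns a dict keyed by (symbol, type) mapping each symbol to itself if it
    is among the k most frequent symbols of its type, else to "<unk:type>"."""
    counts = {}  # type -> Counter over symbols of that type
    for sym, typ in symbols_with_types:
        counts.setdefault(typ, Counter())[sym] += 1
    vocab = {}
    for typ, ctr in counts.items():
        kept = {s for s, _ in ctr.most_common(k)}  # top-k per sub-vocabulary
        for s in ctr:
            vocab[(s, typ)] = s if s in kept else f"<unk:{typ}>"
    return vocab
```

For example, with k = 1 the rarer variable "y" collapses to "&lt;unk:var&gt;" while "x" and "+" survive in their sub-vocabularies.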
      </sec>
      <sec id="sec-2-2">
        <title>The MLP Problem Formulation</title>
        <p>
          We set up the MLP problem as an unsupervised "autoencoding" task (see, e.g.,
Ch. 14 in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]), motivated by the downstream tasks that we envision our framework
will perform. Specifically, our framework aims to reconstruct the input formula
in its OT representation through an encoder-decoder bottleneck model design.
This problem setup allows us to use the latent embedding from the encoder
output for many downstream tasks, e.g., formula retrieval, and the generated
formula from the decoder output for generation-related tasks.
        </p>
        <p>Concretely, the training objective of our framework is</p>
        <p>
L(θ, φ) = −(1/N) Σ_{i=1}^{N} P_x(X_out^(i)) log f_d(f_e(X_in^(i); θ); φ),   (2)
        </p>
        <p>where N is the number of formulae, f_e is the encoder function with parameters θ,
f_d is the decoder function with parameters φ, and P_x is the empirical data
distribution. X_in^(i) and X_out^(i) are the input and output representations of the i-th
formula tree in our new dataset, which we will introduce in Sec. 3. We will drop
the data point index i in the remainder of the paper for simplicity of exposition.
(1 We will refer to "math symbol" and "node" interchangeably depending on context.)</p>
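As a toy illustration of this objective (with hypothetical stand-ins for the encoder f_e and decoder f_d, which in the actual framework are neural networks), the loss averages, over formulae, the negative log-probability the decoder assigns to each ground-truth output node:

```python
import math

def reconstruction_loss(trees, encode, decode):
    """trees: list of (input_tree, target_nodes) pairs.
    encode plays the role of f_e (tree -> embedding); decode plays the role
    of f_d and returns one probability per ground-truth target node."""
    total = 0.0
    for x_in, x_out in trees:
        h = encode(x_in)                      # f_e: tree -> embedding
        probs = decode(h, x_out)              # f_d: per-node probabilities
        total += sum(-math.log(p) for p in probs)
    return total / len(trees)                 # average over the N formulae
```

With a stand-in decoder that assigns probability 0.5 to each of two target nodes, the loss for a single formula is 2 ln 2.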
      </sec>
      <sec id="sec-2-3">
        <title>Formula Tree Encoder</title>
        <p>Our tree encoder takes a formula tree as input and outputs an embedding of
this formula. The key idea is to properly encode all information underlying
the formula tree. To this end, we use two methods: tree traversal,
which extracts content (node) information, and tree positional encoding, which
extracts structural (relative node position) information. Figure 1 provides
an overview of our formula tree encoder.</p>
        <p>Formula Tree Traversal and Node Embedding. To obtain the content in a formula
tree, we employ tree traversal, which visits each node in the tree in a particular
order and extracts its content. Process (b) in Fig. 1 illustrates this process.
In this work, we consider traversal using depth-first search (DFS), although
other traversal orders can also be used. This step returns a DFS-ordered list
of nodes {u_t}_{t=1}^T, where t is the position of node u_t in the DFS order and T is
the number of nodes in the formula tree. Each node is then represented as a
trainable embedding x_t^e ∈ R^M with dimension M.</p>
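A minimal sketch of this linearization step, assuming trees are represented as nested (symbol, children) tuples (a representation chosen here for illustration; the framework then maps each symbol in the resulting list to its trainable embedding):

```python
def dfs_nodes(tree):
    """Return the node symbols {u_t}, t = 1..T, in depth-first order."""
    symbol, children = tree
    nodes = [symbol]                 # visit the node itself first
    for child in children:
        nodes.extend(dfs_nodes(child))  # then recurse into each subtree
    return nodes
```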
        <p>
          Tree Positional Embedding. To extract the structure of a formula tree, we
propose a two-step method that first retains and then embeds the relative positions
of nodes in the tree. First, we recursively encode the position t of node u as
q_t ∈ R^{ℓ_t}, where ℓ_t = 0, …, T is the depth of node u in the tree. q_t is composed of
the node's parent's position appended with its relative position among its siblings;
q_t = [0] for the root node. The formula tree in Fig. 1 illustrates this step. For
example, the position [0, 1, 1] of the numeric node "4" is composed of [0, 1], which
is its parent's position, and [1], because it is the second child of its parent. Second,
we propose a binary tree positional embedding to convert the encoded position
q_t of each node to a fixed-dimensional vector p_t ∈ R^D, where
        </p>
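The recursive position code can be sketched as follows (trees are again nested (symbol, children) tuples, an illustrative representation): the root gets [0], and every other node appends its 0-based sibling index to its parent's code, so a second child appends 1, matching the [0, 1, 1] example above.

```python
def position_codes(tree):
    """Return (symbol, code) pairs in depth-first order, where code is the
    node's recursive position q_t described in the text."""
    out = []
    def visit(node, code):
        symbol, children = node
        out.append((symbol, code))
        for i, child in enumerate(children):
            visit(child, code + [i])   # parent's code + 0-based sibling index
    visit(tree, [0])                   # root is assigned [0]
    return out
```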
        <p>
p_t[⌈log₂(C)⌉j : ⌈log₂(C)⌉j + ⌈log₂(q_t[j])⌉] = bin(q_t[j]),   ∀j = 0, …, ℓ_t.   (3)
C is the maximum degree (i.e., number of children) of all trees in the dataset,
⌈·⌉ is the ceiling function, bin(·) is the binarization operator (e.g., bin(5) = 101),
and p_t[j] selects the j-th index of the vector p_t. The resulting dimension of the
tree positional embedding p_t is D = L⌈log₂(C)⌉, where L is the maximum depth
of all trees in the dataset.
Formula Tree Embedding. To transform the formula tree into its embedding, we
utilize an embedding function f_e : R^{(M+D)×T} → R^K, where K is the dimension
of the formula tree embedding and T is the total number of nodes in the formula
tree. Because M is not necessarily the same as D, we concatenate the node and
tree positional embeddings. Concretely, the formula tree embedding is computed
as
h = f_e({x_t}_{t=1}^T; θ),   x_t = [x_t^{e⊤}, p_t^⊤]^⊤.   (4)
There are many options for instantiating the embedding function f_e because,
thanks to our node and tree positional embedding methods, the tree content
and structure are fully preserved in the encoded input sequence. In this work,
we use the gated recurrent unit network (GRU) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for f_e, but one can freely
choose other appropriate models.
        </p>
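A sketch of the binary positional embedding of Eq. (3), under the simplifying assumption that every entry of the position code q_t is written in binary into its own fixed slot of ⌈log₂(C)⌉ bits (so the vector has dimension D = L⌈log₂(C)⌉, with unused slots left at zero):

```python
import math

def binary_positional_embedding(q, C, L):
    """q: position code (list of sibling indices), C: maximum node degree in
    the dataset, L: maximum tree depth. Returns a 0/1 vector of length
    L * ceil(log2(C))."""
    width = max(1, math.ceil(math.log2(C)))   # bits per slot
    p = [0] * (L * width)
    for j, idx in enumerate(q):
        bits = format(idx, "b").zfill(width)  # bin(q[j]), zero-padded
        for b, bit in enumerate(bits[-width:]):
            p[j * width + b] = int(bit)       # write into slot j
    return p
```

For example, with C = 4 and L = 3 the code [0, 1, 1] from the text occupies three 2-bit slots. A onehot-style scheme would instead need C entries per slot, which is the O(LC) vs. O(L log₂ C) gap discussed below.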
        <p>
          Relation to Prior Work. Our tree encoder design differs from existing approaches
in two regards. First, compared to [
          <xref ref-type="bibr" rid="ref20 ref3">20,3</xref>
          ], which perform tree traversal during
training and thus only allow a single data point per iteration, our encoder performs
traversal before training, which enables mini-batch processing during training.
As a result, our approach removes this computationally expensive traversal step
from the training process and significantly speeds up training. Second, compared
to [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which uses a onehot-style tree positional embedding, our encoder employs
a different binary tree positional embedding, which reduces the space complexity
from O(LC) to O(L log2(C)). This reduction is especially significant for trees
with a large degree.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Formula Tree Decoder</title>
        <p>The decoder takes a formula embedding vector, i.e., the output from our tree
encoder, as input and generates a formula tree as output. We face two main
challenges when we design the decoder. First, the terminating condition of tree
generation is unclear because the formula tree can have multiple leaf nodes, each
of which terminates the generation in a single branch but not necessarily the
entire tree. Second, the order of generation, i.e., which node to generate next,
is unclear because there are at least two directions at each node: its siblings
(horizontal) and its children (vertical).</p>
        <p>We now present our decoder, which tackles these two challenges by modifying
the decoder target from the encoder input (recall that in a usual autoencoding
task, the input and target are exactly the same) and generating by traversing
the tree. We also present our novel tree beam search generation algorithm, which
improves formula tree generation quality during test time.</p>
        <p>
          Modified Decoder Target Tree. To address the challenge of the unclear generation
termination condition, we propose to slightly modify the decoder target from
the input formula tree by attaching an additional special "end" node to each
node as its last child. Doing so informs the decoder when a node has no more
children to generate and provides a clear generation termination signal for each
node. Figure 2a compares the encoder's input and the decoder's target for the same
formula tree.
        </p>
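The target-tree modification can be sketched in a few lines (trees are nested (symbol, children) tuples here, an illustrative representation): append a special "end" node as the last child of every original node, so the decoder receives an explicit per-node termination signal.

```python
END = "<end>"  # special termination symbol; the token name is illustrative

def add_end_nodes(tree):
    """Return a copy of the tree with an END leaf appended as the last
    child of every original node (END nodes themselves get no children)."""
    symbol, children = tree
    return (symbol, [add_end_nodes(c) for c in children] + [(END, [])])
```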
        <p>Generation Via Tree Traversal. To address the challenge of the unclear
generation order, we propose to traverse the tree during generation in the same
order as the encoder. A data structure is required to track the traversal process.
Therefore, for an encoder using DFS traversal, we propose to use a stack that
maintains nodes whose generation is unfinished in the DFS order. In addition to
the node itself, the stack also maintains the number of children and the position
of each node in the stack.</p>
        <p>During generation, we use the current node's embedding x̃_t together with the
next node's position p_{t+1} to generate the next node. We do so because a node
can have multiple children; therefore, we use the next node's position to inform
the model which node to generate next. This is a major difference from typical
sequence generation models, e.g., Transformers, in which the input position is
the input token's position itself. Concretely, the next node is generated as
û_{t+1} = argmax_{u∈V} softmax(f_d({x'_s}_{s=1}^t; φ)),   x'_s = [x_s^{e⊤}, p_{s+1}^⊤, h^⊤]^⊤,   s = 1, …, t,   (5)
where we also concatenate the formula tree embedding h to the decoder's input
at each generation step. The next node's position p_{s+1} is computed using the
number of children and the position of the current input node; see Section 2.3 for
details. When a node's generation finishes, as signaled by the generation of an
"end" node, that node is removed from the stack. The "end" node itself is never
added to the stack. The entire tree generation is finished when the stack is empty,
i.e., there are no more nodes to expand further. We use a special "start" token to
mark the beginning of the generation process. Figure 2b illustrates the generation
process.
Tree Beam Search for Tree Generation. The above generation process is a greedy
algorithm, which may be sub-optimal. To improve tree generation quality, we
propose tree beam search (TBS). The idea is to maintain a stack for each
beam, which records the node generation order. During the generation process,
the decoder generates the top B most probable next nodes from the currently
generated tree of each beam, resulting in a total of B² candidate trees to expand.
We then select the B most probable candidate trees to expand further. This process
continues until B trees finish generation or until a preset maximum number of
steps has been reached.</p>
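The greedy, stack-based generation loop described above can be sketched as follows. Here `next_symbol(history)` is a hypothetical stand-in for the trained decoder of Eq. (5) (it returns the next node symbol given the generation history), and trees are (symbol, children) tuples; generating "&lt;end&gt;" finishes the node on top of the stack, and the tree is complete once the stack empties.

```python
START, END = "<start>", "<end>"  # illustrative token names

def generate_tree(next_symbol, max_steps=100):
    root = (next_symbol([START]), [])
    stack = [root]                      # nodes whose generation is unfinished
    history = [START, root[0]]
    for _ in range(max_steps):
        if not stack:
            break                       # no more nodes to expand
        sym = next_symbol(history)
        history.append(sym)
        if sym == END:
            stack.pop()                 # "<end>" itself is never stacked
        else:
            node = (sym, [])
            stack[-1][1].append(node)   # attach as the next child
            stack.append(node)          # descend, matching the DFS order
    return root
```

Feeding the symbol sequence "+", "a", "&lt;end&gt;", "4", "&lt;end&gt;", "&lt;end&gt;" reconstructs the tree for a + 4.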
        <p>
          TBS extends beam search in NLP (e.g., in [
          <xref ref-type="bibr" rid="ref1 ref19">19,1</xref>
          ]), which only concerns
sequential data and is incapable of generating tree-structured data. Similar to beam
search in NLP, by expanding the search space of candidate trees by a factor of
B, TBS enables more flexible and higher-quality generation during the decoding
process compared to greedy generation.
        </p>
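A single TBS expansion step can be sketched as below. `propose(state)` is a hypothetical model stand-in returning (symbol, log-probability) pairs sorted by probability; each beam's score is the sum of the log-probabilities of its generated nodes, each beam proposes its top-B continuations (up to B² candidates), and only the B best survive.

```python
import heapq

def tbs_step(beams, propose, B):
    """beams: list of (score, generated_symbols) pairs.
    Returns the B highest-scoring successor beams."""
    candidates = []
    for score, state in beams:
        for symbol, logp in propose(state)[:B]:      # top-B per beam
            candidates.append((score + logp, state + [symbol]))
    return heapq.nlargest(B, candidates, key=lambda c: c[0])  # prune to B
```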
        <p>
          Relation to Prior Work. To our knowledge, this is the first beam search
algorithm for generating tree-structured data. Our work differs from [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], which
proposed a tree beam search algorithm for recommender systems that treats
the entire dataset as a tree, where each node is a data point (user, item); in
our work, we treat each data point (formula) as a tree. Our work also differs
from [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which specifies the number of children each node must have in the
tree, significantly constraining the generation process, unnecessarily increasing the
node vocabulary, and leading to worse generation quality; in our work, there are
no constraints on the number of children that each node can have, resulting in
flexible and varied generated formula trees.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <p>We conduct two experiments to validate FORTE. In the first experiment, we
demonstrate the advantage of FORTE compared to other tree and sequence
generation methods for formulae in the formula reconstruction task. In the
second experiment, we demonstrate an application of FORTE in the context of the
formula retrieval task and show the advantage of FORTE over existing formula
retrieval systems. For our framework in all experiments, we use a 2-layer
bidirectional GRU for the encoder and a 2-layer unidirectional GRU for the decoder.
Dataset. We collected a large real-world dataset of more than 770k formulae
from a subset of articles on Wikipedia and arXiv. We extracted formulae from
these articles and processed them into OT representations.
</p>
      <sec id="sec-3-1">
        <title>Formula Reconstruction</title>
        <p>
          In this experiment, we test FORTE's ability to reconstruct a formula. Because
some baselines only work on binary trees [
          <xref ref-type="bibr" rid="ref17 ref3">3,17</xref>
          ], we select a subset of 170k
formulae whose operator trees are binary.
        </p>
        <p>
          Baselines. We consider the following baselines: seq2seqRNN which implements
the same encoder and decoder as our framework but processes formulae as
sequences of math symbols; tree2treeRNN [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] which is an RNN-based method
capable of encoding and decoding only binary trees; treeTransformer [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]
which is a Transformer-based method that shows success only on binary trees.
The latter two baselines were originally developed and evaluated on a very
different task (program translation) from MLP. We also include four variants of our
framework to evaluate the utility of (1) binary against onehot tree positional
embedding and (2) TBS against greedy search for tree generation. We construct
training, validation, and test sets by splitting the 170k dataset 80%-10%-10%.
We train each model 5 times for 50 epochs and record the model with the best
performance on the validation set. We then perform formula reconstruction on the
test set using beam size B = 10 for applicable methods and report the average
values of the following metrics on the test set.
        </p>
        <p>
          Evaluation Metrics. We use two groups of metrics. The first group of metrics
measures the reconstruction accuracy, i.e., the percentage of the generated
formulae that are exactly the same as the ground truth. We compute both ACC
top-1, using only the most probable generated formula, and ACC top-5, using
the five most probable generated formulae. The second group of metrics
measures how much the generated formula tree differs from the ground-truth formula
tree. We use tree edit distance (TED), which measures the distance of two trees
by computing the minimum number of operations needed, including changing
nodes and node connections, to convert one tree to the other. See [
          <xref ref-type="bibr" rid="ref13 ref14">14,13</xref>
          ] for an
overview of the TED algorithm. We compute both TED-overall which considers
both node and connection editing and TED-structural which only considers
connection editing.
        </p>
        <p>Results. Table 2 presents the formula reconstruction results comparing our
framework with the baselines. Compared to the two tree2tree baselines, which
struggle at this task, the encoder and decoder designs in FORTE enable near-perfect
formula reconstruction. Compared to the seq2seqRNN baseline, FORTE shows
the improvement gained by processing formulae as OTs rather than as sequences. Moreover,
the results from the four FORTE variants clearly demonstrate the benefits of
binary tree positional embedding and TBS, which lead to improvements in all four
metrics compared to onehot tree positional embedding and greedy search,
respectively. We repeat this experiment on the full dataset comparing only seq2seqRNN
and FORTE, since the tree-based baselines cannot process non-binary trees.
FORTE achieves 85.87% compared to seq2seqRNN's 84.30% on top-1 ACC
and 90.30% compared to seq2seqRNN's 88.52% on top-5 ACC, respectively.
These results further show the benefits of representing formulae as OTs rather
than as sequences.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Formula Retrieval</title>
        <p>In this qualitative experiment, we evaluate FORTE's capabilities in a formula
retrieval application. Given an input formula (query), a retrieval method aims
to return the top relevant formulae (retrievals) from a collection of formulae.
We use the entire 770k formula dataset to train our framework and then use
the trained encoder to obtain an embedding for each formula.</p>
        <p>
          For each query,
we compute the cosine similarity between its embedding and the embedding of
each formula in the dataset. Finally, we choose the formulae with the highest
similarity scores as the retrievals. We use the queries from the NTCIR-12 formula
retrieval task [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Table 1 shows a few examples of the queries.
        </p>
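The retrieval step just described can be sketched as follows (embeddings are plain float lists here for illustration; the actual embeddings come from the trained encoder): rank the corpus by cosine similarity to the query embedding and return the most similar formulae.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_emb, corpus_embs, top_k=25):
    """Return the indices of the top_k formulae most similar to the query."""
    order = sorted(range(len(corpus_embs)),
                   key=lambda i: cosine(query_emb, corpus_embs[i]),
                   reverse=True)
    return order[:top_k]
```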
        <p>
          Model and Baselines. We consider three state-of-the-art baselines designed
specifically for the formula retrieval task including Tangent-CFT [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which is one
of the few data-driven formula retrieval systems to date, and Tangent-S [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and
Approach0 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], both of which are based on symbolic sub-tree matching and
are data independent. We train Tangent-CFT on the same dataset as FORTE.
Evaluation Metrics. Because it is difficult to algorithmically judge the relevance
of a retrieval to a query, we perform a human evaluation for this task as follows.
First, for each method and each query, we choose the top 25 retrieved formulae
and mix them into a single pool of retrievals. Second, for each query, three
human evaluators independently provide a ternary rating for each retrieval in
the pool, i.e., whether the retrieval is relevant, half-relevant, or irrelevant to
the query. The above evaluation procedure is consistent with [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], including
the number of evaluators involved. We then use the mean average precision
(MAP) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] and bpref [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] as the evaluation metrics. Compared to other retrieval
evaluation metrics, both MAP and bpref are easy to interpret and appropriate
for evaluating multiple queries and for comparing multiple retrieval systems.
Results. Table 3 presents the quantitative evaluation results, averaged over the
three evaluators' scores. Both metrics indicate that our framework achieves
superior performance to the other data-driven baseline, TangentCFT, and one
of the data-independent baselines, Tangent-S. Our method falls slightly behind
Approach0. This may be caused by the fact that Approach0 performs more explicit
symbolic matching directly using subtrees, whereas we rely on more abstract
vector representations of formulae, which may lose information during
similarity computation. This result is consistent with the existing literature [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. We then
propose a new model, FORTE-App, that combines the strengths of FORTE and
Approach0. The last row of Table 3 shows that the combined method achieves
new state-of-the-art performance on the formula retrieval experiment. In
particular, the bpref score implies that, on average, in FORTE's retrieved formulae,
relevant formulae (as judged by the evaluators) rank higher than irrelevant ones
more often than in those retrieved by the baselines.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and Future Work</title>
      <p>
        In this work, we propose FORTE, a novel, unsupervised mathematical language
processing framework by leveraging tree embeddings. By encoding formulae as
operator trees, we can explicitly capture the inherent structure and semantics
of a formula. We propose an encoder and a decoder capable of embedding and
generating formula trees, respectively, and a novel tree beam search algorithm
to improve generation quality at test time. We evaluate our framework on
formula reconstruction and formula retrieval and demonstrate its superior
performance in both experiments compared to baselines. There are many avenues for future work.
One direction is to combine our framework's dedicated capability to encode and
generate formulae with state-of-the-art NLP methods to enable cross-modality
applications that involve both mathematical and natural language. For example,
our framework can serve as a drop-in replacement for the formulae processing
part in a number of existing works to potentially improve performance, i.e.,
in [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] for joint text and math retrieval, in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] for math headline generation,
in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] for grading students' math homework solutions, and in [
        <xref ref-type="bibr" rid="ref15 ref9">15,9</xref>
        ] for neural
math reasoning. We also look forward to conducting a thorough human
evaluation for relevant formula retrieval.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural Machine Translation by Jointly Learning to Align and Translate</article-title>
          . arXiv e-prints (
          <year>Sep 2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.:</given-names>
          </string-name>
          <article-title>Retrieval evaluation with incomplete information</article-title>
          .
          <source>In: Proc. Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</source>
          . pp.
          <fpage>25</fpage>
          –
          <lpage>32</lpage>
          . SIGIR '04
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Tree-to-tree neural networks for program translation</article-title>
          .
          <source>In: Proc. Intl. Conf. Neural Info. Process. Syst</source>
          . pp.
          <fpage>2552</fpage>
          –
          <lpage>2562</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Merrienboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning phrase representations using RNN encoder–decoder for statistical machine translation</article-title>
          .
          <source>In: Proc. Conf. Empirical Methods Natural Lang. Process</source>
          . pp.
          <fpage>1724</fpage>
          –
          <lpage>1734</lpage>
          (Oct
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Davila</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanibbi</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>Layout and semantics: Combining representations for mathematical formula search</article-title>
          .
          <source>In: Proc. Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</source>
          . pp.
          <fpage>1165</fpage>
          –
          <lpage>1168</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yin</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Preliminary Exploration of Formula Embedding for Mathematical Information Retrieval: can mathematical formulae be embedded like a natural language?</article-title>
          . arXiv e-prints (
          <year>Jul 2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Goodfellow</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Deep Learning</article-title>
          . MIT Press (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kunen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Set theory - an introduction to independence proofs, Studies in logic and the foundations of mathematics</article-title>
          , vol.
          <volume>102</volume>
          . North-Holland (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Lample</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charton</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Deep learning for symbolic mathematics</article-title>
          .
          <source>In: Proc. Intl. Conf. Learn. Representations</source>
          (
          <year>2020</year>
          ), https://openreview.net/forum?id=S1eZYeHFDS
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lan</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vats</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waters</surname>
            ,
            <given-names>A.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baraniuk</surname>
            ,
            <given-names>R.G.:</given-names>
          </string-name>
          <article-title>Mathematical language processing: Automatic grading and feedback for open response mathematical questions</article-title>
          .
          <source>In: Proc. ACM Conf. Learn. @ Scale</source>
          . pp.
          <fpage>167</fpage>
          –
          <lpage>176</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Mansouri</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohatgi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oard</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanibbi</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>Tangentcft: An embedding model for mathematical formulas</article-title>
          .
          <source>In: Proc. Intl. ACM SIGIR Conf. Res. Develop. Info. Retrieval</source>
          . pp.
          <fpage>11</fpage>
          –
          <lpage>18</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>X.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joty</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hoi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>Tree-structured attention with hierarchical accumulation</article-title>
          .
          <source>In: Proc. Intl. Conf. Learn. Representations</source>
          (
          <year>2020</year>
          ), https://openreview.net/forum?id=HJxK5pEYvr
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pawlik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.:</given-names>
          </string-name>
          <article-title>Efficient computation of the tree edit distance</article-title>
          .
          <source>ACM Trans. Database Syst</source>
          .
          <volume>40</volume>
          (
          <issue>1</issue>
          ) (
          <year>Mar 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Pawlik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augsten</surname>
            ,
            <given-names>N.:</given-names>
          </string-name>
          <article-title>Tree edit distance: Robust and memory-efficient</article-title>
          .
          <source>Info. Syst</source>
          .
          <volume>56</volume>
          ,
          <fpage>157</fpage>
          –
          <lpage>173</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Saxton</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grefenstette</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hill</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohli</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Analysing mathematical reasoning abilities of neural models</article-title>
          .
          <source>In: Proc. Intl. Conf. Learn. Representations</source>
          (
          <year>2019</year>
          ), https://openreview.net/forum?id=H1gR5iR5FX
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sordoni</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ordered neurons: Integrating tree structures into recurrent neural networks</article-title>
          .
          <source>In: Proc. Intl. Conf. Learn. Representations</source>
          (
          <year>2019</year>
          ), https://openreview.net/forum?id=B1l6qiR5F7
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Shiv</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quirk</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Novel positional encodings to enable tree-based transformers</article-title>
          .
          <source>In: Proc. Intl. Conf. Neural Info. Process. Syst</source>
          . pp.
          <fpage>12081</fpage>
          –
          <lpage>12091</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Han</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cheng</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Transformer based multi-grained attention network for aspect-based sentiment analysis</article-title>
          .
          <source>IEEE Access</source>
          <volume>8</volume>
          ,
          <fpage>211152</fpage>
          –
          <lpage>211163</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>Q.V.</given-names>
          </string-name>
          :
          <article-title>Sequence to sequence learning with neural networks</article-title>
          .
          <source>In: Proc. Intl. Conf. Neural Info. Process. Syst</source>
          . pp.
          <fpage>3104</fpage>
          –
          <lpage>3112</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Tai</surname>
            ,
            <given-names>K.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.:</given-names>
          </string-name>
          <article-title>Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks</article-title>
          . arXiv e-prints (
          <year>Feb 2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.K.</given-names>
          </string-name>
          , et al.:
          <article-title>TREC: Experiment and evaluation in information retrieval</article-title>
          , vol.
          <volume>63</volume>
          . MIT Press, Cambridge (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.N.</given-names>
          </string-name>
          :
          <article-title>Tree transformer: Integrating tree structures into self-attention</article-title>
          .
          <source>In: Proc. Conf. Empirical Methods Natural Lang. Process. and Intl. Joint Conf. Natural Lang. Process</source>
          . pp.
          <fpage>1061</fpage>
          –
          <lpage>1070</lpage>
          (Nov
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Yasunaga</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>TopicEq: A Joint Topic and Mathematical Equation Model for Scientific Texts</article-title>
          .
          <source>In: Proc. AAAI Conf. Artificial Intell</source>
          . (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          :
          <article-title>Automatic generation of headlines for online math questions</article-title>
          .
          <source>In: Proc. AAAI Conf. Artificial Intell</source>
          . pp.
          <fpage>9490</fpage>
          –
          <lpage>9497</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Zanibbi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aizawa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kohlhase</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ounis</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Topic</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davila</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>NTCIR-12 MathIR task overview</article-title>
          .
          <source>In: Proc. NTCIR Conf. Eval. Info. Access</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Zanibbi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blostein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Recognition and retrieval of mathematical expressions</article-title>
          .
          <source>Intl. J. Document Anal. Recognit</source>
          .
          <volume>15</volume>
          (
          <issue>4</issue>
          ),
          <fpage>331</fpage>
          –
          <lpage>357</lpage>
          (Dec
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohatgi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giles</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanibbi</surname>
            ,
            <given-names>R.:</given-names>
          </string-name>
          <article-title>Accelerating substructure similarity search for formula retrieval</article-title>
          .
          <source>In: Proc. European Conf. Info. Retrieval</source>
          . pp.
          <fpage>714</fpage>
          –
          <lpage>727</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zanibbi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Structural similarity search for formulas using leaf-root paths in operator subtrees</article-title>
          . In:
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuhr</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mayr</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiemstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (eds.)
          <source>Proc. European Conf. Info. Retrieval</source>
          . pp.
          <fpage>116</fpage>
          –
          <lpage>129</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Zhuo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Learning optimal tree models under beam search</article-title>
          .
          <source>In: Proc. Intl. Conf. Mach. Learn</source>
          . vol.
          <volume>119</volume>
          , pp.
          <fpage>11650</fpage>
          –
          <lpage>11659</lpage>
          (13–18
          <year>Jul 2020</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>