    Interpretable Nearest Neighbor Queries for Tree-Structured
        Data in Vector Databases of Graph-Neural Network
                           Embeddings
                                  Lukas Pfahler                                   Jan Richter
                          TU Dortmund University                           TU Dortmund University
                            Dortmund, Germany                                Dortmund, Germany
                       lukas.pfahler@tu-dortmund.de                  jan-philip.richter@tu-dortmund.de

ABSTRACT
We propose a method for interpreting similarity computations between neural embeddings of trees. We showcase our approach in a search engine for mathematical equations crawled from arxiv.org. The equations are encoded as MathML, an XML file format that describes math in tree structures. These trees are processed by a graph convolutional neural network to obtain fixed-size, low-dimensional and dense embedding vectors for each tree. Search is performed by a nearest neighbor query in the set of embeddings. However, from the embeddings alone it is difficult to judge how the similarity came about. We propose two different approaches to highlight parts of the equation that are important for the similarity, one based on the forward computation of the neural network and one based on the gradient in the backward computation. In a qualitative study on math retrieval we show the advantages of both methods.

KEYWORDS
neural networks, interpretability, information retrieval

© 2020 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2020 Joint Conference (March 30-April 2, 2020, Copenhagen, Denmark) on CEUR-WS.org. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1    INTRODUCTION
The current advances in artificial intelligence, particularly in deep learning, have had a huge impact on information retrieval and have changed the way we process, index and retrieve data. Artificial neural networks transform multimedia content like text, images, audio or video and compute latent representations which we can use to detect semantic similarities.

In this work we focus on retrieval of mathematical expressions. We showcase a neural-network based math retrieval engine that finds semantically related formulas based on contextual similarity. The system is designed to help scientists find related research based on math queries. We leverage machine learning to process a large corpus of mathematical expressions and learn a vectorial representation that uncovers relations between equations. A human reader can use background knowledge on conventions and notations to infer the context of a mathematical expression or judge the relevance of an equation for a search query. We hope to learn a model that can perform this task using large amounts of data. Our system starts by representing formulas or equations in an XML tree format, namely MathML, and uses a graph convolutional neural network to embed these trees in a dense, low-dimensional vector space that we can search efficiently using index data structures for fast nearest neighbor retrieval. At the end of this machine learning pipeline, the users interact with the search interface and are presented with results. These results are based on neural network computations, and it is often unclear how they came about. Traditional keyword queries are easily interpreted: we can highlight where the query keywords or expressions related to the keywords appear in the results, thereby justifying the system's output. Because our system does not rely on exact matching of sub-terms, but rates overall semantic relatedness, the outputs are harder to interpret and harder to visualize, as we cannot necessarily highlight overlapping parts of result and query. To overcome this issue, we propose to apply so-called salience maps for visualizing the similarity scores predicted by a graph convolutional neural network. These visualizations help both the end user in understanding how results came about and the machine learning engineer who deploys the models in understanding how the model rates similarities.

The rest of this work is structured as follows: We begin by discussing related work, including aspects of interpretable machine learning as well as geometric deep learning. Then we present our two methods for visualizing embeddings of trees in Section 3. In Section 4 we discuss our results in the context of a search engine for mathematical content. We describe our dataset and data preparation and perform a qualitative study of our visualizations. We conclude this paper with an outlook in Section 5.

2    RELATED WORK
The call for interpretable machine learning models is popular in times where more and more decisions are automated using tools of artificial intelligence. It is connected to questions of accountability as well as ethics and morality and is often accompanied by dystopian visions of out-of-control artificial intelligence systems [9]. In practice, interpretable machine learning can roughly be divided into two groups: interpretable model families and "model-agnostic interpretation tools" [9]. The former considers models that are interpretable in themselves [12]: usually we mention decision trees and linear models here. However, some restrictions apply: in order for either of them to be interpretable, the number of parameters they use must be sufficiently small and the interactions between the parameters must be easily understandable. A deep decision tree is no longer easy to interpret, even when it uses only a small number of features. A linear classification can be interpretable with a larger number of used features, as the features can be interpreted independently; a polynomial classification is more complicated to interpret in that regard. When decision trees become too complex to be interpreted, simpler models like decision lists or ordered decision lists offer a simpler solution [2], as they do not recursively branch. Many of the recent successes of machine learning models for practical problems come from the use of massively over-parametrized neural models. These are diametrically opposed to the two characteristics of interpretable models mentioned above, as they use millions of parameters and their non-linear nature obfuscates how these parameters interact.
Here we can apply the latter group of approaches, which either aim to make the black-box model interpretable [13] or aim to make the decisions of black-box models interpretable (e.g. [5]). In the context of image classification the most popular techniques are based on visualizations where we highlight regions of the input image that are important for the overall classification. One of the popular approaches in image classification computes a derivative in the first layer and projects it onto the image [14]. Building on the deconvolution technique [14, 18], Springenberg et al. propose a kind of guided backpropagation where negative gradients are set to zero in the backward step to focus solely on input features which increase the activation of a higher-layer unit [15]. Another approach by Stylianou and Pless visualizes the output of a convolutional neural network using just the activations of the last convolutional layer to identify the regions of interest for the image classification [16]. We apply both directions of visualization in a similarity learning task rather than a classification task and modify them to work with graph convolutional networks instead of plain convolutional neural networks.

Geometric Deep Learning is an umbrella term for deep neural model approaches for structured data like manifolds, graphs or trees. The term graph convolutional neural network is also prominent. The results allow the application of deep architectures to inputs that represent e.g. molecules [3], point clouds [17] or social networks [6]. We work with XML representations of math.

Finally we want to discuss our application example of information retrieval for mathematical expressions. Approaches in this area can be divided into two groups: approaches that use a sequential representation of the mathematical content and approaches that apply a tree representation [7]. Similarity computation in tree-based approaches often requires dynamic programming algorithms that match subtrees, hence complex index data structures are required for efficient retrieval. More recently, Mansouri et al. proposed to embed mathematical expressions in a dense vector space [7] based on a sequential representation of the equations. Pfahler et al. propose to embed equations using an image representation and regular convolutional neural networks that are trained to predict contextual similarity [11].

3    METHOD
We begin the description of our method by reviewing the definition of graph convolutional neural networks. Then we propose two different methods for obtaining visualizations of nearest neighbor queries: one based on forward information and one based on backward information.

3.1    Graph Neural Networks for Trees
We define tree structures x = (X, E) as a tuple of node features X and edges E. Let |x| denote the number of nodes in x. We assume that X ∈ R^{|x|×d}. A graph neural network maps a given tree to an output tree with transformed feature vectors in a d'-dimensional output space but with identical edge structure. We denote the neural network by ϕ(x) = (ϕ(X, E), E). Let ϕ(x)_i ∈ R^{d'} denote the output of the i-th node.

Graph neural networks are defined by composing different layers. Borrowing the notation of Morris et al. [10], an abstract graph network layer can be described by its output

      x_i' = ψ(x_i, □_{j∈N(i)} ϕ(x_i, x_j, e_ij))

where ϕ, ψ are (sub-)differentiable operators such as linear transformations or multi-layer perceptrons, □ denotes a differentiable, permutation-invariant function like sum, mean or max, and N(i) denotes the set of all neighboring nodes of i in the tree with edges E. Note that ϕ might use information about the edges in the form of vectorial edge features e_ij. An example of a simple graph convolutional layer is

      x_i' = σ(W Σ_{j∈N(i)∪{i}} x_j)

which linearly transforms all nodes using a weight matrix W, aggregates by computing the sum over each neighborhood and applies a component-wise activation function σ like the sigmoid function or ReLU.

As long as all layers in a graph neural network are (sub-)differentiable operations, we can train the network via backpropagation. Efficient software libraries for training models with GPU support are available, e.g. we use torch-geometric [4].

To store the fixed-size embedding of the whole tree in a vector database, we compute the mean of all nodes, denoted by ϕ̄(x) = |x|^{-1} Σ_i ϕ(x)_i. Then we can compute the cosine similarity or the dot product

      sim(x, x') := ⟨ϕ̄(x), ϕ̄(x')⟩

in order to obtain the similarity of two trees.
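To make the computation concrete, the following minimal sketch implements this pipeline in plain PyTorch on a dense adjacency matrix. It is only illustrative: the function names are ours, and the deployed system uses the sparse message-passing layers of torch-geometric [4] instead.

import torch

def simple_gcn_layer(X, A, W):
    # One layer of the simple graph convolution above:
    # x_i' = relu(W * sum_{j in N(i) ∪ {i}} x_j)
    # X: (n, d) node features, A: (n, n) dense 0/1 adjacency (float), W: (d_out, d) weights.
    A_hat = A + torch.eye(A.shape[0])        # add self-loops: N(i) ∪ {i}
    aggregated = A_hat @ X                   # sum the features of each neighborhood
    return torch.relu(aggregated @ W.t())    # linear transformation + activation

def tree_embedding(X, A, weights):
    # Apply a stack of layers and mean-pool all node outputs into one vector.
    for W in weights:
        X = simple_gcn_layer(X, A, W)
    return X.mean(dim=0)                     # phi_bar(x) = |x|^-1 * sum_i phi(x)_i

def similarity(emb_x, emb_q):
    # Dot-product similarity between two pooled tree embeddings.
    return emb_x @ emb_q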
3.2    Forward-Visualization
As the embedding of a tree is the mean of all its nodes, we can see that each pair of nodes contributes to the overall similarity. The similarity, i.e. the dot product, can be decomposed as

      ⟨ϕ̄(x), ϕ̄(x')⟩ = (|x| · |x'|)^{-1} Σ_{i=1}^{|x|} Σ_{j=1}^{|x'|} ⟨ϕ(x)_i, ϕ(x')_j⟩.

Following the ideas of Stylianou and Pless [16], we visualize the similarity by coloring the nodes according to their contributions to the overall similarity. Hence, given the embedding q of a query, we color the i-th node proportional to

      c_i ∼ ⟨ϕ(x)_i, q⟩.

This approach is computationally cheap: we just have to run one forward pass for every search result we want to visualize.
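A sketch of this scoring step, assuming the last-layer node outputs ϕ(x)_i and the query embedding q are already available as tensors (the normalization to [0, 1] is only there to map scores onto a color palette):

import torch

def forward_node_scores(node_outputs, query_embedding):
    # Forward visualization: score each node of a result tree by the inner
    # product of its last-layer feature with the query embedding.
    # node_outputs: (n, d') tensor of phi(x)_i, query_embedding: (d',) tensor q.
    scores = node_outputs @ query_embedding          # c_i ~ <phi(x)_i, q>
    scores = scores - scores.min()                   # shift to non-negative values
    return scores / (scores.max() + 1e-12)           # scale into [0, 1] for coloring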
3.3    Backward-Visualization
Following the ideas of salience maps, we color nodes in the tree according to the gradient of the similarity. Looking at a first-order Taylor approximation of the similarity, we can write

      sim(x + ∆, q) ≈ sim(x, q) + ⟨∆, ∇_x sim(x, q)⟩

to approximate how the similarity changes based on perturbations ∆ to the input x. If we decrease the value of x in the i-th coordinate, the corresponding coordinate of the gradient ∇_x indicates whether the first-order approximation of the similarity decreases or increases. If the gradient is positive, then reducing the feature reduces the similarity; thus the feature is important for the overall similarity. Hence we use the values in the gradient ∇_x to visualize similarity.

In classical convolutional neural networks for image processing, the input is either a greyscale or an RGB image. Thus at every position there is either a 1-dimensional or a 3-dimensional feature vector and likewise a gradient vector. If there is only one dimension, we do not need to aggregate the gradient; in the case of three dimensions we often aggregate by using the maximum gradient over all color channels.

In our case, we have higher-dimensional feature vectors in R^d. Hence we have to aggregate the gradient,

      c_i ∼ f(∇_{x_i} sim(x, q)),

where f is an aggregation such as max, mean or the sum of all non-negative gradients.

We are particularly interested in the case where x is a one-hot encoding of XML data. Hence x is sparse and binary. We propose to use only gradient information of components that are set to 1 in the input, to answer the question how the similarity changes if we flip inputs from one to zero. The component-wise multiplication of the input and the gradient eliminates all gradient mass at input components that are zero. We aggregate only the remaining components,

      c_i ∼ f(x_i ⊙ ∇_{x_i} sim(x, q)),

and color the nodes proportional to this aggregation.
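The following sketch computes these gradient-based scores with automatic differentiation. It assumes a callable model(X, A) that returns the pooled tree embedding ϕ̄(x); both the callable and the function name are our own illustrative choices. The aggregation f defaults to the maximum, as used later in Section 4.2.1.

import torch

def backward_node_scores(model, X, A, query_embedding,
                         aggregate=lambda g: g.max(dim=1).values):
    # Backward visualization: gradient of the similarity w.r.t. the one-hot
    # node features, masked by the input and aggregated per node.
    X = X.clone().requires_grad_(True)
    sim = model(X, A) @ query_embedding      # scalar sim(x, q)
    sim.backward()                           # fills X.grad with grad_x sim(x, q)
    masked = X * X.grad                      # keep gradient mass only where x_i = 1
    scores = aggregate(masked)               # f: max / mean / sum of non-negative entries
    scores = scores - scores.min()
    return (scores / (scores.max() + 1e-12)).detach()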
4    INTERPRETABLE SEARCH FOR FORMULAS
We showcase our approach in a semantic search engine for mathematical equations which we have crawled from arxiv.org. First we give details on our dataset and machine learning model. Then we perform a number of example queries and discuss the quality of our visualizations.

4.1    Data
We outline how we gather data from arxiv.org, transform it into tree-structured data and encode it into vectorial embeddings.

4.1.1 Dataset. We are working on data obtained from arxiv.org, a service where scientists can upload their manuscripts or preprints without a reviewing process. We have downloaded all publications up to April 2019 and filtered all publications that use the subject code cs.LG, which covers computer science publications on machine learning. The publications are available as the original LaTeX files.

Of these 9,936 publications, we sample two subsets, train and test, of size 7,949 and 1,987 respectively. We use the train set for building our index and the test set as search queries.

From all publications, we extract mathematical expressions by using regular expressions for the most common math environments like 'equation', 'align', etc. Furthermore we extract user-defined commands and macros. Using the library Katex¹ we compile the raw LaTeX equations to the XML-based MathML format², originally designed for displaying and storing mathematical content in the semantic web. Since the equations are now in XML format, we have to work with tree-structured data.

An earlier version of the dataset used in [11] is published at https://whadup.github.io/EquationLearning/ and the version used in this study will be made available as well.

¹ http://katex.org
² More specifically, we use the Presentation-MathML format for displaying maths on the web as specified at https://www.w3.org/TR/MathML3/
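A minimal sketch of this extraction step is shown below; the list of environments is an illustrative placeholder and the resolution of user-defined macros is omitted.

import re

MATH_ENVS = ("equation", "align")   # illustrative; the real pipeline covers more environments

def extract_equations(latex_source):
    # Pull the bodies of common display-math environments out of a LaTeX file.
    equations = []
    for env in MATH_ENVS:
        pattern = re.compile(
            r"\\begin\{" + env + r"\*?\}(.*?)\\end\{" + env + r"\*?\}", re.DOTALL)
        equations.extend(match.strip() for match in pattern.findall(latex_source))
    return equations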
4.1.2 Data-Representation. We see an example of an equation encoded as MathML in Figure 1. The MathML standard defines 30 different XML tags like <mi> for math identifiers or <mo> for math operators. Some of these tags use attributes, for instance to change font or spacing. Leaf nodes contain text like numbers, parentheses or letters (greek, latin, etc.). We encode each node of the XML into one-hot-encoded vectors. We distinguish three groups of features: tag for encoding the XML tag (e.g. <mi>), attribute for encoding optional attributes (e.g. stretchy="false") and content for the text in leaf nodes (e.g. x, y, i, etc.). We encode this data using a total of 256 features, where we use 32 dimensions to encode the tags, 32 dimensions to encode the most frequent attributes (including one special symbol for unknown attributes) and the remaining 192 dimensions to encode the most frequent characters of the content (also including one unknown symbol). In addition to the one-hot-encoded features, each node has an attribute that specifies the position within the parent node. We assume that edges in the tree are bidirectional.

<math xmlns="http://www.w3.org/...">
  <semantics>
    <mrow>
      <mfrac>
        <mn>1</mn>
        <mi>n</mi>
      </mfrac>
      <msubsup>
        <mo>Σ</mo>
        <mrow>
          <mi>i</mi> <mo>=</mo> <mn>1</mn>
        </mrow>
        <mi>n</mi>
      </msubsup>
      <mi mathvariant="normal">ℓ</mi>
      <mo stretchy="false">(</mo>
      <mi>f</mi>
      <mo stretchy="false">(</mo>
      <msub>
        <mi>x</mi> <mi>i</mi>
      </msub>
      <mo stretchy="false">)</mo>
      <mo separator="true">,</mo>
      <msub>
        <mi>y</mi> <mi>i</mi>
      </msub>
      <mo stretchy="false">)</mo>
      <mo separator="true">,</mo>
    </mrow>
  </semantics>
</math>

Figure 1: Example MathML for (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i).
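The node encoding could look roughly as follows. The 32/32/192 split of the 256 dimensions follows the text above, but the concrete vocabularies are made-up placeholders for illustration, not the ones derived from the corpus.

import numpy as np

TAGS = ["<unk>", "math", "mrow", "mi", "mo", "mn", "mfrac", "msub", "msubsup"]
ATTRIBUTES = ["<unk>", 'stretchy="false"', 'mathvariant="normal"', 'separator="true"']
CONTENT = ["<unk>"] + list("abcdefghijklmnopqrstuvwxyz0123456789()+-=,")

def encode_node(tag, attributes, text):
    # Map one MathML node to the 256-dimensional one-hot layout:
    # dimensions 0-31 tag, 32-63 attributes, 64-255 content characters.
    x = np.zeros(256, dtype=np.float32)
    x[TAGS.index(tag) if tag in TAGS else 0] = 1.0
    for attr in attributes:
        x[32 + (ATTRIBUTES.index(attr) if attr in ATTRIBUTES else 0)] = 1.0
    for ch in text:
        x[64 + (CONTENT.index(ch) if ch in CONTENT else 0)] = 1.0
    return x

# e.g. the <mi>n</mi> leaf from Figure 1:
features = encode_node("mi", [], "n")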
4.1.3 Embedding Network. We use a graph convolutional network to map the 256-dimensional tree to a 64-dimensional tree and compute a fixed-size embedding by averaging over all nodes. We use a network architecture with 5 layers that use a hidden dimensionality of 256. Each layer computes a linear transformation of the inputs, computes the mean over all neighborhoods (□ = mean) and applies the ReLU activation to the outputs. After each layer, we use batch normalization.

The network is trained using the similarity task proposed by Pfahler et al. [11], which rates two equations as similar when they appear in the same paper. Using this cheap proxy label for similarity, we can train a Siamese network [1, 8]. The detailed procedure for training the embedding network is beyond the scope of this work.
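A sketch of such an architecture using torch-geometric is shown below. The GraphConv layer is used here as a stand-in for the abstractly described layer (linear transformation plus mean aggregation); the exact layer type and the Siamese training loop of the deployed model are not reproduced.

import torch
from torch_geometric.nn import GraphConv, global_mean_pool

class EquationEmbedder(torch.nn.Module):
    # Five graph-convolution layers with mean aggregation, ReLU and batch norm,
    # 256 hidden units and a 64-dimensional output, followed by mean pooling.
    def __init__(self, in_dim=256, hidden=256, out_dim=64, num_layers=5):
        super().__init__()
        dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
        self.convs = torch.nn.ModuleList(
            [GraphConv(dims[i], dims[i + 1], aggr="mean") for i in range(num_layers)])
        self.norms = torch.nn.ModuleList(
            [torch.nn.BatchNorm1d(dims[i + 1]) for i in range(num_layers)])

    def forward(self, x, edge_index, batch):
        for conv, norm in zip(self.convs, self.norms):
            x = norm(torch.relu(conv(x, edge_index)))
        return global_mean_pool(x, batch)    # one 64-dim embedding per tree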
4.2    Qualitative Study
In this section we present a qualitative study of our proposed method in the context of math retrieval.

4.2.1 Rendering and Color Palettes. First we want to discuss some peculiarities of MathML trees. The advantage of these trees is that the underlying XML format of our representation is a markup language. Hence there is a native way to visualize them by rendering the equation with a corresponding markup processor like Katex or MathJax³. We can add additional style information to the original markup to color the nodes according to our computed visualization. This way we obtain beautiful visualizations without worrying about issues like tree layouting. One problem with this approach is that we can only visualize some of the nodes. In particular, with some exceptions like root or fraction bars, we will color only leaf nodes, as most inner nodes belong to XML tags that do not directly output visible symbols. For instance, in Figure 1 a grouping tag such as <mrow> corresponds to a node that does not output any symbols on its own, but only encapsulates the symbols of its children.

One way to mitigate this issue is to set the color intensity of the i-th node to the sum of all color intensities on the path from the root to the i-th node. However, we have seen that this puts too much emphasis on nodes that are deeper in the tree. Hence we currently omit all hidden nodes from the visualization.

We use max-aggregation of the gradients and compute colors by linear interpolation, where black font indicates the most significant parts and grey parts are less significant.

³ http://mathjax.org
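One way this coloring step could be implemented is to set the standard MathML mathcolor attribute on every leaf node before handing the markup to the renderer. The helper below is an illustrative sketch, not the exact code of our system.

import xml.etree.ElementTree as ET

def colorize_mathml(mathml_string, leaf_scores):
    # Attach a grey level to every leaf node via the 'mathcolor' attribute so that
    # a renderer displays important symbols in darker font.
    # leaf_scores maps leaf index -> score in [0, 1].
    root = ET.fromstring(mathml_string)
    leaves = [el for el in root.iter() if len(list(el)) == 0 and (el.text or "").strip()]
    for i, el in enumerate(leaves):
        grey = int(200 * (1.0 - leaf_scores.get(i, 0.0)))   # 0 = black, 200 = light grey
        el.set("mathcolor", "#{0:02x}{0:02x}{0:02x}".format(grey))
    return ET.tostring(root, encoding="unicode")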
Example 1: Example Query (Taylor Approximation)
Example 2: Example Query (Empirical Risk Minimization)
Example 3: Example Query (Multi-Layer Perceptron)
Example 4: Example Query (Submodular Functions)
Example 5: Example Query (Rademacher Complexity)
Example 6: Example Query (Hoeffding's Bound)

4.2.2 Results. We find semantically related equations by applying annoy to create an index structure for nearest neighbor retrieval with dot-product similarity. The retrieved equations will not be equivalent or strictly equal, but depict similar concepts. We analyze the visualizations for the nearest neighbor as output by the index. We compare the visualization based on the forward pass with the visualization based on the backward pass. In Examples 1-6 we present pairs of query (first row) and nearest neighbor and color the result according to the forward pass (second row) and backward pass (third row). All examples are from the context of machine learning, which allows us to judge the results based on our background knowledge of the research area. We now discuss each example in detail.

In Example 1, we have queried the definition of the first-order Taylor approximation. The embedding model retrieved the second-order Taylor approximation as nearest neighbor. Both visualizations put the emphasis on the ∇ symbol, the symbol commonly used for gradients. Interestingly, the backward visualization indicates that the H symbol, commonly used for the Hessian matrix, i.e. the second-order derivative, also contributes to the overall similarity score. Apparently the model has learned that both symbols appear frequently together, e.g. in the context of differentiable functions.

The query in Example 2 is the empirical risk functional commonly used in machine learning. Both approaches highlight that we have a sum over objects subscripted by i. In addition, the backward approach highlights the ˆ· symbol commonly used to indicate that a random quantity is estimated using a finite sample. The model seems to have learned that it frequently co-occurs with estimators of the form (1/n) Σ_{i=1}^{N} z_i.

Example 3 shows a query for the definition of a fully connected neural network layer. The forward visualization focusses on the variables W, B for weight matrix and bias vector, while the backward variant focusses on the superscripted index indicating the number of the layer l.

In Example 4, we see the definition of gain used in the context of submodular function maximization, where both visualizations highlight the ∪ symbol. There is almost no difference between the two variants; most notable is the stronger emphasis on the opening curly brace in the forward visualization.

For Example 5, we have queried the definition of the empirical Rademacher complexity, a measure used e.g. in statistical learning theory. Both approaches highlight the sup operation, which is indeed essential for the definition. Without the sup, the expected value would trivially amount to 0. The forward visualization seems to highlight the combination of expectation and supremum, whereas the backward variant highlights mostly the supremum and ε_i. This indicates that the model has correctly learned that both σ and ε are frequently used for Rademacher averages.

The last Example 6 shows a concentration inequality, more precisely Hoeffding's bound, which bounds the deviation of a sample mean from the true expected value of a random variable X. Both visualizations highlight the ¯· symbol that is frequently used to indicate sample means and hence is related to the ˆ· symbol for empirically estimated quantities used by the query. Additionally, both visualizations highlight the X that is generally used for random variables in the statistics literature. It is interesting to note that the forward visualization also places emphasis on the ≤ operator, while the backward visualization mostly highlights the ≥.

Overall we see that the forward visualizations spread the emphasis more evenly over the tokens than the backward pass. We think that with each layer of the encoder network the information in the nodes becomes less local and more global, as the feature vector at each node is a function of all its neighbors on the previous layer. This mixing makes it more difficult to highlight where the important information is originally coming from. The backward-based approach does not have this issue. Both approaches put a lot of emphasis on math operators like equal or plus signs.
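For completeness, the retrieval step mentioned at the beginning of this subsection can be sketched as follows; the embedding dimensionality follows Section 4.1.3, while the number of trees is an arbitrary placeholder.

from annoy import AnnoyIndex

def build_index(train_embeddings, n_trees=50):
    # Build an annoy index over the 64-dimensional train-set embeddings
    # using dot-product similarity.
    index = AnnoyIndex(64, "dot")
    for i, vector in enumerate(train_embeddings):
        index.add_item(i, vector)
    index.build(n_trees)
    return index

def nearest_neighbors(index, query_embedding, k=10):
    # Return the k approximate nearest neighbors of a test-set query embedding.
    return index.get_nns_by_vector(query_embedding, k, include_distances=True)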
5    CONCLUSION AND OUTLOOK
In this work we have proposed two different approaches for visualizing similarity computations between embeddings computed by graph convolutional networks. We have applied our method in the domain of math retrieval. Both methods have allowed us to gain insights into the way our model scores similarities. The visualizations indicate that our model was able to learn conventions and notations to infer the context in which equations appear.

We are confident that the methods proposed in this work are also useful in other application domains. For instance, when we work with graph representations of molecules [3] and use geometric deep learning to solve classification or metric learning tasks, we can color two-dimensional or three-dimensional illustrations of the molecules using our methods.

In the future we want to also visualize the alignment between query and result. A possibility is building interactive visualizations that allow us to focus on individual symbols of the equation and highlight how the model judges the similarity for each individual token.

ACKNOWLEDGMENTS
This work has been supported by Deutsche Forschungsgemeinschaft (DFG), as part of the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis".

REFERENCES
 [1] Vassileios Balntas, Edgar Riba, Daniel Ponsa, and Krystian Mikolajczyk. 2016. Learning local feature descriptors with triplets and shallow convolutional neural networks. In Proceedings of the British Machine Vision Conference (BMVC), Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith (Eds.).
 [2] Chaofan Chen and Cynthia Rudin. 2018. An Optimization Approach to Learning Falling Rule Lists. In International Conference on Artificial Intelligence and Statistics.
 [3] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems. 2224–2232.
 [4] Matthias Fey and Jan E. Lenssen. 2019. Fast Graph Representation Learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds.
 [5] Riccardo Guidotti, Anna Monreale, Stan Matwin, and Dino Pedreschi. [n.d.]. Black Box Explanation by Learning Image Exemplars in the Latent Feature Space. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019.
 [6] Will Hamilton, Zhitao Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems. 1024–1034.
 [7] Behrooz Mansouri, Douglas W Oard, C Lee Giles, and Richard Zanibbi. [n.d.]. Tangent-CFT: An Embedding Model for Mathematical Formulas.
 [8] Hossein Mobahi, Ronan Collobert, and Jason Weston. 2009. Deep Learning from Temporal Coherence in Video. Proceedings of the 26th Annual International Conference on Machine Learning (ICML) (2009), 737–744.
 [9] Christoph Molnar. 2019. Interpretable Machine Learning.
[10] Christopher Morris, Martin Ritzert, Matthias Fey, William L Hamilton, Jan Eric Lenssen, Gaurav Rattan, and Martin Grohe. 2019. Weisfeiler and Leman go neural: Higher-order graph neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4602–4609.
[11] Lukas Pfahler, Jonathan Schill, and Katharina Morik. 2019. The Search for Equations – Learning to Identify Similarities between Mathematical Expressions. In Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019.
[12] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[13] Stefan Rüping. 2006. Learning interpretable models. (2006).
[14] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2013. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. CoRR (Dec. 2013).
[15] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. 2014. Striving for Simplicity: The All Convolutional Net. Technical Report.
[16] Abby Stylianou and Robert Pless. 2019. Visualizing Deep Similarity Networks. Technical Report. arXiv:1901.00536v1
[17] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. 2019. Dynamic Graph CNN for Learning on Point Clouds. ACM Transactions on Graphics (TOG) 38, 5 (2019), 1–12.
[18] Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In Computer Vision – ECCV 2014, David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars (Eds.). Springer International Publishing, 818–833.