An Analysis of Variable-Size Vector Based
Approach for Formula Searching
Pankaj Dadure, Partha Pakray, and Sivaji Bandyopadhyay
Department of Computer Science and Engineering
National Institute of Technology Silchar, India
{krdadure, parthapakray, sivaji.cse.ju}@gmail.com
Abstract. Continuously increasing research in science, engineering, and
technology has generated huge amounts of textual and mathematical data.
Research on the retrieval and searching of textual data has achieved
state-of-the-art results, whereas the searching and retrieval of
mathematical information is still at an early stage and requires
significant improvement. Motivated by the concepts of formula embedding
and the term-document matrix, in this paper we introduce a variable-size
formula embedding approach in which a formula is transformed into a
variable-size vector. Each weight in the vector represents the occurrence
count of an entity and corresponds to that entity's position in the Bit
Position Information Table (BPIT). The proposed approach has been tested
on the Math Stack Exchange formula data of the ARQMath task, and its
proficiency is reported in terms of the nDCG′, MAP′, and Precision at 10
measures. The obtained results show that the variable-size formula
embedding approach requires significant improvement to retrieve
syntactically and semantically similar formulas.
Keywords: Searching · Math Stack Exchange · Term-Document Matrix
· Formula Embedding.
1 Introduction
Math formulas are key constituents of any scientific document or math-based
question answering post, communicating the idea behind it. Nowadays, the web
is a well-known information sharing platform where users share their
knowledge and opinions. This process of information sharing creates a huge
amount of data consisting of text, images, videos, etc. The data generated
on knowledge sharing sites mainly contains textual and mathematical
information, and retrieving such information based on a mathematical formula
is a complex and laborious task. A number of textual information retrieval
systems have been developed and have demonstrated their retrieval potential.
On the contrary, Mathematical Information Retrieval
(MIR) has faced many barriers that lead to the retrieval of irrelevant
information. For instance, mathematical symbols follow predefined scripting
styles which are qualitatively different from linear textual information.
Some mathematical notations possess alternative representations: a
permutation of k objects chosen from n distinct objects can be written as
$P^n_k$, $_nP_k$, $P_{n,k}$, ${}^nP_k$, or $P(n, k)$. In scientific
documents, a mathematical notation sometimes holds multiple meanings. For
instance, P(x) depicts either the probability of event "x" or the
multiplication of variables "P" and "x". In such cases, context is a
significant factor in delivering the actual meaning of the notation. In the
retrieval of mathematical information, the user query expresses the user's
need, and a well-formed query may lead to a more relevant search and satisfy
the user's needs. To address the above-mentioned obstacles, several
approaches have been implemented and have advanced the research on
mathematical information retrieval, although there is still huge scope for
improvement in this field.
The ARQMath lab at CLEF 2020¹ aims to push the state of the art in
evaluation design for math-aware information retrieval, and seeks to support
the development and ultimate deployment of new techniques. Two tasks, namely
Answer Finding (the main task) and Formula Search (the secondary task), are
organized by the ARQMath lab at CLEF 2020. In the answer finding task, the
system returns a ranked list of answer posts for a user query selected from
the question posts of Math Stack Exchange² (MSE), whereas the formula search
task returns a ranked list of formulas from MSE which are syntactically or
semantically similar to the queried formula. Of these two tasks, we focused
on the formula search task and contributed a variable-size vector based
technique. The data provided by the ARQMath organizers contains posts
holding both textual information and mathematical formulas. Per the
flexibility provided by the task organizers, participants can use both
textual and formula information, only textual information, or only formulas.
In our proposed approach, we have used only the formulas extracted from the
Math Stack Exchange site.
The major contributions of the proposed work are as follows:
• The proposed approach transforms a formula into a variable-size vector
of weights.
• Each weight of the vector represents the occurrence count of a particular
entity in a formula and corresponds to that entity's position in the BPIT.
• The proposed approach has been tested on formulas expressed in
Presentation MathML format, extracted from Math Stack Exchange.
The structure of the paper is as follows: Section 2 describes related work
in the domain of formula retrieval, Section 3 illustrates the corpus
description and proposed methodology, Section 4 covers the experimental
results and their analysis, and Section 5 concludes the paper with key
future directions.
1 https://www.cs.rit.edu/~dprl/ARQMath
2 https://math.stackexchange.com
2 Related Work
The retrieval of mathematical information has seen growing attention in the
last decade, and math formula search has been studied since the mid-1990s.
Past research has addressed a number of mathematical information retrieval
challenges; however, there remains huge scope for improvement. The formula
embedding approach of [12] transformed a formula into a vector of size 202,
where "1" depicts the presence and "0" depicts the absence of a particular
entity in the formula. In the three-layer model of formula representation
and searching [4], the first layer chooses candidates using spectral
matching over tree node pairs, the second layer aligns the query with
candidates and estimates similarity based on spectral matching, and the
third layer estimates the similarity score of the two representations using
linear regression.
To furnish retrieval results for formula queries, Tangent-CFT [10] uses two
effective representations of mathematical information, Symbol Layout Trees
(SLTs) and Operator Trees (OPTs), which may contribute to more accurate
retrieval. The symbol layout tree focuses on the appearance of the formula,
whereas the operator tree focuses on its content. Tuples are generated based
on the paths between pairs of symbols and embedded using the fastText n-gram
embedding model. For state-of-the-art performance, Tangent-CFT combines the
SLT and OPT embeddings and uses structural similarity for partial-match
formula retrieval. In the research field of MIR, the transformation of
mathematical language into natural language is an arduous process. To
address this transformation, AnnoMathTeX, a recommender system [14], enables
formula annotation by assigning meanings to formula identifiers from the
surrounding text.
In mathematical information retrieval, relevance measurement is a
significant ingredient. For instance, estimating the cosine similarity
between indexed formulas and the query formula can lead to useful relevant
results [3]. The state-of-the-art results of deep learning techniques in
textual information retrieval motivate researchers to incorporate them into
the retrieval of mathematical information. Building on the positive impact
of LSTMs on sequence-to-sequence problems, an LSTM neural network based
formula entailment approach [11] formulates entailment between indexed
formulas and the query formula. With substructural matching of formulas,
formula retrieval requires high computation time; to speed up sub-structural
similarity search, dynamic pruning strategies and a specialized inverted
index have produced noticeable outcomes [17]. For further improvement, the
query structure representation is associated with posting lists to boost the
overall outcomes of sub-tree matching. To increase the accessibility of
scientific documents, the MathAlign system [1] introduced a rule-based
approach which extracts the LaTeX form of a formula and links the
identifiers of the extracted formula to their textual descriptions.
To analyze the frequency distributions of mathematical expressions in a
large scientific corpus, Greiner-Petter et al. [6] applied Zipf's law to a
text corpus. Motivated by the similarity to linguistic properties, they
introduced a novel approach to rank formulae by their relevance via a
customized version of the BM25 ranking function. They also demonstrated the
applicability of the results by presenting autocompletion for math inputs as
a first contribution to math recommendation systems.
Motivated by the emerging association between mathematical formulas and
their textual context in scientific documents, the topic model TopicEq [15]
generates the context from a mixture of latent topics, while the equation is
generated by an RNN that depends on the latent topic activations, enabling
intelligible processing of equations by considering the relationship between
mathematical equations and topics. To investigate the feasibility of neural
representation techniques in MIR, the symbol2vec approach [5] learns vector
representations of formula symbols. For more refined results, textual
information was combined in the Formula2vec model, achieving better
retrieval performance. To learn distributed representations of equations,
the unsupervised equation embedding approach (EqEmb) [9] treats an equation
as a singleton word. The semantic representations of mathematical equations
and their surrounding words are embedded, and the obtained results are
compared with the CBOW, PV-DM, and GloVe models, where EqEmb-U achieves the
highest performance. An exploratory investigation of the effectiveness and
use of word embeddings [7] trained word2vec models on DLMF and arXiv with
slightly different approaches for embedding math. The DLMF-trained model
discovered mathematical term similarities and term analogies and generated
query expansions, while the arXiv-trained model was beneficial for
extracting textual descriptive phrases for math terms. These word embedding
models mainly focused on term similarity, math analogies, concept modeling,
query expansion, and knowledge extraction.
3 Methodology
3.1 Data Description
To evaluate the performance of the proposed formula retrieval approach, the
task organizers provide a dataset collected from the knowledge sharing
platform Math Stack Exchange (MSE). The dataset comprises question, answer,
and comment posts. Per the flexibility on data, task participants can use
only formulas, only text, or both (formulas and textual information) to
perform the formula search. In the dataset, formulas are represented in
three different formats: LaTeX, Presentation MathML, and Content MathML.
These formulas were extracted from Math Stack Exchange posts from the years
2010-2018. The numbers of formulas in the LaTeX, Presentation MathML, and
Content MathML formats are 28,320,920, 26,075,012, and 25,366,913, with
sizes of 1.5 GB, 11.5 GB, and 10.9 GB respectively. Each format has five
distinct attributes, namely formula id, post id, thread id, type, and
formula. To execute the formula search task, we used only the formula
dataset, and of the three formula representation files, only the
Presentation MathML format. The dataset is available at the official site
of the ARQMath task³. Metadata about the formula dataset is shown in
Table 1.
Table 1. Data Description

Corpus:          Math Stack Exchange
Type:            Formula
Format:          Presentation MathML
Size:            11.5 GB
No. of formulas: 26,075,012
3.2 Query Dataset
To estimate the performance of the proposed work, the task organizers
provided 87 mathematical formulae, each selected from a question topic of
Task 1. The topics (queries) for the formula search task are provided in an
XML file with a predefined format, as shown in Figure 1. Each topic in the
XML file is delimited by <Topic> tags inside a <Topics> element, and each
topic has a unique topic number B.x, where "x" is the topic number.
Formula_Id gives the id of the formula, Latex gives the LaTeX representation
of the formula, Title gives the title of the question post from which the
formula was selected, Question gives the question body from which the
formula was selected, and Tags gives the comma-separated tags of the
question.
Fig. 1. Topic representation
3 https://www.cs.rit.edu/~dprl/ARQMath/
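To make the topic format concrete, the following is a minimal Python sketch
of how such a topic file could be read. The tag names and all sample values
are assumed from the field list above, not copied from the official topic
file.

# A minimal sketch of parsing the formula-search topic file with the
# standard library. Tag names and sample values are assumptions based on
# the fields described in Section 3.2.
import xml.etree.ElementTree as ET

sample = """<Topics>
  <Topic number="B.1">
    <Formula_Id>q_1</Formula_Id>
    <Latex>(1+i\\\\sqrt{3})^{1/2}</Latex>
    <Title>Hypothetical question title</Title>
    <Question>Hypothetical question body containing the formula.</Question>
    <Tags>complex-numbers,radicals</Tags>
  </Topic>
</Topics>"""

for topic in ET.fromstring(sample).iter("Topic"):
    print(topic.get("number"),           # unique topic number, e.g. B.1
          topic.findtext("Formula_Id"),  # id of the query formula
          topic.findtext("Latex"))       # LaTeX form of the query formula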
3.3 System Architecture
The system architecture of the proposed approach is shown in Figure 2. It is
inspired by the existing formula embedding approach [12] and the
term-document matrix [2], where each module works interdependently to make
retrieval faster and search more accurate.
Fig. 2. Proposed system architecture
3.3.1 Formulas: The formulas are extracted from Math Stack Exchange posts
from 2010-2018 in three different formats: LaTeX, Presentation MathML, and
Content MathML. To investigate the ability of the proposed approach in the
formula search task, we used only the Presentation MathML format.
3.3.2 Formula Preprocessing: The prime task of the formula preprocessing
module is to transform formulas into a unified form by removing irrelevant
elements and attributes. In this process, the preprocessing module trims
attributed tags to their root form; for example, an <mo> tag carrying
attributes such as symmetric="true" is trimmed to a bare <mo>. An example of
the preprocessed tags is shown in Table 2. The formula preprocessing module
also discards a few Presentation MathML tags, as these tags do not
contribute much to the semantics of the mathematical notation.

Table 2. An example of preprocessing done by the formula preprocessor

Original: <mo symmetric="true">    Preprocessed: <mo>
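A minimal sketch of this normalization step follows, assuming the trimming
can be done with regular expressions over the raw Presentation MathML
string; the discarded-tag set here is illustrative, not the system's
definitive list.

# Sketch of the preprocessing step: trim tags to their root form by
# stripping attributes, then drop layout-only tags. The DISCARD set is
# an assumption for illustration.
import re

DISCARD = {"mspace", "mstyle"}  # assumed layout-only tags

def preprocess(mathml: str) -> str:
    # Trim attributed tags to their root form:
    # e.g. <mo symmetric="true"> -> <mo>
    mathml = re.sub(r"<(\w+)\s[^>]*>", r"<\1>", mathml)
    # Drop layout-only tags that carry little semantic content.
    for tag in DISCARD:
        mathml = mathml.replace(f"<{tag}>", "").replace(f"</{tag}>", "")
    return mathml

print(preprocess('<mo symmetric="true">(</mo>'))  # -> <mo>(</mo>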
3.3.3 Formula Embedding: The proposed formula embedding approach is
motivated by the existing Bit Position Information Table (BPIT) [12] and the
term-document matrix [2], and transforms formulas into variable-size vectors
of weights. Each weight of the vector represents the occurrence count of a
particular entity in a formula and corresponds to that entity's position in
the BPIT. The process of transforming a formula into a variable-size vector
is shown in Figure 3. In the proposed formula embedding approach, the
entities in a formula are categorized into three categories based on their
tags: entities tagged by <mi> come under the first category, entities tagged
by <mo> come under the second category, and the third category holds the
essential MathML tags which contribute to the semantics of the formula. In
the BPIT, the <mi> entities hold positions 0-25, 57-65 & 71-100, the <mo>
entities hold positions 26-45, 66-70 & 101-149, and positions 46-56 hold the
essential MathML tags. As shown in Figure 3, the vector 3[23, 24, 25],
2[26, 30], 2[47, 49] is generated from the formula $x^2 + y^2 = z^2$, where
3, 2, and 2 are the occurrence counts of the <mi>, <mo>, and essential
MathML entities respectively, and 23, 24, 25, 26, 30, 47, and 49 are the bit
positions of those entities in the BPIT.
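A minimal sketch of this embedding step follows. The BPIT excerpt is a toy
mapping assumed for illustration, and the sketch assumes each weight counts
the distinct occupied positions in its category, which reproduces the worked
example above.

# Sketch of the variable-size formula embedding. BPIT below is a toy
# excerpt; the full table in [12] spans positions 0-149 as described in
# the text, and these entries are assumed for illustration only.
from collections import defaultdict

BPIT = {"x": ("mi", 23), "y": ("mi", 24), "z": ("mi", 25),
        "+": ("mo", 26), "=": ("mo", 30),
        "msup": ("essential", 47), "mrow": ("essential", 49)}

def embed(entities):
    """Group BPIT positions by category; each weight is the number of
    occupied positions in that category."""
    groups = defaultdict(set)
    for entity in entities:
        category, position = BPIT[entity]
        groups[category].add(position)
    return [(len(pos), sorted(pos)) for pos in groups.values()]

# Entities extracted from the Presentation MathML of x^2 + y^2 = z^2:
print(embed(["x", "y", "z", "+", "=", "msup", "msup", "msup", "mrow"]))
# -> [(3, [23, 24, 25]), (2, [26, 30]), (2, [47, 49])]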
3.3.4 Indexer: In the process of formula searching, effective indexing is
the key constituent for speeding up the search. After a formula is
successfully transformed into a variable-size vector, the indexer module
adds the formula vector to the index. Each index entry stores three fields,
namely the embedded formula vector, the formula id, and the id of the post
from which the formula originated.
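A minimal sketch of the index layout, assuming a simple in-memory list of
records; a deployed system would use an optimized index structure.

# Sketch of the indexer: each entry stores the embedded vector together
# with the formula id and the id of the post it originated from.
index = []

def add_to_index(vector, formula_id, post_id):
    index.append({"vector": vector,
                  "formula_id": formula_id,
                  "post_id": post_id})

# Hypothetical ids for the x^2 + y^2 = z^2 example above:
add_to_index([(3, [23, 24, 25]), (2, [26, 30]), (2, [47, 49])],
             formula_id="f_101", post_id="p_202")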
3.3.5 Query Preprocessing: To test the formula searching effectiveness of
the proposed approach, the ARQMath task organizers provided the queries
(topics) in LaTeX format, as described in Section 3.2. Since we selected the
Presentation MathML format for formula embedding, to maintain a unified
structure between formulas and queries we transformed the queries into
Presentation MathML format using the Demo MathType⁴ tool.
Fig. 3. Formula to Vector Transformation
4 http://www.wiris.com/editor/demo/en/developers
3.3.6 Query Embedding: The query embedding module converts the preprocessed
Presentation MathML query formula into a variable-size vector. For query
vector generation, the query embedding module uses the same BPIT entity
positions [12]. In the generated vector, each weight represents the
occurrence count of a particular entity in the query formula and corresponds
to its position in the BPIT. In MIR, users express their need in the form of
a formula and expect the system to return documents/posts which contain
syntactically and semantically similar formulae.
3.3.7 Searcher and Ranker Module: The main objective of the searcher module
is to search for relevant formulas that are syntactically or semantically
similar to the query formula and satisfy the user's need. Searching for
relevant information is a time-consuming process and requires effective
formula representation and indexing techniques [8]. In the proposed
architecture, the searcher module compares the query formula vector with all
the indexed formula vectors and computes a similarity score for each indexed
formula vector. For the similarity calculation, the proposed approach
compares each bit of the query vector with the bits of a formula vector; the
formula containing the maximum number of similar bits receives the maximum
similarity score. The process of similarity calculation is shown in
Figure 4, where the occurrence counts of the <mi>, <mo>, and essential
MathML entities of the formula vector are compared with those of the query
vector. The reason for considering occurrence counts is to help assign
priority to those formulae which have a similar number of <mi>, <mo>, and
essential MathML entities.
After a successful comparison and similarity calculation between the formula
vectors and the query vector, the ranker module retrieves and ranks posts
(the posts which contain the searched formula) by descending similarity
score. A post of MSE that contains more than one formula similar to the
query formula is assigned higher priority than a post which contains such a
formula only once. As the final search result for a query formula, we
retrieve the top 20 MSE posts containing relevant formulas.
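A minimal sketch of the search step under a simplified scoring rule: shared
BPIT positions raise the score and mismatched occurrence counts lower it.
This is a stand-in for the exact comparison of Figure 4, not a reproduction
of it.

# Sketch of the searcher: score indexed vectors against the query vector
# and keep the top-k. The scoring rule is a simplified assumption.

def similarity(query_vec, formula_vec):
    score = 0
    for (q_count, q_pos), (f_count, f_pos) in zip(query_vec, formula_vec):
        score += len(set(q_pos) & set(f_pos))  # shared entity positions
        score -= abs(q_count - f_count)        # penalize count mismatch
    return score

def search(query_vec, index, k=20):
    """Rank indexed formulas by similarity and keep the top-k posts."""
    scored = sorted(index,
                    key=lambda rec: similarity(query_vec, rec["vector"]),
                    reverse=True)
    return scored[:k]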
Fig. 4. The process of similarity calculation
4 Results and Discussion
The results of the proposed approach were estimated using 45 of the 87
provided queries [16]. For each query formula, the proposed approach
retrieves the top 20 posts containing relevant formulas and ranks them by
similarity score. A post holding more than one similar formula gets higher
priority than a post containing such a formula only once. The obtained
search results are stored in a TSV file with six attributes, namely query
id, formula id, post id, rank, score, and run number, where query id is the
query's unique id in the form "B.x" (x being the query number), formula id
is the unique identifier of the retrieved formula instance, post id is the
unique identifier of the post containing the formula, rank is the rank of
the retrieved formula, score is the similarity score between the query
formula and the indexed formula, and run number identifies the run submitted
by the participant.
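For illustration, one row of such a run file could be written as follows;
every value in the sample row is hypothetical.

# Writing one result row in the required TSV layout. Column order follows
# the attribute list above: query id, formula id, post id, rank, score,
# run number. All values below are hypothetical.
import csv
import sys

writer = csv.writer(sys.stdout, delimiter="\t")
writer.writerow(["B.12", "f_101", "p_202", 1, 37, "NLP_NITS-run1"])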
For evaluating the performance of the proposed approach, the trec_eval
tool⁵ is used, which compares the gold dataset (qrels file) with the result
set obtained from the proposed system. The obtained results of the proposed
work for the formula search task are shown in Table 3 [16]. The results of
all participants are ranked by the nDCG′ metric [13]. For a more generic and
comparative analysis, the results are also reported in terms of MAP′ and
P@10. The baseline system achieved the highest score compared to all the
participating teams. The obtained results disclose that the proposed
approach requires significant improvement to generate state-of-the-art
results for formula search and retrieval.
The nDCG′ measure is based on nDCG, a commonly used measure when graded
relevance judgments are available and a single-value figure is to be
produced over a series of ranking lists. Each retrieved post receives a gain
value (0, 1, 2, or 3), which is discounted as rank increases. The discounted
gain values are accumulated and then normalized to [0, 1] by dividing by the
maximum discounted cumulative gain possible, giving the normalized
Discounted Cumulative Gain (nDCG). The only difference when nDCG′ is
calculated is that unjudged posts are discarded before the measurement is
performed. In addition, nDCG′ produces a single measure with graded
relevance, while Precision@k and Mean Average Precision (MAP) require
binarized relevance judgments [13]. Alongside nDCG′, MAP′ is likewise
calculated after removing unassessed posts, as is precision over the top 10
posts (P@10).
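As a worked sketch, nDCG′ for a single ranked list can be computed as
follows, assuming the standard log2 rank discount; the judged posts and gain
values are illustrative.

# Sketch of nDCG': drop unjudged posts, then compute nDCG with graded
# gains (0-3) and a log2 rank discount. The judgments are illustrative.
import math

def ndcg_prime(ranked_posts, judgments):
    # nDCG' differs from nDCG only in discarding unjudged posts first.
    gains = [judgments[p] for p in ranked_posts if p in judgments]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(g / math.log2(i + 2)
                for i, g in enumerate(sorted(gains, reverse=True)))
    return dcg / ideal if ideal > 0 else 0.0

judgments = {"p1": 3, "p2": 0, "p4": 2}  # post id -> relevance grade
print(ndcg_prime(["p1", "p3", "p2", "p4"], judgments))  # p3 is unjudged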
Table 3. Formula search task obtained results (P marks a primary run)

Team      Approach          Data  P  nDCG′  MAP′   P@10
Baseline  Tangent-S         Math  √  0.506  0.288  0.478
DPRL      TangentCFTED      Math     0.420  0.258  0.502
DPRL      TangentCFT        Math     0.392  0.219  0.396
DPRL      TangentCFT+       Both  √  0.135  0.047  0.207
MIRMU     SCM               Math     0.119  0.056  0.058
MIRMU     Formula2Vec       Math  √  0.108  0.047  0.076
MIRMU     Ensemble          Math     0.100  0.033  0.051
MIRMU     Formula2Vec       Math     0.077  0.028  0.044
MIRMU     SCM               Math  √  0.059  0.018  0.049
NLP_NITS  FormulaEmbedding  Math  √  0.026  0.005  0.042
5 https://github.com/usnistgov/trec_eval
The results shown in Table 4 reveal that the proposed formula embedding
approach can retrieve exact-match formulas for a query formula. The outcomes
for $(1+\sqrt{3}i)^{1/2}$ include syntactically similar formulas, which
shows the positive impact of formula embedding on the retrieval of
mathematical information. The proposed formula embedding approach
efficiently retrieved all those formulae which hold the maximum number of
similar entities and ranked them by maximum similarity.
Table 4. Best relevant retrieved search results

Topic id: B.12
Query/Topic: $(1+i\sqrt{3})^{\frac{1}{2}}$
Retrieved formulas: $(1+\sqrt{3}i)^{1/2}$, $(1+i\sqrt{3})/2$,
$(1+i\sqrt{3})^{n}$, $z_3=-\frac{1}{2}(1-i\sqrt{3})$,
$\frac{1}{2}(1+i\sqrt{3})$
The results shown in Table 5 depict that the formula embedding approach has
the potential to retrieve partial-match formulas for a query formula. The
obtained results show that the proposed approach retrieves the parent
formula or a subformula which holds entities similar to the query formula.
Table 5. Partially relevant retrieved search results

Topic id: B.53
Query/Topic: $AB=1 \Rightarrow BA=1$
Retrieved formulas: $b=8 \Rightarrow a=4$, $ab=b \Rightarrow b=0$ or $a=1$,
$b=0 \Rightarrow a=3$, $ab=a0 \Rightarrow b=0$, $a=1 \Rightarrow b=2$
The frailty of the formula embedding approach is shown in Table 6, which
depicts that the approach sometimes retrieves totally irrelevant results.
The proposed approach effectively retrieves relevant formulas, but in some
cases it retrieves irrelevant formulas as well.
Table 6. Irrelevant retrieved search results

Topic id: B.63
Query/Topic: $\operatorname{lcm}(n_1,n_2)=\frac{n_1 n_2}{\gcd(n_1,n_2)}$
Retrieved formulas: $T_n=(0,\frac{1}{n})$, $B_n=(\frac{1}{n},1)$,
$A_n=(0,\frac{1}{n})$, $n_2=(\frac{1}{2},\frac{1}{2},1)$,
$I_n=(\frac{1}{2},\frac{2}{3})$
5 Conclusions and Future Scope
The objective of this paper is to analyze the performance of a formula
embedding approach which transforms a formula into a variable-size vector.
Each weight of the vector is the occurrence count of a particular entity in
the formula and corresponds to that entity's position in the BPIT. To
estimate the relevance between an indexed formula vector and the query
vector, each bit of the query vector is compared with the bits of the
indexed formula vector. The feasibility of this approach has been tested on
the Math Stack Exchange corpus of the ARQMath task, and the obtained results
reveal both its robustness and its frailness. The proposed approach is able
to retrieve syntactically similar formulas, subformulas, and parent
formulas. To compare the obtained results, the nDCG′ metric is considered
the primary measure; additionally, MAP′ and Precision at 10 are also
estimated for a more generic comparison.
Our future target is to improve the ranking mechanism by assigning
priorities to entities while calculating similarities. To reduce searching
and retrieval time, index optimization is another future task, which will
help speed up retrieval and formula searching. To enrich the efficiency of
the formula search process and the retrieval of semantically similar
formulas, textual data will be incorporated alongside the mathematical data.
Acknowledgment
The authors would like to express their gratitude to the Department of Computer
Science and Engineering and Center for Natural Language Processing, National
Institute of Technology Silchar, India for providing the infrastructural facilities
and support.
References
1. Alexeeva, M., Sharp, R., Valenzuela-Escárcega, M.A., Kadowaki, J., Pyarelal, A.,
Morrison, C.: MathAlign: Linking formula identifiers to their contextual natural
language descriptions. In: Proceedings of The 12th Language Resources and Eval-
uation Conference. pp. 2204–2212 (2020)
2. Anandarajan, M., Hill, C., Nolan, T.: Term-document representation. In: Practical
Text Analytics, pp. 61–73. Springer (2019)
3. Dadure, P., Pakray, P., Bandyopadhyay, S.: An empirical analysis on retrieval of
math information from the scientific documents. In: International Conference on
Communication and Intelligent Systems. pp. 301–308. Springer (2019)
4. Davila, K., Zanibbi, R.: Layout and semantics: Combining representations for
mathematical formula search. In: Proceedings of the 40th International ACM SI-
GIR Conference on Research and Development in Information Retrieval. pp. 1165–
1168 (2017)
5. Gao, L., Jiang, Z., Yin, Y., Yuan, K., Yan, Z., Tang, Z.: Preliminary exploration
of formula embedding for mathematical information retrieval: can mathematical
formulae be embedded like a natural language? arXiv preprint arXiv:1707.05154
(2017)
6. Greiner-Petter, A., Schubotz, M., Müller, F., Breitinger, C., Cohl, H., Aizawa, A.,
Gipp, B.: Discovering mathematical objects of interest—a study of mathematical
notations. In: Proceedings of The Web Conference 2020. pp. 1445–1456 (2020)
7. Greiner-Petter, A., Youssef, A., Ruas, T., Miller, B.R., Schubotz, M., Aizawa, A.,
Gipp, B., et al.: Math-word embedding in math search and semantic extraction.
Scientometrics pp. 1–30
8. Jansen, B.J., Rieh, S.Y.: The seventeen theoretical constructs of information
searching and information retrieval. Journal of the American Society for Infor-
mation Science and Technology 61(8), 1517–1534 (2010)
9. Krstovski, K., Blei, D.M.: Equation embeddings. arXiv preprint arXiv:1803.09123
(2018)
10. Mansouri, B., Rohatgi, S., Oard, D.W., Wu, J., Giles, C.L., Zanibbi, R.: Tangent-CFT: An embedding model for mathematical formulas. In: Proceedings of the 2019
ACM SIGIR international conference on theory of information retrieval. pp. 11–18
(2019)
11. Pathak, A., Pakray, P., Das, R.: LSTM neural network based math information
retrieval. In: 2019 Second International Conference on Advanced Computational
and Communication Paradigms (ICACCP). pp. 1–6. IEEE (2019)
12. Pathak, A., Pakray, P., Gelbukh, A.: A formula embedding approach to math
information retrieval. Computación y Sistemas 22(3), 819–833 (2018)
13. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with
incomplete relevance assessments. Information Retrieval 11(5), 447–470 (2008)
14. Scharpf, P., Mackerracher, I., Schubotz, M., Beel, J., Breitinger, C., Gipp, B.:
AnnoMathTeX: a formula identifier annotation recommender system for STEM documents. In: Proceedings of the 13th ACM Conference on Recommender Systems.
pp. 532–533 (2019)
15. Yasunaga, M., Lafferty, J.D.: TopicEq: A joint topic and mathematical equation
model for scientific texts. In: Proceedings of the AAAI Conference on Artificial
Intelligence. vol. 33, pp. 7394–7401 (2019)
16. Zanibbi, R., Oard, D., Agarwal, A., Mansouri, B.: Overview of ARQMath 2020: CLEF lab on answer retrieval for questions on math. In: ARQMath Lab @ CLEF 2020.
pp. 1–25 (2020)
17. Zhong, W., Rohatgi, S., Wu, J., Giles, C.L., Zanibbi, R.: Accelerating substructure
similarity search for formula retrieval. In: European Conference on Information
Retrieval. pp. 714–727. Springer (2020)