<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">XY-PHOC Symbol Location Embeddings for Math Formula Retrieval and Autocompletion</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Robin</forename><surname>Avenoso</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Rochester Institute of Technology</orgName>
								<address>
									<addrLine>1 Lomb Memorial Drive</addrLine>
									<postCode>14623</postCode>
									<settlement>Rochester</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Behrooz</forename><surname>Mansouri</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Rochester Institute of Technology</orgName>
								<address>
									<addrLine>1 Lomb Memorial Drive</addrLine>
									<postCode>14623</postCode>
									<settlement>Rochester</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Richard</forename><surname>Zanibbi</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">Rochester Institute of Technology</orgName>
								<address>
									<addrLine>1 Lomb Memorial Drive</addrLine>
									<postCode>14623</postCode>
									<settlement>Rochester</settlement>
									<region>NY</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">XY-PHOC Symbol Location Embeddings for Math Formula Retrieval and Autocompletion</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">940AF55DDA17DDB368EED866FEBAA771</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T20:42+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Character Embeddings</term>
					<term>Spatial Embeddings</term>
					<term>Formula Retrieval</term>
					<term>Formula Autocomplete</term>
					<term>Mathematical Information Retrieval (MIR)</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>We describe the participation of the XY-PHOC team from The Document and Pattern Recognition Lab (DPRL) from the Rochester Institute of Technology (RIT, USA) in ARQMath 2021 Task 2 (Formula Retrieval). We generalize a one dimensional spatial encoding for word spotting in handwritten document images, the Pyramidal Histogram of Characters or PHOC, to obtain the two-dimensional XY-PHOC, which provides robust spatial embeddings of symbols with modest storage requirements. XY-PHOC symbol location embeddings capture the relative position of symbols without the need to generate or store explicit edges between symbols. For ARQMath 2021, the XY-PHOC model obtains competitive results in formula similarity search. We also present new results using XY-PHOC for the related task of formula autocompletion from visual queries, in which target formula symbols may be added to the query in an any order.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The XY-PHOC team from The Document and Pattern Recognition Lab (DPRL) from the Rochester Institute of Technology (RIT, USA) participated in ARQMath 2021 Task 2. In this task, the participants are given a formula from questions in Task 1, and are asked to return the top-1000 relevant formulas for each topic <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b2">3]</ref>. These retrieved formulas are from the ARQMath Math Stack Exchange corpus, containing posted answers and questions from 2010-2018. Formula topic queries are then taken from posts in 2019 (ARQMath 2020) and 2020 (ARQMath 2021).</p><p>We provided one run for Task 2, using the XY-PHOC system <ref type="bibr" target="#b3">[4]</ref>. XY-PHOC uses a very simple relative spatial encoding to capture the locations of symbols in a formula at different levels of granularity. Embeddings for the location of individual symbols are stored in an inverted index, which can then be compared with symbol location embeddings for the query formula using cosine similarity.</p><p>Current state-of-the-art systems for formula similarity search rely on graph-structured data, such as in Tangent-S <ref type="bibr" target="#b4">[5]</ref> and Tangent-CFT <ref type="bibr" target="#b5">[6]</ref>. These systems require special indexing strategies in order to build inverted indexes on non-traditional keys. The scoring of these systems also requires several operations (e.g., a subtree alignment step in Tangent-S) or trained embedding models for tree paths (e.g., in Tangent-CFT).  The XY-PHOC representation requires less storage than does a graph-based model: in the XY-PHOC symbol location embeddings index keys are symbols, and postings consist of 29-bit binary vectors stored in 32-bit words, accompanied by an integer identifier for the associated formula (Section 2 provides details). Storing standard types in the index will allow for standard information retrieval tools and techniques to be utilized. The representation for XY-PHOC can be put into a standard inverted index used for document retrieval with only small modifications for scoring. Scoring matches for XY-PHOC embeddings is performed with an optimized rankequivalent cosine similarity.</p><p>Interestingly, the XY-PHOC retrieval model was designed with formula autocompletion rather than formula retrieval in mind, as initially we believed that formula search using symbol positions alone would be over-constraining and produce many false negatives/misses. The ARQMath results obtained indicate that this is not the case, possibly because the presence of a symbol anywhere in a formula is represented. The XY-PHOC symbol location retrieval model can be applied as an autocompletion model or a similarity search model simply by switching from conjunctive queries (requiring all query symbols) to disjunctive queries (requiring at least one query symbol).</p><p>Currently there are only two parameters in our model, controlling the number of regions and the function used to identify which regions a symbol belongs to. Scoring is performed using cosine similarity over binary vectors defined for individual symbols. Indeed, our symbol location vectors may be understood as a form of 'term location frequency' vectors. Our retrieval model is nearly identical to the TF portion of a standard TF-IDF retrieval model over words, but where the presence of symbols in multiple overlapping regions are captured, rather than the number of symbol occurrences. Levels are encoded by increasing numbers of regions: the first bit represents Level 1 (whole formula), the next two bits Level 2 (horizontal), followed by Level 2 ′ (vertical), and so on. Levels are illustrated in Figure <ref type="figure" target="#fig_0">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">XY-PHOC: Bidirectional Pyramidal Histogram of Characters</head><p>In this section we describe the XY-PHOC model, which generalizes Pyramidal Histogram of Character (PHOC) bit vectors representing character locations in words <ref type="bibr" target="#b6">[7]</ref> to capture horizontal and vertical symbol positions within formulas. An illustration of XY-PHOC symbol locations and their vector representation can be found in Figure <ref type="figure" target="#fig_0">1</ref>, and Figure <ref type="figure" target="#fig_1">2</ref> illustrates how the bit vector per symbol looks using the first four levels.</p><p>The Pyramidal Histogram of Characters (PHOC) embedding model generalized for XY-PHOC comes from the field of word spotting <ref type="bibr" target="#b6">[7]</ref>. In word spotting, the goal is to retrieve word images matching a query word given as either a character string (e.g., in UTF-8) or raster image. PHOC is a binary vector indicating the presence of a character within each horizontal region. For each level 𝑛 there are 𝑛 regions beginning with one word-level region (level 1), left and right regions after splitting the formula in half (level 2), followed by additional identically-sized splits up to 𝑛 regions. The presence of a symbol in region(s) at each level gives the represention of locations where the symbol appears. After region embeddings are concatenated, a total of 𝑛(𝑛 + 1) − 1 regions are represented. So for example, using up to 5 uniform splits in the horizontal and vertical directions, we obtain 5 • 6 − 1 = 29 regions, which will be represented as a 29 bit vector.</p><p>This embedding was inspired by how the redundant spatial information in the character embeddings make it suitable for capturing both coarse and precise locations for symbols in space, making it useful for visual formula autocompletion. As Level 1 captures all symbols present in a formula, this along with the redundant spatial information are well suited to finding formulas similar to a query, starting from the symbols they share.</p><p>The original PHOC embedding is very sparse, as it uses a vector the length of the alphabet per region to indicate which symbols are in that region. For math the symbol vocabulary is much larger than the Latin alphabet, making the vector even longer. To use XY-PHOC efficiently, we use an inverted index over symbols with each posting as a pair (𝑖𝑑, 𝑣) containing a formula identifier and a bit vector representing the XY-PHOC regions where that symbol appears.</p><p>For spatial regions, we use five levels in both the horizontal (X) and vertical (Y) directions, making a symbol-specific XY-PHOC vector 29 bits long, which we store in a 32-bit integer. Formulas tend to be much wider than tall; we found in experiments that we could better distinguish the vertical location of symbols using a single point, and that using a line to represent the horizontal extent of symbols provided helpful redundancy <ref type="bibr" target="#b3">[4]</ref>.</p><p>An example of a simple 4-level embedding is shown in Figure <ref type="figure" target="#fig_1">2</ref>. When encoding the binary XY-PHOC vectors, as seen in Figure <ref type="figure" target="#fig_0">1</ref>, symbols are represented by a horizontal line that spans the width of the symbol's bounding box, with the line positioned at the vertical center of the bounding box. If any point on this line is included in a horizontal region, the bit for that region is set to 1. Vertical region memberships are identified by the vertical center of the box (a single point). This encoding represents the presence of the symbol in each region, and if a symbol appears in a region more than once it is still represented by a 1. For each horizontal level 𝑛 greater than 1, there will be a level 𝑛 ′ which represents the vertical splits at that level.</p><p>We index each formula identified as visually distinct in the ARQMath corpus once using a single instance (e.g., aiming to have just a single entry for 𝑥<ref type="foot" target="#foot_1">2</ref> ). XY-PHOC symbol location vectors are computed starting from formulas given in L A T E X using the MathJax javascript library<ref type="foot" target="#foot_0">1</ref> . Once formulas have been converted to SVG images, we convert these to lists of symbols with bounding boxes, which are then used to compute our XY-PHOC symbol embeddings. In order to index the large ARQMath collection, several tools are employed. We perform indexing using Apache Spark 2 , using a distributed map-reduce implementation that ultimately produces the index in a text file. The index is then loaded into a Redis<ref type="foot" target="#foot_2">3</ref> database for use in retrieval. A second index maps formula ids to their original file, the normalization constant for each XY-PHOC vector as described in Section 2.1, and the number of symbols in the formula.</p><p>An important advantage of the reduced representation of XY-PHOC is that standard information retrieval techniques and tools can be used to generate an efficient and robust system. Other math formula retrieval systems that work on graph representations need to use custom solutions in order to index the paths that make up formula graphs, whereas the XY-PHOC embedding easily fits into standard tools for text-based search engines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Scoring using XY-PHOC Symbol Location Vectors</head><p>We are able to retrieve and score formulas in a manner similar to conventional TF-IDF retrieval over words -each query symbol is looked up in the index, XY-PHOC vectors from formulas containing the symbol are retrieved, and then compared to the query symbol XY-PHOC vector. Symbol location match counts are then accumulated across query symbols, and finally scaled by candidate formula size, giving preference to smaller formulas when two or more candidate formulas have the same number of matching symbol locations.</p><p>More concretely, to score candidate formulas, we follow the work of Sudholt et al. <ref type="bibr" target="#b7">[8]</ref> who demonstrated that in the word-spotting context, cosine similarity works well for scoring words represented as one-dimensional PHOCs.</p><p>For query vector a and candidate formula vector b the cosine similarity is:</p><formula xml:id="formula_0">𝑐𝑜𝑠 = 𝑎 • 𝑏 ||𝑎|| ||𝑏|| = ∑︀ 𝑛 𝑖=1 𝑎 𝑖 𝑏 𝑖 √︁ ∑︀ 𝑛 𝑖=1 𝑎 2 𝑖 √︁ ∑︀ 𝑛 𝑖=1 𝑏 2 𝑖 .<label>(1)</label></formula><p>A faster rank-equivalent similarity metric 𝑏𝑐𝑜𝑠 is defined as:</p><formula xml:id="formula_1">𝑐𝑜𝑠(a, b) 𝑟𝑎𝑛𝑘 = 𝑏𝑐𝑜𝑠(a, b) = 1 √︀ |b| 1 |a ∧ b| 1 (2)</formula><p>The dot product of two binary vectors is the Hamming weight (number of 1s) in the logical AND of the vectors, which is equivalent to the 𝐿 1 norm (|| 1 ). The normalization factor for query a is constant across candidate formulas, and so can be removed without altering the rank ordering. To accelerate computation, the normalization factor for b is pre-computed and stored in the formula (secondary) index for lookup at retrieval time.</p><p>Our first retrieval implementation is very simple. We perform term-at-a-time scoring, and do not make use of skip lists, score bounds (e.g., from MaxScore <ref type="bibr" target="#b8">[9]</ref>) or alternative indexing strategies (e.g., block-max <ref type="bibr" target="#b9">[10]</ref>). For ARQMath-1 (2020) topics, this resulted in an average query time of 566 seconds (i.e., roughly 9.5 minutes) per query using a Python implementation running on a desktop Linux system with an Intel i7-8700K CPU (3.7GHz) and 32GB RAM, using Redis <ref type="bibr" target="#b3">[4]</ref>. For a rapid prototype this was workable, but in the future, we hope to produce a much faster implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Formula Retrieval Results</head><p>In this section we will present the results for the XY-PHOC retrieval model on the ARQMath formula retrieval task <ref type="bibr" target="#b10">[11]</ref>. We show results for both queries from ARQMath 2020 and ARQMath 2021. Note that our model has only two parameters (number of regions and the symbol region membership function), and only a small amount of training/fitting on the ARQMath 2020 collection was performed <ref type="bibr" target="#b3">[4]</ref>.</p><p>For formula similarity search, we use disjunctive queries (weak AND) over symbol location vectors. This requires that at least one symbol present in the query is present in a candidate for retrieval. This assumes that relevant candidates share at least one symbol with the query formula.</p><p>There are 45 ARQMath 2020 topics. For ARQMath-2, the top results from all teams are shown in Table <ref type="table" target="#tab_1">1</ref>. Our model obtained the third-highest nDCG ′ score. In the table we see that XY-PHOC has a higher P ′ @10 than the baseline system and is within 1.7% in MAP ′ and 8% in nDCG ′ scores relative to the baseline system, Tangent-S <ref type="bibr" target="#b4">[5]</ref>. Tangent-S is a much more complex model, making use of path-based retrieval on both Symbol Layout Trees (SLTs) and Operator Trees (OPTs), followed by aligning the query formula with candidates in both representations before producing a final score. The best-performing system (LtRall) is obtained by re-ranking Tangent-S results after including additional OPT and SLT tree embeddings and tree edit distances within a learning-to-rank framework, adding complexity and computational expense.</p><p>ARQMath 2021 has 60 topics. The top results from all teams is shown in Table <ref type="table" target="#tab_1">1</ref>. Our model obtained the third-highest nDCG ′ , MAP ′ , and P ′ @10. It is interesting to see both the Tangent-S system and LtRall systems performed less well then the XY-PHOC model, while two systems that scored lower than XY-PHOC for ARQMath 2020 topics (MathDowsers and approach0) obtained the highest metrics for ARQMath 2021 topics. Across the ARQMath 2020 and ARQMath 2021 topics, the performance of XY-PHOC model remains relatively stable. Among the top-3 runs for ARQMath 2021, all the systems had very similar scores, with the nDCG ′ having a </p><formula xml:id="formula_2">Rank 𝑒 3𝑖𝜋/2 𝑖 = √ −1 1 𝑒 3𝑖𝜋/2 𝑖 = √ −1 2 𝑒 3𝜋𝑖/2 𝑖 = √ −1. 3 3𝑒 3𝑖𝜋/2 𝑖 = ± √ −1 4 3𝑒 3𝜋𝑖/2 ±𝑖 = √ −1 5 𝑒 2𝑖𝜋/3 𝑖 = √ −1,</formula><p>difference of 0.7%, MAP ′ having a difference of 3.8% and P ′ @10 having a difference of 5.5%. It is interesting that all three of these models use one formula representation for retrieval: OPT for approach0, SLT for MathDowsers, and XY-PHOC for our system. It would be interesting to see how different retrieved formulas are between these systems, and whether an ensemble would produce stronger results.</p><p>To understand the types of queries the XY-PHOC system performs well and poorly on, we present some queries with their top-5 retrieval results from the 2020 ARQMath queries, after removing the formulas that are not assessed and so not used in computing the prime (′) evaluation metrics. In Table <ref type="table" target="#tab_2">2</ref> topics B.27 and B.29 are shown. For both of these queries, the top-5 results have an nDCG ′ @5 of 1. The strong performance is due to the types of formulas marked with a relevance of High (3) -formulas with the same symbols, and roughly the same placement as the queries. For both queries, the top result is the query formula. Formulas in ranks 2-5 also contain the query symbols, with only slight changes in placement or additional punctuation such as a '. ' or a ', '.</p><p>Next we consider two queries that the system did not perform well on. XY-PHOC's preference for candidates with similar structure and symbols can be seen in the results for topic B.2 and B.32 presented in Table <ref type="table" target="#tab_3">3</ref>. Both Topics have an nDCG ′ @5 of 0, meaning no formulas in the collection annotated as relevant were returned in the top 5 after removing unassessed formulas. </p><formula xml:id="formula_3">𝑑𝑥 = 𝑒 𝑥+1 (𝑥 + 1) (∀𝑥)(∃𝑦)(∀𝑧)(𝑧 ∈ 𝑦 ⇐⇒ (∃𝑡)(𝑧 ∈ 𝑡 &amp;𝑎𝑚𝑝; 𝑡 ∈ 𝑥)). 2 𝑑𝑥 𝑑𝑡 = 𝑓 (𝑥) (1) 𝜓(𝑥) = ∃𝑦(𝑦 ∈ 𝑥) 3 𝑑𝑓 𝑑𝑥 = 𝑓 (𝑥, 𝑦) 𝜑(𝑥) = ∃𝑦 (𝑦 ∈ 𝑥) 4 𝑑𝑥 𝑑𝑡 = 𝑥(1 + 𝑥) (̸ ∃𝑥)𝑥 ∈ 𝐸 ∧ 𝐴 ⊆ 𝐸 → (̸ ∃𝑦)𝑦 ∈ 𝐴 5 𝑑𝑦 𝑑𝑥 = 𝑥 𝑥 (log 𝑥 + 1)</formula><p>- </p><formula xml:id="formula_4">1 𝑑𝑓 𝑑𝑥 = (𝑓 (𝑥) + 𝑥)𝑥 [¬∀𝑥 ∈ 𝑦 (𝑝(𝑥)] ⇐⇒ ∃𝑥 ∈ 𝑦 (¬𝑝(𝑥)). 2 𝑑 𝑑𝑥 𝑓 (𝑥) = 𝑐 • 𝑓 (𝑥 + 1) ∀𝑥∃𝑦(𝑝(𝑥) ∨ 𝑞(𝑦)) ⇐⇒ ∃𝑦∀𝑥(𝑝(𝑥) ∨ 𝑞(𝑦)) 3 𝑑 𝑑𝑥 𝑓 (𝑥) = 𝑥 𝑥 (ln(𝑥) + 1) 𝑥 𝐸(Q) 𝑦 ⇐⇒ ∃𝑄 ∈ Q (︀ 𝑥, 𝑦 ∈ 𝑄 )︀ 4 𝑑 𝑑𝑥 𝑓 (𝑥) = 𝑓 (𝑥) 𝑑 𝑑𝑥 + 𝑑𝑓 𝑑𝑥 𝑦 ∈ (𝑥) ⇐⇒ (𝑦) ⊆ (𝑥) 5 𝑑𝐹 𝑑𝑥 = 𝑓 (𝑥𝑦)(𝑥 2 + 𝑥𝑦 + 1) ∀𝑥∃𝑦𝑝(𝑥, 𝑦) ⇐⇒ ∃𝑥∀𝑦∃𝑧𝑅(𝑥, 𝑦, 𝑧)</formula><p>For topic B.32, after removing unrated formulas only 4 formulas remained in the results. To get a better understanding of the the formulas returned, we present results for these topics without removing unrated formulas in Table <ref type="table" target="#tab_4">4</ref>. Based solely on shared symbols and structure, it would appear these filtered formulas may be more similar to the query, and it is possible that results would be stronger if these formulas were annotated.</p><p>Having shown the extremes for topics that obtained nDCG ′ scores of 1.0 and 0, we now present two query topics with an nDCG ′ @5 &gt; 0.7 . These results are presented in Table <ref type="table" target="#tab_5">5</ref> using topics B.12 and B.14. For both of these topics, the top retrieval result matches the query formula, with a relevance of High (3). The relevance scores then decrease with rank. In these two queries, relevant results are retrieved at a high rank based on the symbols locations.</p><p>As originally hoped, the XY-PHOC model generally works well for exact item retrieval, as often the embedding with the highest score will be the query formula. We apply our model to the problem of formula autocompletion in the next Section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Formula Autocompletion Results</head><p>In this section we briefly present the results from formula autocompletion experiments reported by Avenoso <ref type="bibr" target="#b3">[4]</ref>. The XY-PHOC system was originally designed with the goal of autocompletion in mind, and is the first math formula autocomplete system that allows symbols to be input in any order. For example, a query such as shown in Figure <ref type="figure" target="#fig_1">2</ref> can be used to retrieve formulas </p><formula xml:id="formula_5">√ 3) 1/2 Rating 𝑦 = 𝑥𝑦 ′ + 1 2 (𝑦 ′ ) 2 Rating 1 (1 + √ 3𝑖) 1/2 3 𝑦 = 𝑥𝑦 ′ + 1 2 (𝑦 ′ ) 2 3 2 4(1 + √ 3𝑖) 1/2 2 𝑦 = 𝑥𝑦 ′ + 1 2 𝑦 ′2 3 3 (1 + 𝑖 √ 3)/2 1 𝑦𝑦 ′ = 1 2 (𝑦 2 ) ′ 0 4 (35 + 18𝑖 √ 3) 1/3 0 𝑥 = 1 2 𝑦 + 1 2 (−𝑦) 0 5 (1 ± 𝑖 √ 3)/2 1 𝑥 4 + 𝑦 4 = 1 2 (𝑥 2 + 𝑦 2 ) 2 + 1 2 (𝑥 2 − 𝑦 2 ) 2 0</formula><p>containing many different integrands between the integral and 'dx. ' Further, the query symbols can be either entered, or selected from an existing formula in any order. Query symbols do not need comprise a well-formed subexpression (e.g., an SLT), or to be entered as a text encoding from left-to-right (e.g., for L A T E X). Matching is purely spatial.</p><p>For autocompletion, we make some small but important changes from our formula similarity search: (1) we use conjunctive queries, requiring all symbols in the query to be present in all candidates, and (2) the number of symbols in each candidate formula must be no smaller than that for the query formula. This reflects that our autocompletion is intended to exactly match a portion of the target formula, and so formulas that do not meet these requirements are pruned during retrieval.</p><p>To study the effect of inputting symbols in different orders on XY-PHOC autocompletion performance, we tested four possible symbol input orders: Left-to-Right (a roughly 'standard' entry order), Right-to-Left, Outside-in alternating adding symbols from the left and right ends, and Middle-out starting at the middle symbol and alternating between adding a symbol on the left and right side of the query. The formulas used for autocompletion queries were taken from the set of visually distinct ARQMath formulas that have been assessed <ref type="bibr" target="#b3">[4]</ref>. Results for the first 102 queries are shown in this section. In Figure <ref type="figure">3</ref> we show the Mean Reciprocal Rank (MRR) grouped by the percentage of symbols input from the target formula, using groupings of 10 percent. MRR reflects how close target formulas are to the top rank. In an autocomplete system we generally want our target formula to obtain as high a rank as possible using as few symbols as possible, so that the user can easily find it. Note that an MRR of 50% reflects that the harmonic mean of the rank for the 102 query formulas is rank 2, 25% is the same for rank 4, etc.</p><p>Looking at Figure <ref type="figure">3</ref>, we see that the inside-outside order obtains substantially higher MRR scores throughout the input size range. Interestingly, the left-right order obtains the lowest score across the input size range, which might be caused by formulas being less unique at their left end (particularly given the stronger performance for entering symbols right-left), but we have not had time to confirm this. The improved performance for outside-in also reflects that anchoring the left and right ends of the XY-PHOC symbol location vectors after the first two symbols leads to more stable location vectors.</p><p>In terms of how these different input orders for formula autocompletion might be used, we see two possibilities. One is users placing symbols on a canvas, e.g., by recognizing a formula 0.022 0.12 0.22 0.32 0.41 0.51 0.61 0.71 0.8 0.9 </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>We have presented a bidirectional PHOC embedding <ref type="bibr" target="#b3">[4]</ref> that has been applied to the ARQMath 2021 Task 2 (Formula Retrieval). The system uses a simple embedding for relative symbol locations, in order to capture appearance without knowing the underlying writing or mathematical content for formulas. This simplified embedding can easily be used in standard search engines designed for text. For ARQMath-2, the XY-PHOC system ranked a close third based on highest nDCG ′ , MAP ′ and P ′ @10 metrics, using what is perhaps the most naive formula representation, based only on the spatial arrangement of symbols. We also showed how XY-PHOC may be used for autocompletion by switching from disjunctive (weak AND) queries to conjunctive queries. The results from our autocompletion experiment are promising, and illustrate how a system could be built using XY-PHOC for both formula retrieval and autocompletion. For similarity search, an obvious limitation of our approach is that formulas sharing no symbols (e.g., 𝑥 𝑦 and 𝑎 𝑏 ) cannot be used to retrieve one another. We believe unified matches can be implemented through additional symbols in the index that encode mathematical types and symbol alphabets at different levels of granularity (e.g., variable, greek letter, set operation). We also have not incorporated inverse term frequencies into our scoring model, but believe that weighting rarer symbols higher in a TF-IDF or BM25 <ref type="bibr" target="#b12">[13]</ref> manner may improve results substantially.</p><p>Our current naïve implementation is very slow for the whole ARQMath collection, but we believe this can be improved with a more sophisticated implementation along with the use of rank-safe (e.g., skip lists, MaxScore <ref type="bibr" target="#b8">[9]</ref>) and non-rank-safe (e.g., thresholded traversal of score-based sorted posting lists) optimizations.</p><p>The XY-PHOC symbol location retrieval model requires only symbol positions for indexing and search, without any representation of formula structure (visual, operation, or otherwise). This suggests that our approach might be applied to retrieving other two-dimensional notations and graphics (e.g., tables, plots, figures, chemical diagrams, etc.). It would also be worth exploring how additional levels of spatial partitioning beyond the currently used five vertical and horizontal splits improve or adversely impact retrieval effectiveness.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example XY-PHOC Embedding. Each level numbering indicates the number of regions in the horizontal and vertical ( ′ ) directions. A symbol belongs to regions intersected by a horizontal line placed at the vertical center of its bounding box, where the horizontal line spans the width of the bounding box (vertical-center). For space, symbol region memberships are shown using sets.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure2: Example 4-level XY-PHOC Embedding, with 19-bit Vectors for each Symbol. Levels are encoded by increasing numbers of regions: the first bit represents Level 1 (whole formula), the next two bits Level 2 (horizontal), followed by Level 2 ′ (vertical), and so on. Levels are illustrated in Figure1.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>ARQMath Task 2 results. The run with the highest nDCG ′ (in ARQMath-1) is shown for each team. For TU_DBS team, we considered the run available on both ARQMath-1 and -2.</figDesc><table><row><cell>Evaluation Measures</cell><cell></cell></row><row><cell>2020</cell><cell>2021</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 2</head><label>2</label><figDesc>Examples of ARQMath 2020 queries (B.27 and B.29) for which the XY-PHOC had nDCG ′ @5 of 1. All the retrieved formulas have relevance score of High.</figDesc><table><row><cell>Query B.27 Query B.29</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3</head><label>3</label><figDesc>Examples of ARQMath 2020 queries (B.2 and B.32) for which the XY-PHOC had nDCG ′ @5 of 0. Results shown after deduplication and removing unassessed hits.</figDesc><table><row><cell></cell><cell>Query B.2</cell><cell>Query B.32</cell></row><row><cell cols="2">Rank 𝑑𝑓 𝑑𝑥 = 𝑓 (𝑥 + 1)</cell><cell>𝐸𝑚𝑝𝑡𝑦(𝑥) ⇐⇒ ̸ ∃𝑦(𝑦 ∈ 𝑥)</cell></row><row><cell>1</cell><cell>𝑑𝑓 (𝑥)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4</head><label>4</label><figDesc>XY-PHOC Results on ARQMath 2020 Queries without Removing Unrated Formulas (nDCG ′ @5 = 0).</figDesc><table><row><cell>Query B.2</cell><cell>Query B.32</cell></row><row><cell>Rank 𝑑𝑓 𝑑𝑥 = 𝑓 (𝑥 + 1)</cell><cell>𝐸𝑚𝑝𝑡𝑦(𝑥) ⇐⇒ ̸ ∃𝑦(𝑦 ∈ 𝑥)</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 5</head><label>5</label><figDesc>Examples of ARQMath 2020 queries (B.12 and B.14) for which the XY-PHOC with nDCG ′ &gt; 0.7. Ratings of 3 are high, 2 are medium, 1 are low.</figDesc><table><row><cell>Query B.12</cell><cell>Query B.14</cell></row><row><cell>Rank (1 + 𝑖</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head></head><label></label><figDesc>Autocompletion Results for Different Symbol Input Orders. Shown is the percentage of target formula symbols entered vs the Mean Reciprocal Rank (MRR) across target formulas.image, using handwriting, or manually entering/placing symbols. Another is interactively selecting symbols in displayed formulas to define queries such as shown in Figure2. To achieve this, we plan to integrate our XY-PHOC based autocompletion into the multimodal MathDeck search interface<ref type="bibr" target="#b11">[12]</ref>.4   </figDesc><table><row><cell></cell><cell>0.8 1.0</cell><cell>Left-to-Right Right-to-Left Outside-in Middle-out</cell></row><row><cell></cell><cell>0.6</cell></row><row><cell>MRR</cell><cell>0.4</cell></row><row><cell></cell><cell>0.2</cell></row><row><cell></cell><cell>0.0</cell></row><row><cell></cell><cell></cell><cell>Percentage of total symbols</cell><cell>1.0</cell></row><row><cell>Figure 3:</cell><cell></cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://www.mathjax.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://spark.apache.org</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">https://redis.io</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">https://mathdeck.org</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>Our thanks to Wei Zhong for notes from his early investigations into formula autocompletion, and to Yancarlos Diaz and the RIT DPRL lab for helpful discussions. This material is based upon work supported by the Alfred P. Sloan Foundation under Grant No. G-2017-9827 and the National Science Foundation (USA) under Grant No. IIS-1717997. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the National Science Foundation or the Alfred P. Sloan Foundation.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Advancing math-aware search: The arqmath-2 lab at clef 2021</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Information Retrieval</title>
				<imprint>
			<publisher>Springer</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Experimental ir meets multilinguality, multimodality, and interaction</title>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Twelfth International Conference of the CLEF Association (CLEF 2021) (????</title>
				<meeting>the Twelfth International Conference of the CLEF Association (CLEF 2021) (????</meeting>
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<title level="m">Working Notes of CLEF 2021 -Conference and Labs of the Evaluation Forum</title>
				<imprint/>
	</monogr>
	<note>????</note>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Spatial vs. Graph-Based Formula Retrieval</title>
		<author>
			<persName><forename type="first">R</forename><surname>Avenoso</surname></persName>
		</author>
		<ptr target="https://scholarworks.rit.edu/theses/10784" />
		<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
		<respStmt>
			<orgName>Rochester Institute of Technology</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Master&apos;s thesis</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Layout and Semantics: Combining Representations for Mathematical Formula Search</title>
		<author>
			<persName><forename type="first">K</forename><surname>Davila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<idno type="DOI">10.1145/3077136.3080748</idno>
		<idno>doi:</idno>
		<ptr target="10.1145/3077136.3080748" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</title>
				<meeting>the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval<address><addrLine>Shinjuku Tokyo Japan</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1165" to="1168" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Tangent-CFT: An Embedding Model for Mathematical Formulas</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Rohatgi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<idno type="DOI">10.1145/3341981.3344235</idno>
		<idno>doi:10.1145/3341981.3344235</idno>
		<ptr target="http://doi.org/10.1145/3341981.3344235" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR &apos;19</title>
				<meeting>the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR &apos;19<address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="11" to="18" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sudholt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Fink</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICFHR.2016.0060</idno>
		<ptr target="http://ieeexplore.ieee.org/document/7814076/.doi:10.1109/ICFHR.2016.0060" />
	</analytic>
	<monogr>
		<title level="m">2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR)</title>
				<meeting><address><addrLine>Shenzhen, China</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="277" to="282" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Evaluating Word String Embeddings and Loss Functions for CNN-Based Word Spotting</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sudholt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">A</forename><surname>Fink</surname></persName>
		</author>
		<idno type="DOI">10.1109/ICDAR.2017.87</idno>
	</analytic>
	<monogr>
		<title level="m">14th IAPR International Conference on Document Analysis and Recognition (ICDAR)</title>
				<imprint>
			<date type="published" when="2017">2017. 2017</date>
			<biblScope unit="volume">01</biblScope>
			<biblScope unit="page" from="493" to="498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Query evaluation: Strategies and optimizations</title>
		<author>
			<persName><forename type="first">H</forename><surname>Turtle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Flood</surname></persName>
		</author>
		<idno type="DOI">10.1016/0306-4573(95)00020-H</idno>
		<ptr target="https://doi.org/10.1016/0306-4573(95)00020-H" />
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="831" to="850" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Faster top-k document retrieval using block-max indexes</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Suel</surname></persName>
		</author>
		<idno type="DOI">10.1145/2009916.2010048</idno>
		<ptr target="https://doi.org/10.1145/2009916.2010048.doi:10.1145/2009916.2010048" />
	</analytic>
	<monogr>
		<title level="m">Proceeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011</title>
				<editor>
			<persName><forename type="first">W</forename><surname>Ma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Nie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Baeza-Yates</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Chua</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><forename type="middle">B</forename><surname>Croft</surname></persName>
		</editor>
		<meeting>eeding of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011<address><addrLine>Beijing, China</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">July 25-29, 2011. 2011</date>
			<biblScope unit="page" from="993" to="1002" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Overview of ARQMath 2020: CLEF Lab on Answer Retrieval for Questions on Math</title>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">W</forename><surname>Oard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<idno type="DOI">10.1007/978-3-030-58219-7_15</idno>
		<ptr target="http://link.springer.com/10.1007/978-3-030-58219-7_15.doi:10.1007/978-3-030-58219-7_15" />
	</analytic>
	<monogr>
		<title level="m">Experimental IR Meets Multilinguality, Multimodality, and Interaction</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Arampatzis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">E</forename><surname>Kanoulas</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Tsikrika</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Vrochidis</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Joho</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Lioma</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Eickhoff</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Névéol</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<meeting><address><addrLine>Cham</addrLine></address></meeting>
		<imprint>
			<publisher>Springer International Publishing</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">12260</biblScope>
			<biblScope unit="page" from="169" to="193" />
		</imprint>
	</monogr>
	<note>series Title: Lecture Notes in Computer Science</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">The mathdeck formula editor: Interactive formula entry combining latex , structure editing, and search</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Diaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Nishizawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mansouri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Davila</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zanibbi</surname></persName>
		</author>
		<idno type="DOI">10.1145/3411763.3451564</idno>
		<idno>doi:10.1145/ 3411763.3451564</idno>
		<ptr target="https://doi.org/10.1145/3411763.3451564" />
	</analytic>
	<monogr>
		<title level="m">CHI &apos;21: CHI Conference on Human Factors in Computing Systems, Virtual Event</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Kitamura</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Quigley</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Isbister</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Igarashi</surname></persName>
		</editor>
		<meeting><address><addrLine>Yokohama Japan</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2021">May 8-13, 2021. 2021</date>
			<biblScope unit="volume">192</biblScope>
			<biblScope unit="page">5</biblScope>
		</imprint>
	</monogr>
	<note>Extended Abstracts</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Search Engines: Information Retrieval in Practice</title>
		<author>
			<persName><forename type="first">B</forename><surname>Croft</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Metzler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Strohman</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>Addison-Wesley Publishing Company</publisher>
			<pubPlace>USA</pubPlace>
		</imprint>
	</monogr>
	<note>1st ed</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
