<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Scaling Feature Based Mathematical Search Engine for Real-World Document Sets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jozef Misutka</string-name>
          <email>misutka@ksi.mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Software Engineering, Charles University in Prague</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>There have been several interesting approaches to mathematical searching described in the last few years. We decided to implement another mathematical search engine, building on the work by Ma et al. described in the paper "Feature Extraction and Clustering-based Retrieval for Mathematical Formulas". We have extended the original algorithms proposed by Ma et al. and implemented them using EgoMath's formula parse trees together with its normalisation and ordering algorithms. The approach has been studied and evaluated on a real-world document set, Wikipedia.org. We introduced several improvements and tested three different versions of the original algorithm, with interesting results.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>There have been several interesting approaches to mathematical searching
described in the last few years. We decided to implement another mathematical
search engine (MSE), building on the work by Ma et al. described in the paper
"Feature Extraction and Clustering-based Retrieval for Mathematical
Formulas" [Ma+10a]. Ma et al. claim to achieve "promising results" by proposing an
"effective feature extraction approach", supporting their claims with an
evaluation done on 884 formulae. We chose the feature based extraction algorithms
proposed in the paper because we think that feature based similarity research
has potential and it has not yet been applied to a real-world document set. We will
refer to our implementation of the feature based algorithm search engine as FBA.</p>
      <p>We have extended the original algorithms proposed by Ma et al. and
implemented them using EgoMath's formula parse trees and the normalisation
and ordering algorithms from the EgoMath mathematical search engine [Mi08b].
The approach has been studied and evaluated on a real-world document set,
Wikipedia.org. We introduced several improvements and tested three different
versions of the original algorithm, with several interesting results.</p>
      <p>We will begin with a very brief description of the algorithms proposed by
Ma et al. Then, we will describe our contributions to the original algorithms in
detail, including the evaluation and the actual implementation.</p>
    </sec>
    <sec id="sec-2">
      <title>Original Algorithm</title>
      <p>The originally proposed solution relies on the Presentation MathML (MML)
format, which is represented as a tree. Two types of features are extracted from
a formula: 1) structural and 2) semantic ones. Semantic features are extracted
by traversing the formula tree in preorder. The extracted items are operators
(represented as &lt;mo&gt; in MML), functions inside identifiers (the &lt;mi&gt; MML tag)
and nodes (except the &lt;mrow&gt; MML tag) whose children contain
identifiers (&lt;mi&gt; MML tag). Structural features are obtained from nodes
containing semantic features with depth (in the formula tree) greater than 1. The string
obtained by concatenating the names of a node's parent nodes represents its structural
meaning.</p>
      <p>Let us consider the formula ∫ 5e^(2x) − 1 with MML representation (shown on
the left side in Fig. 1) consisting of these tags and their appropriate values:
mo: ∫, −; mn: 5, 2, 1; mrow; msup; mi: e, x. We obtain these semantic features
by using the semantic feature extraction algorithm from the original paper: ∫
(mo), e (mi + exponential function), msup (identifier in children), − (mo). After
extracting the semantic features, we can use the second algorithm to extract the
structural features, which are ("=" is used as a separator between individual node
names): msup=mrow=− (− is the semantic feature and the rest was obtained by
concatenating the values by recursively visiting the parents) and msup=e (incorrectly
written in the original paper as e=msup).</p>
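      <p>The preorder semantic extraction described above can be sketched as follows. This is a minimal illustration under assumptions of our own: the Node class, the operator set and the list of known functions are invented for the example and are not part of EgoMath or the original algorithm.</p>

```python
# Illustrative sketch of semantic feature extraction over a toy
# formula parse tree; the Node class and the operator/function
# sets below are assumptions made for this example only.

OPERATORS = {"int", "+", "-", "*", "^"}       # plays the role of mo nodes
KNOWN_FUNCTIONS = {"e", "sin", "cos", "log"}  # mi values treated as functions

class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def semantic_features(node, features=None):
    """Preorder walk collecting operators, known functions and
    nodes whose children contain plain identifiers."""
    if features is None:
        features = []
    if node.value in OPERATORS:
        features.append(node.value)
    if node.value in KNOWN_FUNCTIONS:
        features.append(node.value)
    # a node owning an identifier child is itself a semantic feature
    # (this rule may fire in addition to the operator rule above)
    if any(child.value.isalpha() and child.value not in KNOWN_FUNCTIONS
           for child in node.children):
        features.append(node.value)
    for child in node.children:
        semantic_features(child, features)
    return features

# toy tree for e^(2x): ^ with children e and * -> (2, x)
tree = Node("^", [Node("e"), Node("*", [Node("2"), Node("x")])])
print(semantic_features(tree))
```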
      <p>The last set of features represents constants and variables. The original paper
claims that "Unlike operators or functions, both number constants and variables
are not representative, so their semantic features may not be meaningful" and
that "It is obvious that exact value representations are not meaningful, e.g.,
12 or x, for formula search purposes". These are rather strong and subjective
claims, which we discuss in more detail below. The real values of the numbers and
constants are replaced by the var and cn keywords, and the same algorithm as for
the extraction of the structural features is used. The resulting features representing
constants and variables are msup=mrow=cn (the original paper mistakenly states
that this feature was produced by msup=mrow=5, but it could only have been
produced by either msup=mrow=2 or msup=mrow=1) and msup=mrow=var.</p>
      <p>Extracted features are transformed into a feature vector using the tf-idf
weighting scheme and normalised using cosine normalisation. First, the weights
are computed using the term frequency, defined as</p>
      <p>tf(feature, formula) = (# of occurrences of the feature in a formula) / max{# of occurrences of any feature in the formula}</p>
      <p>and the document frequency, defined as</p>
      <p>idf(feature, formula set) = log((# of formulae in the formula set) / (# of formulae which have the particular feature)).</p>
      <p>The resulting vector F is normalised to F_norm with</p>
      <p>F_norm = (1 / sqrt(sum_{i=0}^{n} f_i^2)) F,</p>
      <p>where n is the number of features and f_i is the feature at a particular position.
For a small document set, the feature vector can contain all distinct extracted
features from all formulae. These normalised feature vectors are used for
clustering. The similarity between a query formula and a formula from the document
set, represented by a normalised feature vector, is the well-known cosine
similarity. The original paper evaluated three different clustering algorithms for
retrieval: K-means, self-organising maps and agglomerative hierarchical
clustering. However, we will not use clustering because it does not directly add to the
mathematical awareness. The original evaluation was done on 884 formulae with
20 training and 20 test samples. A more detailed description of the extraction
algorithms is available in the original paper [Ma+10a].</p>
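      <p>The weighting and normalisation above can be sketched in a few lines. This is an illustrative sketch over a toy three-formula corpus, not the actual FBA implementation.</p>

```python
import math

# Sketch of tf-idf weighting with cosine normalisation over feature
# lists; the tiny corpus below is illustrative only.

def tf(feature, formula_features):
    counts = {}
    for f in formula_features:
        counts[f] = counts.get(f, 0) + 1
    return counts.get(feature, 0) / max(counts.values())

def idf(feature, formula_set):
    having = sum(1 for features in formula_set if feature in features)
    return math.log(len(formula_set) / having)

def normalised_vector(formula_features, vocabulary, formula_set):
    raw = [tf(f, formula_features) * idf(f, formula_set) for f in vocabulary]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm > 0 else raw

def cosine(u, v):
    # both vectors are already unit-length, so the dot product suffices
    return sum(a * b for a, b in zip(u, v))

formulas = [["int", "e", "^"], ["e", "^", "*"], ["sin", "cos", "+"]]
vocab = sorted({f for fs in formulas for f in fs})
vectors = [normalised_vector(fs, vocab, formulas) for fs in formulas]
print(cosine(vectors[0], vectors[1]))
```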
    </sec>
    <sec id="sec-3">
      <title>Modifications</title>
      <p>The contribution of the original paper is clear. On the other hand, we must point
out several problems we encountered which question the feasibility of applying the original
approach to larger document sets. We try to solve these issues and propose a
modified version of the algorithms.</p>
      <p>We evaluate the approach on a real-world document set, Wikipedia.org. The
Wikipedia.org document set contains formulae encoded in LaTeX. Each formula
is parsed into the formula tree using the EgoMath library and the ordering
algorithm is applied. Features are then extracted from the formula tree.</p>
      <p>It should also be noted that the example in the original paper uses an
optimistic representation of the original formula ∫ 5e^(2x) − 1 in the MML presentation
format. Often, automatic converters add more &lt;mrow&gt; elements, which makes
the extracted features more complex, and many visually equal formulae have
different features. By using normalised EgoMath formula trees we avoid most of
the notation differences, which clearly improves the quality in comparison with
the original algorithm. The difference between the MML representation and the
formula tree is illustrated in Fig. 1.</p>
      <p>We use Wikipedia.org as the document set for our evaluation and reasoning
because it contains common mathematical knowledge and is large enough.
Formulae have been extracted using the EgoMath mathematical search engine. The
original algorithm produced 271,103 unique features using formula trees over the
whole of Wikipedia.org. The precision of this approach on the evaluation queries
was very poor, with most of the evaluated queries having near zero P@5 precision
(precision, |{relevant docs} ∩ {retrieved docs}| / |{retrieved docs}|, takes all
retrieved documents into account, but it can also be evaluated considering only the
top n results returned; see Section 4). The reason is that variables and constants
were considered unimportant. On the other hand, 863,242 features were extracted
when the variables and constants were not replaced by keywords. This was the
motivation to introduce limits on the extraction algorithms to lower the number
of unique features.</p>
      <p>The overall idea is the same as in the original paper. The extraction and
concatenation of the names of parent nodes is replaced by a more complex
algorithm which: skips binary functions with the same node depth (e.g., in a + b − c
all leaves have only one parent whose value is extracted), can remove common
nodes like "*", can limit the number of parents traversed, and can replace numbers
and variables with keywords. 23 out of the 50 most used features contained "*".
We added many "*" nodes through the ab to a*b normalisation; because of this, and
because of the limits set on the feature extraction depth, we added the option to
skip "*" nodes when extracting the values from parents.</p>
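      <p>The modified parent-path traversal can be sketched as follows: walk from a node toward the root, optionally skipping "*" nodes and parents with the same depth (same operator priority), and stop after a limited number of parents. The Node layout and the depth convention are assumptions made for this sketch, not EgoMath's actual data structures.</p>

```python
# Sketch of the modified structural feature extraction: build the
# root-first "="-separated parent path for a node, with optional
# skipping of "*" nodes and of same-depth (same-priority) parents,
# and a cap on the number of parents visited.

class Node:
    def __init__(self, value, depth, parent=None):
        self.value = value
        self.depth = depth
        self.parent = parent

def structural_feature(node, max_parents=3, skip_star=True):
    parts = [node.value]
    seen_depths = {node.depth}
    current = node.parent
    steps = 0
    while current is not None and steps != max_parents:
        if skip_star and current.value == "*":
            current = current.parent      # common "*" nodes are skipped
            continue
        if current.depth in seen_depths:  # same priority: skip
            current = current.parent
            continue
        parts.append(current.value)
        seen_depths.add(current.depth)
        steps += 1
        current = current.parent
    return "=".join(reversed(parts))      # root-first path

# toy chain: int (depth 0) -> * (1) -> ^ (2) -> x (3)
root = Node("int", 0)
star = Node("*", 1, root)
power = Node("^", 2, star)
x = Node("x", 3, power)
print(structural_feature(x))
```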
      <p>We define three algorithms with these settings: 1) FBA 1 skips "*" when
traversing and replaces constants and variables; 2) FBA 2 skips "*"; 3) FBA 3
neither replaces any values nor skips "*".</p>
      <p>The final semantic features extracted from ∫ 5e^(2x) − 1 by FBA 3 are "∫, *, e,
^ (upper index or msup in MML), ! (unary −), +, −" (in comparison to "∫,
e, ^, −" from the original paper). The final structural features extracted are
int=*, int=*=e, int=*=e=^, e=^=*=+=!, *=e=^=+, e=^=*=+=*, int=*=5, int=*=e,
^=*=+=!=1, e=^=*=2, e=^=*=x (in comparison to ^=e, ^=mrow=−, ^=mrow=cn,
^=mrow=var from the original paper).</p>
      <p>The algorithm for extracting numbers and variables visits all nodes and, if
the node is either a constant, a number or a variable, extracts a path similar to
the structural extraction. The feature is then appended to the feature vector. In
contrast to the original algorithm, we also extract features from nodes with a
depth equal to 1 because, in the formula tree, a root node is part of the formula.</p>
      <p>We claim that our changes and the usage of EgoMath's formula trees improve
the quality of the extraction algorithms in general. We base our claim on
several case studies and show several top results of different queries. Firstly,
formulae having only one element, like 5 or sin, would have zero features in the
original algorithm, but in our approach they have one feature, the node value
(or replaced keyword) itself. Mathematically equal (in common mathematical
structures) formulae −2 + a and a − 2 have different feature sets in the
original algorithm but equal feature sets in our algorithm because of the ordering
algorithm being applied (the original algorithm would use different MathML
representations, e.g., an additional + sign in the first formula). The same holds for 2 * x
and 2x due to the normalisation algorithms used. Furthermore, 1 − x and x − 1 have
equal feature vectors in the original algorithm but, correctly, different ones in
FBA because of the more mathematically precise formula trees. We overcome the
MML representation, which does not respect operator priority and therefore
introduces additional mrow elements, by using the formula trees and skipping same
priority operators when traversing to the root node.</p>
      <p>The main algorithm works with a vector of numbers where each position
represents a different feature. Finding the relevant features which should be used
to represent every formula in the document set requires either the processing of
the whole document set or at least of a representative subset. Afterwards, a set of
U distinct features is selected and is used to represent each formula.</p>
      <p>According to the statistics of the Wikipedia.org document set, 90% (approx.
261,618) of formulae are shorter than 100 characters (including spaces, brackets
etc.) and 70% are shorter than 50. Despite this fact, the number of unique
features produced by the original feature extraction algorithms over EgoMath's formula
trees was in the hundreds of thousands, as shown in Fig. 2.</p>
      <p>In the first run, we try to reduce the number of unique features by processing
only formulae smaller than M characters. Furthermore, the average numbers of
features extracted for the three feature types (structural, semantic, and variables
and constants) were 8.2, 5.8 and 8.5 respectively, which gives the first estimation
for the representative feature count. We also limit the maximum
number of extracted features of each type by N. Finally, the maximum number
of parents visited while concatenating the node names is limited by O. A
depth limit set to three means that the formula depth is limited to four nodes
in child-parent relations, but they must also have different depths (otherwise,
they would be skipped). All formulae with a bigger depth have some features
stripped. A different depth means that the operators on the path to the root node
in the formula tree have different priorities.</p>
      <p>The number of features extracted with different parameters is depicted in
Fig. 2. Finally, we decided to further evaluate M = 100, N = 20 and O = 3
because the number of features considering all FBA algorithms was acceptable
compared to the others and was more than double the average number of features.
The total number of extracted features using these settings with FBA100,20,3 3
(meaning FBA 3 with M = 100, N = 20 and O = 3) was 239,675, but only 28,539
using FBA100,20,3 1. The FBA100,20,3 1 number of features (28,539) is
significantly lower than the number of features extracted by the original algorithm
without bounds (271,103).</p>
      <p>We take the first 10,000 features sorted by the count of occurrences in the
whole document set. It is reasonable to do so because the number of occurrences
of the first 30 features dramatically falls, after which the features become relevant.
Starting from the 29th feature (sorted by the number of occurrences), the occurrence
count is in the order of thousands; starting from the 412th, in the order of
hundreds; and starting from the 3346th, in the order of tens. We have experimented
with lower feature counts (100, 1,000, 5,000) but the precision of the results was
smaller. Because we focus on applicability, we will not consider feature counts
above 10,000.</p>
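      <p>Selecting the most frequent features over the document set can be sketched as follows; the tiny corpus and the value of k are illustrative only (above, k = 10,000).</p>

```python
from collections import Counter

# Sketch of picking the k most frequent features over the whole
# document set; each formula contributes its extracted feature list.

def select_top_features(formula_feature_lists, k):
    counts = Counter()
    for features in formula_feature_lists:
        counts.update(features)
    # most_common returns (feature, count) pairs sorted by count
    return [feature for feature, _ in counts.most_common(k)]

corpus = [
    ["int", "e", "^"],
    ["e", "^", "*"],
    ["e", "+", "cos"],
]
print(select_top_features(corpus, 2))
```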
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Results</title>
      <p>We will use mathematical documents from Wikipedia.org as extracted using the
EgoMath mathematical search engine. This document set contains approx. 28,100
mathematical articles with 290,872 unique formulae, out of which 261,618 are
valid for feature extraction, having textual length smaller than 100. Out of these,
258,404 have non-empty feature sets.</p>
      <p>The FBA algorithms returned many equal formulae with the same high similarity
at the beginning of the result list because of the EgoMath normalised parse trees,
e.g., "\pi = c/d" and "\pi = \frac{c}{d}". These results are treated as one.</p>
      <p>The evaluation was done by performing a set of selected queries. Because
Wikipedia.org very likely contains many common knowledge subjects, we tried to
compile a list of known and famous formulae not restricted by our own research
background. We found a web page called "Famous Equations and Inequalities",
which we used as the basis. We have evaluated the whole set of queries from the
site but selected a smaller set, shown below. The top-5 results of the original
algorithm (but based on EgoMath formula trees), FBA100,20,3 1 and FBA100,20,3 3
are listed side by side below.</p>
      <p>[Table: top-5 results of the original algorithm (based on EgoMath formula trees), FBA100,20,3 1 and FBA100,20,3 3, listed side by side for each evaluation query; the table layout could not be recovered from the source. The recoverable queries include e^(i*pi) = −1, the Taylor series sum_{n=0}^{inf} f^(n)(a)/n! (x − a)^n, cos + sin, cos^2(x) + sin^2(x), E = mc^2 and Ax = λx, with results such as sin^2(X) + cos^2(X) = 1, cos^2 + sin^2 = 1, cos(t)^2 + sin(t)^2 = 1, E0 = mc^2, E0 = m0c^2, Erest = mc^2, Erest = E0 = mc^2, [A][x] = λ[x] and T(x) = sum_{n=0}^{inf} f^(n)(x0)/n! (x − x0)^n.]</p>
      <p>The list of famous formulae is available at http://www.math.utah.edu/~pa/math/equations/equations.html (seen March, 2012).</p>
      <p>The problematic queries are those which rely on the actual values of
elements, because there are too many structurally equal ones. This issue is fixed by
FBA100,20,3 2 and FBA100,20,3 3, and, as can be seen, they outperform FBA100,20,3
1 because they use more relevant features. Only query 3 proved to be
problematic, because it is too simple and the relevant features were not in the selected
10,000.</p>
      <p>The results are surprisingly good for FBA100,20,3 2 and 3 given the
fact that only around 4% of all features were used to characterise each formula.
However, we must stress that there can be cases when a formula has important
features not present in the first 10,000 features, as in query 3 (or even has none
at all).</p>
      <p>There are several interesting results. Firstly, variable names are
considered unimportant when the structure is similar, e.g., FBA100,20,3 3 answers the query
sum_{n=0}^{inf} f^(n)(a)/n! (x − a)^n with the result T(x) = sum_{n=0}^{inf}
f^(n)(x0)/n! (x − x0)^n, and the query cos^2(x) + sin^2(x) with the results
cos^2 + sin^2 = 1 and cos(t)^2 + sin(t)^2 = 1. The same applies to subscripts,
e.g., the query E = mc^2 with the results E0 = m0c^2 and Erest = mc^2.
It is also interesting that searching for e^(i*pi) = −1 returned e^(i*pi) = −1 + 0i.</p>
    </sec>
    <sec id="sec-5">
      <title>Contribution and Conclusion</title>
      <p>We have presented an improved implementation of the feature based search
engine originally proposed by Ma et al., which can search for mathematical
formulae. We have shown that the original algorithm extracts too many non-relevant
features and introduced boundaries which limit the number of extracted features.
We found out that, for our testing queries, using 4% of all features extracted by
FBA100,20,3 3 was enough to obtain interesting results on our testing
document set. Furthermore, we have shown that relying on EgoMath's formula trees
rather than on Presentation MML makes it possible to find more relevant
formulae.</p>
      <p>Acknowledgements. This work was supported by the grant SVV-2013-267312.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Ma+10a]
          <string-name>
            <surname>Ma</surname>
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hui</surname>
            <given-names>S. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Feature extraction and clustering-based retrieval for mathematical formulas</article-title>
          .
          <source>In: Software Engineering and Data Mining (SEDM)</source>
          , pp.
          <volume>372</volume>
          –
          <issue>377</issue>
          ,
          IEEE, (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Mi08b]
          <string-name>
            <surname>Misutka</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galambos</surname>
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>System Description: EgoMath2 As a Tool for Mathematical Searching on Wikipedia.org</article-title>
          ,
          <source>LNCS 6824</source>
          , pp.
          <volume>307</volume>
          –
          <issue>309</issue>
          , Springer, Berlin Heidelberg (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>