<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Modelling String Structure in Vector Spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Richard Connor</string-name>
          <aff>Division of Mathematics and Computing Science, University of Stirling, Scotland</aff>
          <email>richard.connor@stir.ac.uk</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Information Science and Technologies (ISTI), CNR</institution>
          ,
          <addr-line>Pisa</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Computer Science, University of St Andrews</institution>
          ,
          <addr-line>St Andrews, Scotland</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Searching for similar strings is an important and frequent database task, both in terms of human interactions and in absolute worldwide CPU utilisation. A wealth of metric functions for string comparison exist. However, with respect to the wide range of classification and other techniques known within vector spaces, such metrics allow only a very restricted range of techniques. To counter this restriction, various strategies have been used for mapping string spaces into vector spaces, approximating the string distances within the mapped space and therefore allowing vector space techniques to be used. In previous work we have developed a novel technique for mapping metric spaces into vector spaces, which can therefore be applied for this purpose. In this paper we evaluate this technique in the context of string spaces, and compare it to other published techniques for mapping strings to vectors. We use a publicly available English lexicon as our experimental data set, and test two different string metrics over it for each vector mapping. We find that our novel technique considerably outperforms the previously used techniques in preserving the actual distance.</p>
      </abstract>
      <kwd-group>
        <kwd>Metric Mapping</kwd>
        <kwd>n-Simplex projection</kwd>
        <kwd>Pivoted embedding</kwd>
        <kwd>String</kwd>
        <kwd>Jensen-Shannon distance</kwd>
        <kwd>Levenshtein distance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Searching over strings has been a central activity of database systems since
their inception. In addition to their use in documents composed of character
sequences, strings are commonly used to encode the structure of molecules, DNA
sequences, product descriptions and many other objects drawn from the real
world. The need to search for similar strings arises due to differences in
encoding and spelling, textual descriptions being slightly different, and the need to
find partial matches. In addition to common tasks such as Web searching, string
similarity also plays an important role in many real-world database tasks
including deduplication and cleaning. Similarity search is also important in the context
of record linkage, where similar representations of real-world entities require
unification. Searching for similar strings is therefore an important and frequent
database task both in terms of human interactions and in absolute worldwide
CPU utilisation.</p>
      <p>Copyright © 2019 for the individual papers by the papers' authors. Copying
permitted for private and academic purposes. This volume is published and copyrighted by
its editors. SEBD 2019, June 16-19, 2019, Castiglione della Pescaia, Italy.</p>
      <p>
        A wealth of metric functions for string comparison exist. In previous work,
we have identified the n-Simplex projection as a way of mapping a class of metric
spaces into lower-dimensional Euclidean spaces with better indexing properties
[
        <xref ref-type="bibr" rid="ref1 ref5 ref6">1, 5, 6</xref>
        ]. Previously we have applied this mapping only to the problem of
similarity search: that is, finding those objects which are most similar to a given query
object from a very large space. As part of this work we have identified a subclass
of proper metrics, called supermetric, which are amenable to our techniques.
This class includes those spaces which have the so-called four-point property
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which in turn includes all metric spaces which can be isometrically
embedded in Hilbert space. Furthermore, any Hilbert-embeddable space has a deeper
property, in that any finite set of n points can be isometrically embedded in an
(n − 1)-dimensional Euclidean space.
      </p>
      <p>So far, we have examined only metrics and data within R^n spaces. Here,
we extend the work beyond metric indexing, and to non-numeric spaces for the
first time. Our techniques allow certain string spaces to be represented in vector
spaces in such a way that the distances are maintained with various degrees
of approximation. Our main contribution is to explain the mapping from the
string space to the vector space. Based on a sound mathematical framework,
we demonstrate that this mapping into vector spaces is significantly better than
those previously known.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>Our domain of interest is string spaces, by which we refer to a (typically large)
finite set of character strings S, and a distance function d : S × S → R+ which
compares them. With only a little loss of generality we restrict our interest
further to proper metrics which are common over strings. Such spaces are very
important in many domains; not just where strings themselves are the objects
of interest (e.g., natural language analysis), but also where another semantic
domain is represented in string form (e.g., genome analysis).</p>
      <p>
        Various machine learning techniques have shown great success in the analysis
and classification of natural language terms, based on a very large supervised or
unsupervised input, which gives a vector space mapping reflecting the semantics
of the original text [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ]. In such cases, the vector space acts as a semantic
proxy space, in that terms with similar meaning will be close when measured
in the vector space. In our context, we are interested only in the structural
similarity of strings, as measured by metrics such as edit distance. The intent is
to map into a vector space while preserving the original string distances as far as
possible. There are some known techniques for mapping string spaces into vector
spaces, whilst attempting to preserve the integrity of the distance function, but
none of these appear to be particularly effective.
      </p>
      <p>Table 1: The nearest few strings within the SISAP lexicon to the tokens
"string", "metric", "simplex" and "error", under Levenshtein (lev) and
Jensen-Shannon (jsd) distance.
"string"             "metric"
lev        jsd       lev       jsd
string     string    metric    metric
strings    stringing matric    meteoric
surfing    strings   eric      metricize
tin        striking  dearie    metrical
ringing    stirring  petted    symmetric
briefing   staring   makeup    mimetic
"simplex"            "error"
lev        jsd       lev       jsd
simples    simple    error     error
dimpled    simples   terror    terror
triplex    simpler   mirror    err
bingley    simplest  frog      borrower
names      pimples   racy      emperor
jugular    simply    petted    horror</p>
      <p>This paper presents a mechanism for transforming metric string spaces to
vector spaces, which can be applied whenever the distance function d is a proper
metric; this includes most edit distances.
</p>
      <sec id="sec-2-1">
        <title>String Metrics</title>
        <p>In the literature, many string distance functions have been proposed to measure
the dissimilarity between two text strings. In this work we restrict our attention
to the Levenshtein Distance and Jensen-Shannon distance as explained below.
These two metrics are representative of the wide range of more or less
sophisticated metrics available. For our purposes, we also note a deep mathematical
property which is important to our context: the Jensen-Shannon distance is
isometrically Hilbert embeddable, whereas the Levenshtein distance is not. The
importance of this will be made clear later in the text.</p>
        <p>
          Levenshtein Distance: the best-known edit distance between strings, defined
as the minimum number of single-character edit operations (insertions,
deletions or substitutions) needed to transform one string into the other. For
example, the distance between "error" and "horror" is 2 (one insertion and
one substitution), and the distance between "writing" and "writer" is 3 (one
deletion, two substitutions). Note that the Levenshtein distance dLev between
two strings of length m and n is at least |m − n|, and at most max(m, n).
Jensen-Shannon Distance over Strings: is derived from a distance metric
defined over labelled trees [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. In outline, the technique of digram shingling
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is used to transform each character string into a very high-dimensional
probability space, where each individual character and each character
digram is represented as a different event; thus a string represented in a
127-character set will be represented in a space comprising 127² + 127 different
events. Each such event is assigned a frequency according to its appearance
in the token, including notional start and finish characters, denoted ⊢ and
! in the following example. For example, the string "ann" is composed of
the characters and digrams "⊢a", "an", "nn", "n!", "a", and "n", with
corresponding frequencies 1/6, 1/6, 1/6, 1/6, 1/6, and 1/3.
        </p>
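        <p>The Levenshtein examples above can be checked with a short dynamic-programming sketch; this is a textbook rendering, not the paper's code base, and the function name is ours:</p>

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance:
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion from a
                            curr[j - 1] + 1,              # insertion into a
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]
```

        <p>On the worked examples: levenshtein("error", "horror") gives 2 and levenshtein("writing", "writer") gives 3, and the result always lies between |m − n| and max(m, n) for strings of length m and n.</p>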
        <p>Jensen-Shannon distance is then applied over these probability ensembles (see Footnote 4). We
recall that the Jensen-Shannon distance between two probability vectors
x, y is defined as dJS(x, y) = √JSD(x, y), where JSD(x, y) is the
Jensen-Shannon divergence.</p>
        <p>JSD(x, y) = 1 − (1/2) Σ_i [ (x_i + y_i) log2(x_i + y_i) − x_i log2 x_i − y_i log2 y_i ]   (1)
To show a characterisation of these two metrics, Table 1 shows the nearest few
strings within the SISAP lexicon to an appropriate selection of tokens.</p>
      </sec>
      <sec id="sec-2-2">
        <title>String space to Vector space mappings</title>
        <p>The only information available within our context derives from the distance
function over the string domain; this is treated as a "black box" function and
therefore any approach to map the string space into a vector space must rely
only upon these distances. Usually, the distances between each string and a fixed
set of pivots (i.e., reference objects) are exploited to map the string space to a
vector space. There are however various different ways of using this information,
five of which are compared in this paper.</p>
        <p>In this section we explain three string-to-vector mappings which are
variously described in the literature, and indeed are the only methods that we know
of other than our own. To introduce a unifying terminology, we refer to these
mechanisms by the following terms: Pivoted Embedding - ℓ∞, Pivoted Embedding
- ℓ2, and LMDS - ℓ2, which we compare experimentally with two of
our own invention, referred to here as nSimplex - lwb and nSimplex - zen.</p>
        <p>In terms of the construction of a pivot set, it should be noted that in all cases
other than LMDS, the selection of n pivot objects leads to the construction
of an n-dimensional vector space; in the case of LMDS, the dimension of the
constructed space equals the number of positive eigenvalues of the matrix of the
squared distances between the pivots, which is at most n − 1.</p>
        <p>We continue by giving explanations of each of these mechanisms; the first
three in this section, the two nSimplex mappings in Section 3.</p>
        <p>Pivoted Embedding with ℓ∞. Pivoted Embedding is a metric space
transformation based on the distances between data objects and a set of pivots.
Given a metric space (U, d), and a set of pivots {p1, ..., pn}, the Pivoted
Embedding fP maps an object o ∈ U to an n-dimensional vector fP(o) =
[d(o, p1), ..., d(o, pn)]. Using the triangle inequality it can be proven that for
any o, q ∈ U

max_i |d(o, pi) − d(q, pi)|  ≤  d(o, q)  ≤  min_i (d(o, pi) + d(q, pi))   (2)

The maximum difference at any dimension is captured by the Chebyshev
(ℓ∞) distance (see Footnote 5), and therefore the distance ℓ∞(fP(o), fP(q)) is guaranteed to
be a lower bound of d(o, q). As the number of pivots increases, this
lower-bound distance can reasonably be expected to provide a better
approximation to the true distance, and this technique has been used to effect in some
metric space contexts, e.g. [
          <xref ref-type="bibr" rid="ref19 ref7">7, 19</xref>
          ].</p>
        <p>Footnote 4: We appreciate this outline description is insufficient to accurately describe the
function used in experiments; however it does succinctly give its essence. The interested
reader is referred to the publicly available code base for this paper, which contains
full source code for this metric: bitbucket.org/richardconnor/it_db_conference_
2019.</p>
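        <p>A toy sketch of the bounds in equation (2), substituting a trivial black-box metric (absolute difference on numbers) for a string metric; the names are ours:</p>

```python
def pivoted_embedding(o, pivots, d):
    # f_P(o) = [d(o, p1), ..., d(o, pn)]
    return [d(o, p) for p in pivots]

def chebyshev(u, v):
    # l_inf: the greatest difference along any coordinate dimension
    return max(abs(a - b) for a, b in zip(u, v))

def dist_bounds(o, q, pivots, d):
    # Bounds of equation (2) computed from the pivoted embeddings alone.
    fo = pivoted_embedding(o, pivots, d)
    fq = pivoted_embedding(q, pivots, d)
    lwb = chebyshev(fo, fq)                   # max_i |d(o,pi) - d(q,pi)|
    upb = min(a + b for a, b in zip(fo, fq))  # min_i  d(o,pi) + d(q,pi)
    return lwb, upb
```

        <p>With d(a, b) = |a − b| and pivots [0, 10, 25], the pair o = 3, q = 18 yields bounds 15 ≤ d(o, q) ≤ 15: the Chebyshev lower bound happens to be exact here.</p>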
        <p>
          Pivoted Embedding with ℓ2. Other distances over the same pivoted
embedding spaces have also been used. For example, in the context of string spaces
(i.e. U = S), Spillmann et al. [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] used the family of Minkowski metrics, which
encompasses the well-known Euclidean distance (ℓ2). It should be noted that
there is no underlying theory for this technique, and in particular the ℓ2
metric applied over the pivoted embedding space has no formal guarantees that
we know of with respect to the string space it is intended to model.
The authors primarily investigate the selection of pivots to best effect within
several different contexts, but make no particular justification for their choice
of vector representation - except, perhaps, that it is the most straightforward
possible mapping.
        </p>
        <p>
          LMDS. In the general metric space context, perhaps the best-known technique
is metric Multidimensional Scaling (MDS) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. MDS aims to preserve
inter-point distances. Given m objects and the distances δij between those points,
it finds a set of m points {x1, ..., xm} in a Euclidean space R^k such that
ℓ2(xi, xj) is as close as possible to δij. Those coordinates are computed using
spectral analysis of the matrix of the squared inter-point distances.
For any metric which is isometrically Hilbert embeddable, it is possible to
find a perfect ℓ2 embedding for any n objects within n − 1 Euclidean
dimensions [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. However, when the number m of data points is large, classical
MDS is too expensive in practice due to a requirement for O(m²) distance
computations and spectral decomposition of an m × m matrix.
        </p>
        <p>
          The Landmark MDS (LMDS) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is a fast approximation of MDS. LMDS
uses a set of k landmark points (i.e. pivots) to compute k × m distances
of the data points from the landmark points. It applies classical MDS to
the landmark points and uses a distance-based triangulation procedure to
project the remaining data points. Specifically, if Xk ∈ R^(k×k) is the output of
MDS on the pivot set, the embedding of a new point s into R^k is computed
as xs = −(1/2) Xk+ (δs² − δ̄²), where (δs²)i = d(s, pi)², (δ̄²)j = (1/n) Σ_{i=1..n} d(pi, pj)², and
Xk+ is the pseudo-inverse of Xk.
Footnote 5: The Chebyshev distance (ℓ∞) is a true metric defined on a vector space as the
greatest difference along any coordinate dimension.
        </p>
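        <p>A sketch of LMDS under the formulation above: classical MDS on the landmarks via eigendecomposition of the double-centred squared-distance matrix, then distance-based triangulation of a new point. Function names are ours, not from [9]:</p>

```python
import numpy as np

def mds_landmarks(D, k):
    # Classical MDS on the landmark points: spectral analysis of the
    # double-centred matrix of squared inter-landmark distances.
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]          # k largest eigenvalues
    w = np.clip(w[idx], 0.0, None)
    return V[:, idx] * np.sqrt(w)          # landmark coordinates, (n, k)

def lmds_project(X, D, d_to_landmarks):
    # Triangulate a new point s from its distances to the landmarks:
    # x_s = -1/2 * pinv(X) @ (delta_s^2 - column mean of squared distances)
    delta_s = d_to_landmarks ** 2
    delta_mean = (D ** 2).mean(axis=0)
    return -0.5 * np.linalg.pinv(X) @ (delta_s - delta_mean)
```

        <p>For data that really is Euclidean, the triangulated point reproduces its distances to the landmarks exactly, up to an isometry of the output space.</p>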
      </sec>
    </sec>
    <sec id="sec-3">
      <title>The n-Simplex Projection</title>
      <p>
        The n-Simplex projection is described in full in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]; here we give a
sketch of the approach. Although it requires a little mathematical background,
the outcome is that there exists a relatively simple and efficient mapping from
any space with the correct properties into an n-dimensional Euclidean space for
an arbitrary choice of n. The mapping is contractive, and the associated error
asymptotically approaches zero as n increases.
1. Any metric space (U, d) which is isometrically embeddable in a Hilbert space
has the so-called n-point property. This means that, for any finite
selection of n objects, an isometric embedding of just those objects exists in
(n − 1)-dimensional Euclidean space. That is, for any set {u1, ..., un} ⊆ U
there exists a mapping f : {u1, ..., un} → R^(n−1) such that d(ui, uj) =
ℓ2(f(ui), f(uj)) for any two objects ui, uj. This background work is
attributed to mathematicians such as Blumenthal, Menger and Wilson [
        <xref ref-type="bibr" rid="ref14 ref18 ref3">3,
14, 18</xref>
        ].
2. In practical terms, we have identified that metrics with this property
include: Euclidean, Cosine, Jensen-Shannon, Triangular, and Quadratic Form
distances [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this context we also note that other proper metrics such as
Chebyshev, Levenshtein, and Hamming do not possess the property.
However, for any metric space (U, d) and any n ≥ 4 there exists a constant
αn ≤ 1/2 such that for all 0 &lt; α ≤ αn the space (U, d^α) has the
n-point property [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Specifically, Blumenthal proved that α4 = 1/2, and thus
for any proper metric space (U, d), the space (U, √d) has the 4-point
property. For n &gt; 4, Deza and Maehara [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] proved that αn ≥ 0.72/n, thus for
any 0 &lt; α ≤ 0.72/n the space (U, d^α) has the n-point property. Moreover,
it is worth noting that for any finite metric space (U, d) of negative type [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
(e.g. hypermetric, ℓ1-embeddable metric, ultrametric) the space (U, √d) has
the n-point property.
3. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we give the simple corollary of the n-point property that any n
objects in the space can be used to form the vertices of a simplex in
(n − 1)-dimensional space, such that the edge lengths of the simplex are in one-to-one
correspondence with the distances measured in the original space, and
furthermore we give an algorithm to construct the Euclidean coordinates of
its vertices. This therefore gives a concrete instantiation of the existential
function f mentioned above.
4. This function is defined inductively, and at each step constructs a new apex
in k dimensions based on a new object and a simplex previously formed in
(k − 1) dimensions, called the base simplex, according to the distances between the
new object and each object represented by the vertices of the base simplex.
5. Finally, we observe that once a base simplex is formed, it may be used
to create any number of apexes based on it, using objects from the original
space. For a choice of n reference objects, each of these apexes corresponds
to a point in n-dimensional Euclidean space.
      </p>
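      <p>Steps 3-5 above can be sketched as follows; this is our own compact rendering of the inductive construction, not the code of [6]. Each new vertex is solved by forward substitution against the lower-triangular coordinates already placed, with the final coordinate recovered as the remaining altitude:</p>

```python
from math import sqrt

def new_apex(base, dists):
    # base:  already-placed vertices v_0 = origin, v_1, ..., v_{k-1},
    #        where v_i has non-zero coordinates only in dimensions 0..i-1
    # dists: distances from the new object to each base vertex
    # Returns the k coordinates of the new apex (step 4).
    k = len(base)
    x = [0.0] * k
    for i in range(1, k):
        v = base[i]
        # from |x - v_i|^2 = dists[i]^2 and |x|^2 = dists[0]^2 it follows
        # that x . v_i = (dists[0]^2 + |v_i|^2 - dists[i]^2) / 2
        target = (dists[0] ** 2 + sum(c * c for c in v) - dists[i] ** 2) / 2.0
        acc = sum(x[j] * v[j] for j in range(i - 1))
        x[i - 1] = (target - acc) / v[i - 1]
    # the last coordinate is the altitude over the base simplex
    x[k - 1] = sqrt(max(dists[0] ** 2 - sum(c * c for c in x), 0.0))
    return x

def base_simplex(D):
    # D[i][j]: distance between reference objects i and j (step 3).
    # Builds the vertices of the base simplex, padded to n coordinates.
    n = len(D)
    verts = [[0.0] * n]
    for i in range(1, n):
        v = new_apex([w[:i] for w in verts], [D[i][j] for j in range(i)])
        verts.append(v + [0.0] * (n - i))
    return verts
```

      <p>Per step 5, the apex for a data object o is then new_apex(verts, [d(o, p1), ..., d(o, pn)]), a point in n-dimensional Euclidean space; for genuinely Euclidean input data the inter-apex distances reproduce the original distances exactly.</p>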
      <sec id="sec-3-1">
        <title>Dissimilarity functions over the n-Simplex projected points</title>
        <p>
          For a fixed set of pivots {p1, ..., pn}, let fS : U → R^n be the n-Simplex
projection obtained using the pivots to form the vertices of the base simplex. We have
shown [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] that for any two objects o, q ∈ (U, d)

sqrt( Σ_{i=1..n} (xi − yi)² )  ≤  d(o, q)  ≤  sqrt( Σ_{i=1..n−1} (xi − yi)² + (xn + yn)² )   (3)
        </p>
        <p>where fS(o) = [x1, ..., xn] and fS(q) = [y1, ..., yn]. Thus the Euclidean distance
ℓ2(fS(o), fS(q)) between any two apexes is a lower bound of the distance between
the corresponding objects within the original space, and an upper bound can
be computed as well. Furthermore, as more dimensions are chosen for the base
simplex, these distance bounds asymptotically approach the true distance. The
lower bound (d′Lwb) is a proper metric function on R^n, while the upper bound
(d′Upb) is a dissimilarity function but not a proper metric, because it does not
satisfy the identity postulate (i.e. d′Upb(x, x) may not be equal to zero).</p>
        <p>
          Other dissimilarity functions over the apex vectors may be considered as
well; in particular, any function that always lies between the lower bound and the
upper bound will asymptotically approach the true distance as the number
of pivots increases. An example is the mean of the lower and upper bounds,
which has been proved to be particularly effective for similarity search tasks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
Here we introduce the Zenith dissimilarity function
        </p>
        <p>d′Zen(x, y) = sqrt( Σ_{i=1..n−1} (xi − yi)² + xn² + yn² ),  for all x, y ∈ R^n   (4)
which always satisfies d′Lwb(x, y) ≤ d′Zen(x, y) ≤ d′Upb(x, y), but has a
geometrical interpretation stronger than the arithmetic mean of the upper and
lower bounds. In fact the Zenith distance, as well as the lower bound and the
upper bound, can be interpreted as the Euclidean distance between two points
in R^(n+1). Figure 1 shows the case n = 2.</p>
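        <p>Equations (3) and (4) translate directly into code; a small sketch with our own names, where the final coordinate of each apex plays the role of the altitude over the base simplex:</p>

```python
from math import sqrt

def d_lwb(x, y):
    # lower bound of eq. (3): plain Euclidean distance between apexes
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def d_upb(x, y):
    # upper bound of eq. (3): last coordinates added rather than subtracted
    s = sum((a - b) ** 2 for a, b in zip(x[:-1], y[:-1]))
    return sqrt(s + (x[-1] + y[-1]) ** 2)

def d_zen(x, y):
    # Zenith dissimilarity of eq. (4)
    s = sum((a - b) ** 2 for a, b in zip(x[:-1], y[:-1]))
    return sqrt(s + x[-1] ** 2 + y[-1] ** 2)
```

        <p>On any pair of apex vectors, d_zen lies between d_lwb and d_upb, and its square equals the mean of their squares, i.e. the quadratic-mean relationship stated below.</p>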
        <p>In general, given the pivots {p1, ..., pn} and two objects o, q, we know that
there exists an isometric embedding of those n + 2 points into R^(n+1).</p>
        <p>
          Let vo, vq, vp1, ..., vpn be the mapped points, i.e. vectors such that d(o, q) =
ℓ2(vo, vq), d(o, pi) = ℓ2(vo, vpi), d(q, pi) = ℓ2(vq, vpi), and d(pi, pj) = ℓ2(vpi, vpj) for
all i, j = 1, ..., n. The points vp1, ..., vpn are the vertices of the so-called base
simplex, which lies in an (n − 1)-dimensional subspace. The points vp1, ..., vpn
and vo lie in a hyperplane of R^(n+1), which we refer to as H. By rotating vq
around the base simplex (i.e. rotating the point while preserving the distances to
the pivots) we obtain two possible projections onto the hyperplane H (one above
and one below the base simplex). Let vq+ and vq− be those points, which coincide
with the intersections of the hyperplane H and the n balls of centre vpi and radius
d(pi, q). In general, we don't explicitly know the distance d(o, q), so we don't
know the true altitude of the vertex vq over the hyperplane H. The vertices
vq+ and vq− are those obtained by approximating vq with points at zero altitude
such that the distances to the pivots are preserved. Another possible choice
is the zenith point vqzen, which is the point at the highest altitude
over H that still preserves the distances to the pivots. In [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] we show that the
quantities ℓ2(vo, vq+) and ℓ2(vo, vq−) provide a lower bound and an upper bound
of d(o, q). By construction, the quantity ℓ2(vo, vqzen) will always lie between
those bounds, and it corresponds to the quadratic mean of the lower and
upper bounds, i.e. d′Zen = sqrt( (d′Lwb² + d′Upb²) / 2 ).</p>
        <p>Fig. 1: Example of n-Simplex projection for n = 2, where the two pivots p1, p2
and two data objects o, q are first isometrically embedded in 3D Euclidean space
and then projected into the 2D Euclidean plane. The mapped points are indicated
with vp1, vp2, vo, vq in the graph. By rotating vq around the edge vp1 vp2 until it
is coplanar with vp1, vp2, vo, we obtain two possible apices (vq+ and vq−), which are
the intersections of the hyperplane containing vp1, vp2, vo and the two balls
centred at vp1, vp2 with radii d(q, p1), d(q, p2) respectively. The apex vqzen is that
at the highest altitude over the hyperplane that still preserves the distances to
the two pivots.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Application of n-Simplex to Levenshtein</title>
        <p>As previously mentioned, with respect to our techniques there is one essential
difference between the two metrics we are testing: Jensen-Shannon distance is
Hilbert-embeddable, and therefore has the n-point property; Levenshtein does
not. This means that, in the form defined above, it cannot be used to build a
simplex in n-dimensional space for an arbitrary n.</p>
        <p>
          However, as observed above, for any n &gt; 4 there exists some exponent αn with
0.72/n ≤ αn ≤ 0.5, such that for any α ∈ (0, αn] the space (S, dlev^α) satisfies the
n-point property [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Therefore, for a fixed α in the desired range we can use the
n-Simplex projection to transform the space (S, dlev^α) into (R^n, d′), where d′ is a
dissimilarity function over the projected points (e.g., the n-Simplex lower bound
or the zenith measure). If vi, vj ∈ R^n are the projections of the strings si, sj ∈ S,
then we can approximate the actual distance dlev(si, sj) with (d′(vi, vj))^(1/α). It
is worth noting that using the value α̃n = 0.72/n guarantees that (U, d^α̃n) has
the n-point property for any metric space (U, d); however for a specified metric
space it is possible to find a suitable exponent that is greater than α̃n. We
experimentally observed that, in our context, α = 0.5 works well for (S, dlev);
in fact we have found no counter-examples where it is impossible to embed any
n objects from our domain into an (n − 1)-dimensional Euclidean space. Thus
we applied the n-Simplex projection on (S, √dlev). Note we do not claim this
property for S in general, although we are interested in pursuing this further as
we suspect it to be the case.
        </p>
        <p>
          Given the above, when comparing the n-Simplex techniques in the context of
the Levenshtein distance, we first cast the space (S, dlev) into (S, √dlev) to
create the vector space, and then square the results before comparison. In terms of
intrinsic dimensionality, or discriminability, this is very undesirable; the
application of this metric-preserving transformation decreases the variance of measured
distances, which makes a proximity query more unstable, as the square-rooted
distance has less power in discriminating between nearest and furthest neighbours
[
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Thus, the techniques which do not require this mapping should
benefit considerably in terms of relative error. This effect shows in some of our
results, as mentioned later.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Analysis</title>
      <p>To test the five techniques outlined above, we use a single collection of strings
drawn from a public English lexicon of 69,069 words (Footnote 6) and consider Levenshtein
distance (dLev) and Jensen-Shannon distance (dJS) over this collection. For
varying numbers of pivots, we apply the five different methods and measure
distortion, relative error, and mean squared error.</p>
      <p>For each measured number n of pivots, we repeatedly select a random
subset of n strings to act as pivots, and a further random selection of 10,000
strings to act as the finite data set. We consider all pairwise distances within
the set, i.e. nearly 50 million distances, and for each pair measure both the
original distance and the mapped distance to calculate the properties listed. We
repeat this process until the standard error of the mean for each measurement
is within an acceptable limit, and then report the mean values obtained.</p>
      <p>Notationally, we refer to an original metric space (S, d) which is mapped into
a new space (S′, d′), where si, sj ∈ S map to s′i, s′j ∈ S′ and so on. We measure the
following properties of the mapped data:
Distortion: for each mapping, the distortion is the minimum R ≥ 1 such that
there exists a multiplying factor r &gt; 0 such that, for all si, sj ∈ S,
r · d′(s′i, s′j) ≤ d(si, sj) ≤ R · r · d′(s′i, s′j).
Footnote 6: available from www.sisap.org</p>
      <p>Note that distortion gives a "worst case" analysis and is driven by outliers
within the mapping rather than typical errors; the larger the sample of
distances, the larger the distortion is likely to be.</p>
      <p>Average Relative Error: for each pair of distances d(si, sj) and d′(s′i, s′j) we
calculate the ratio, and use the mean of these ratios as a scaling factor α.
Then the average relative error is calculated as

(1 / (|S| choose 2)) · Σ_{i,j} |d(si, sj) − α · d′(s′i, s′j)| / d(si, sj)</p>
      <p>Mean Squared Error: for each pair of distances d(si, sj) and d′(s′i, s′j) we
calculate the ratio, and use the mean of these ratios as a scaling factor α. Then
the mean squared error is calculated as

(1 / (|S| choose 2)) · Σ_{i,j} (d(si, sj) − α · d′(s′i, s′j))²
The results of these experiments are plotted in Figure 2.</p>
      <p>The clearest observation is that, for almost all measures, the nSimplex-Zenith
projection is the best. It is clearly by far the best estimator for all measures
over Jensen-Shannon distance, and also for Levenshtein in all aspects other than
distortion. Note the scale of the improvement is quite dramatic, the Y axis being
given in log scale for all charts. It is also notable how little the performance
degrades when fewer pivots are used, compared with the other projections.</p>
      <p>The n-Simplex lower-bound estimator also performs well, and our conclusion
is that, at least for the Jensen-Shannon space, the n-Simplex projection is clearly
preferable to any of the other established methods.</p>
      <p>The difference between the two metrics is, we believe, due to the presence
of the Hilbert embedding property for the Jensen-Shannon metric, and thus we
expect this behaviour to extend to other such metrics. The problem with
Levenshtein, which again we would expect to extend to other non-embeddable metrics,
is that in order to apply the projection it is necessary to scale the original
distance into a much less discriminable form; this we believe in particular is the
reason for the relatively poor figures for distortion. This effect deserves further
attention to give a better understanding.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>We have described the use of the n-Simplex projection to create mappings from
strings into vector spaces, and compared properties of these mappings with three
others. This is the first time n-Simplex has been demonstrated over string
functions, and in particular we have introduced for the first time the Zenith estimator
function. This has been shown to be significantly better in terms of the accuracy
of estimations in the mapped space than any previously published mapping.</p>
      <p>Fig. 2: Mean relative error, mean squared error, and distortion (log-scale Y
axis) plotted against the number of pivots, for each mapping over each of the
two metrics.</p>
      <sec id="sec-5-1">
        <title>Levenshtein Distance</title>
      </sec>
      <sec id="sec-5-2">
        <title>Jensen-Shannon Distance</title>
        <p>Note that both Pivoted Embedding and n-Simplex projection provide a mapping into
n dimensions. For the LMDS approach, the dimensionality of the projected space
equals the number of positive eigenvalues of the matrix of the squared distances
between the pivots; these are indicated by the numbers along the LMDS lines
in the graphs.</p>
        <p>Acknowledgements: Lucia Vadicamo acknowledges financial support from the VISECH ARCO-CNR
project, CUP B56J17001330004, and the Italian National Research Council (CNR)
for a Short-Term Mobility Grant (STM) at the University of Stirling.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chavez</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Falchi</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gennaro</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Reranking permutation-based candidate sets with the n-Simplex projection</article-title>
          .
          <source>In: Proceedings of SISAP 2018</source>
          . pp.
          <fpage>3</fpage>
          –
          <lpage>17</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Beyer</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goldstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramakrishnan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shaft</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>When is nearest neighbor meaningful?</article-title>
          <source>In: Proceedings of ICDT 1999</source>
          . pp.
          <fpage>217</fpage>
          –
          <lpage>235</lpage>
          . Springer (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Blumenthal</surname>
            ,
            <given-names>L.M.</given-names>
          </string-name>
          :
          <article-title>Theory and applications of distance geometry</article-title>
          . Clarendon Press (
          <year>1953</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardillo</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabitti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Hilbert exclusion: Improved metric search through finite isometric embeddings</article-title>
          .
          <source>ACM T Inform. Syst</source>
          .
          <volume>35</volume>
          (
          <issue>3</issue>
          ),
          <fpage>17:1</fpage>
          –
          <lpage>17:27</lpage>
          (Dec
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cardillo</surname>
            ,
            <given-names>F.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabitti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Supermetric search</article-title>
          .
          <source>Inform. Syst</source>
          .
          <volume>80</volume>
          ,
          <fpage>108</fpage>
          –
          <lpage>123</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vadicamo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabitti</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>High-dimensional simplexes for supermetric search</article-title>
          .
          <source>In: Proceedings of SISAP 2017</source>
          . pp.
          <fpage>96</fpage>
          –
          <lpage>109</lpage>
          . Springer (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>R.C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>MacKenzie-Leigh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>High dimensional search using polyhedral query</article-title>
          .
          <source>In: Proceedings of SISAP 2014</source>
          . pp.
          <fpage>176</fpage>
          –
          <lpage>188</lpage>
          . Springer (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Connor</surname>
            ,
            <given-names>R.C.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Simeoni</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iakovos</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>A bounded distance metric for comparing tree structure</article-title>
          .
          <source>Inform. Syst</source>
          .
          <volume>36</volume>
          (
          <issue>4</issue>
          ),
          <fpage>748</fpage>
          –
          <lpage>764</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>de Silva</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tenenbaum</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Global versus local methods in nonlinear dimensionality reduction</article-title>
          .
          <source>In: Proceedings NIPS</source>
          . vol.
          <volume>15</volume>
          , pp.
          <fpage>721</fpage>
          –
          <lpage>728</lpage>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Deza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maehara</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Metric transforms and Euclidean embeddings</article-title>
          .
          <source>T Am. Math. Soc</source>
          .
          <volume>317</volume>
          (
          <issue>2</issue>
          ),
          <fpage>661</fpage>
          –
          <lpage>671</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Dokmanic</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parhizkar</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ranieri</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vetterli</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Euclidean distance matrices: Essential theory, algorithms, and applications</article-title>
          .
          <source>IEEE Signal Proc. Mag</source>
          .
          <volume>32</volume>
          (
          <issue>6</issue>
          ),
          <fpage>12</fpage>
          –
          <lpage>30</lpage>
          (Nov
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kruskal</surname>
            ,
            <given-names>J.B.</given-names>
          </string-name>
          :
          <article-title>Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis</article-title>
          .
          <source>Psychometrika</source>
          <volume>29</volume>
          (
          <issue>1</issue>
          ),
          <fpage>1</fpage>
          –
          <lpage>27</lpage>
          (Mar
          <year>1964</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Manber</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Finding similar files in a large file system</article-title>
          .
          <source>In: USENIX Winter 1994 Technical Conference</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>10</lpage>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Menger</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>New Foundation of Euclidean Geometry</article-title>
          . Amer. J. Math.
          <volume>53</volume>
          (
          <issue>4</issue>
          ),
          <fpage>721</fpage>
          –
          <lpage>745</lpage>
          (
          <year>1931</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Proceedings of NIPS 2013</source>
          . pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          . Curran Associates, Inc. (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>GloVe: Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of EMNLP 2014</source>
          . pp.
          <fpage>1532</fpage>
          –
          <lpage>1543</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Spillmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neuhaus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunke</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pekalska</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duin</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Transforming strings to vector spaces using prototype selection</article-title>
          .
          <source>In: Structural, Syntactic, and Statistical Pattern Recognition</source>
          . pp.
          <fpage>287</fpage>
          –
          <lpage>296</lpage>
          . Springer (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>W.A.</given-names>
          </string-name>
          :
          <article-title>A Relation Between Metric and Euclidean Spaces</article-title>
          . Amer. J. Math.
          <volume>54</volume>
          (
          <issue>3</issue>
          ),
          <fpage>505</fpage>
          –
          <lpage>517</lpage>
          (
          <year>1932</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Zezula</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amato</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dohnal</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batko</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Similarity search: the metric space approach</article-title>
          , vol.
          <volume>32</volume>
          . Springer Science &amp; Business Media
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>