<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Speculative Execution of Similarity Queries: Real-Time Parameter Optimization through Visual Exploration</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>T. Spinner</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>U. Schlegel</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. Schall</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>F. Sperrle</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>R. Sevastjanova</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>B. Gobbo</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J. Rauscher</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>M. El-Assady</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>D. Keim</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff4">4</xref>
          <xref ref-type="aff" rid="aff6">6</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Central Projection View</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Parameter Design</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Politecnico di Milano</institution>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Tabular Results View</institution>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>University of Applied Sciences Konstanz</institution>
        </aff>
        <aff id="aff6">
          <label>6</label>
          <institution>University of Konstanz</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p />
      </abstract>
      <kwd-group>
        <kwd>Weight and Attributes Design</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>ABSTRACT</title>
      <p>The parameters of complex analytical models often have an
unpredictable influence on the models’ results, rendering parameter
tuning a non-intuitive task. By concurrently visualizing both the
model and its results, visual analytics tackles this issue,
supporting the user in understanding the connection between abstract
model parameters and model results. We present a visual
analytics system enabling result understanding and model refinement
on a ranking-based similarity search algorithm. Our system (1)
visualizes the results in a projection view, mapping their pair-wise
similarity to screen distance, (2) indicates the influence of model
parameters on the results, and (3) implements speculative
execution to enable real-time iterative refinement on the time-intensive
offline similarity search algorithm.</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        Similarity search in large database systems is a crucial feature
in many applications and often requires a manual adjustment of
parameters to suit various search scenarios [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Such parameters
are hard to optimize by randomly probing the search space, but
they significantly influence the retrieved results’ quality [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In
many cases, even experts with prior domain knowledge struggle
to understand the inner workings of the used mining models and
the influence of abstract model parameters, which prevents them
from reaching the desired analysis goal. Systematic steering and
exploration of different parameter settings can help to obtain the
proper combination more effectively. Thus, domain experts need
concurrent access to models, parameters, and results, enabling
them to understand how parameters influence the results and
how they can be refined to match the analysis goal.
      </p>
      <p>
        Visual analytics enables users to explore and analyze data and
models by providing integrated visual representations for data,
models, and parameters. Such visual techniques enable
interactive parameter adjustment during exploration and analysis [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Visual analytics bridges the gap between heuristics to find
suitable parameters and domain experts with the knowledge to steer
results in a human-centered direction. For instance, an interactive
visual what-if analysis helps experts understand black-box
model decisions by enabling direct data and parameter
manipulation [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. The comprehensive understanding of the relationship
between model parameter choices and outcome is a fundamental
requirement for well-informed decision making [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. By applying
standard visual analytics techniques, such as aggregation,
filtering, or speculative execution [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], the vast result and parameter
spaces can be explored interactively, even though the algorithms are
time- and resource-consuming. Thus, visual analytics supports
the comprehension of parameter choices in similarity search
applications for users and domain experts. Visual analytics enables
informed reasoning about a query’s results, allows the
understanding and diagnosis of parameters, and supports the user in
refining those parameters to get the best possible results.
      </p>
      <p>We propose a visual analytics workspace to support users in
result understanding and model refinement on a ranking-based
similarity search algorithm in the context of large data
foundations. Our system consists of a user-centered visualization of
parameters and results to facilitate the users’ exploration and
understanding of the parameter choices. We enable users to
interactively update model parameters based on their domain knowledge
and findings during the analysis process. Our visual analytics
system further facilitates real-time analysis using speculative
execution on a time-intensive similarity search algorithm, enabling
online exploration and execution of the offline algorithm.</p>
      <p>Summarizing, we present a visual analytics system for
similarity search, providing the following main contributions: (1)
our system supports the understanding of results and parameters,
emphasizing the most critical data characteristics by mapping
similarity to spatial distance and highlighting communities of
similar attribute combinations. (2) Our system enables the
diagnosis of results and parameters by allowing the real-time interactive
exploration of the parameter space to investigate the influence
of parameter choices, enabled by the speculative execution. (3)
Our system supports the refinement of the involved parameters,
supporting the iterative guided optimization of the model to solve
a given analysis task.</p>
    </sec>
    <sec id="sec-3">
      <title>RELATED WORK</title>
      <p>
        To cover the various involved research domains and applications,
we structure our related work into sub-topics, summarizing the
most relevant works regarding one aspect of our approach.
Visual Analytics Foundations — Similarity searches in large
database systems are often automatically executed using
predefined similarity functions and distance measures. However,
user-adaptable similarity search applications increase in
importance, and user integration rises [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. Visual analytics combines
automated analysis techniques with interactive visualizations to
enable users to understand and reason about large datasets [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Sacha et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] have presented a knowledge generation model
that describes how knowledge is generated during the
analysis process, building upon prior methodologies in visual
analytics [
        <xref ref-type="bibr" rid="ref12 ref2">2, 12</xref>
        ]. Besides the computer system that visualizes and
models data, they describe the human as a core element whose
creativity, interaction abilities, and perception help find and
comprehend patterns hidden in the data.
      </p>
      <p>
        Weight Space Exploration — As visual analytics is concerned
with integrating human knowledge with automated machine
learning, it is frequently used for model exploration and
optimization. Sedlmair et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provide a conceptual framework of
visual parameter space analysis, structuring the design space.
Pajer et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] present a tool for the visual analysis and exploration
of weight spaces, tackling the problem of setting abstract weight
parameters. Their tool supports the understanding of sensitivity
and helps identify weight regions of interest for a desired output.
Mühlbacher et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] present TreePOD, a sensitivity-aware
approach to selecting Pareto-optimal decision trees. In contrast to
most existing work, we tackle the exploratory analysis of
similarity queries and rely on the analyst’s intuition rather than on
quality metrics.
      </p>
      <p>
        Parameter Optimization for Mining Models — Parameter
optimization for data mining systems or hyper-parameter
optimization in machine learning is an open problem that frequently
occurs in scientific or industrial use-cases. Analytic
optimization or exhaustive search for parameter optimization is often
impossible in these models due to black-box methods or
high-dimensional parameter spaces. Torsney et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] apply a guided
semi-automatic method to this problem by first sampling from
the parameter space and then guiding the user by estimating
the effects of parameter changes on the result. Schall et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]
propose a heat-map method to superimpose the prediction of a
deep neural network over its input image. This allows the model
engineer to identify problems in the prediction and tune the
hyper-parameters accordingly. The resulting workflow is
iterative and guided by the provided visualization. This method is
applied to offline handwriting recognition, where spatial
information is essential but not available in ground-truth data.
      </p>
      <sec id="sec-3-1">
        <title>Speculative Execution and Guidance</title>
        <p>
          Sperrle et al. [<xref ref-type="bibr" rid="ref18">18</xref>] present an adaptation of speculative execution for visual
analytics to support exploratory model analysis and optimization.
Inspired by speculative execution in CPUs, they
define it as “the proactive, near-real-time computation of
competing model alternatives” to support model state-space exploration.
Our system uses speculative execution to execute queries
automatically using adapted weights, serving two purposes: first,
speculatively preparing those results while the system would
otherwise idle enables a near-real-time analysis of related parameter
configurations. Second, our system compares all obtained results
and guides the user in their exploration by visually highlighting
alternative feature weights that produce significantly different
results. In recent years, such guidance has been identified as one
of the main challenges in visual analytics [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
          ] characterized by
user and machine teaching each other while mutually learning
from each other [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Such guidance enables a more efficient
human-machine collaboration and paves the way towards true
mixed-initiative [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] systems.
        </p>
        <p>
          Application Background — Related to our application, we
focus on work for similarity search on heterogeneous data
collections. Gionis et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] tackle the curse of dimensionality for search
in high-dimensional attribute spaces by hashing data entities
and performing an approximate nearest-neighbor search on the
hashes. Sun et al. [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] present a metapath-based search algorithm,
deriving similarity from linkage paths in the network, addressing
the advent of heterogeneous information networks. Patroumpas
and Skoutas [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] frame the problem as search on enriched,
geographical data, i.e., geospatial attributes with additional textual,
numerical, or temporal information. Our approach builds upon
their work, tackling the open challenge of user-centered model
optimization.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>THE SIMILARITY SEARCH SYSTEM</title>
      <p>
        While search is an essential tool to locate entities of interest in
large data foundations, it has significant limitations when the
data distribution is unknown and, hence, explorative access to the
data is required. Specifically, the exact attribute combination of
the results might not be known beforehand, or multiple entities
in a particular region might be of interest. The used similarity
search (SimSearch) algorithm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] fulfills these requirements by
considering entities that feature attribute combinations close to
the desired search parameters. By specifying the number k of
ranked closest matches, the analyst can explore the region of
interest and refine the search parameters according to the
analysis goal. The high-dimensional search space poses particular
challenges for the visual representation of the results: pair-wise
distances between entities and the root search have to be
considered, as well as the influence of each single search attribute.
      </p>
      <p>The variety of data types and domains that might occur in the
data attributes requires the concurrent use of different distance
functions, rendering an objective comparison between the
obtained distances impossible. For example, a geospatial attribute
might be associated with a real-world geographical distance function,
while a numeric attribute could, for instance, use a
logarithmic distance function. Figure 4a illustrates the
non-comparability of those two measures in a two-axis plot. The
similarity search algorithm allows specifying weight parameters
in the interval [0; 1] to balance the distance functions between
different attributes, tackling this problem. Figure 4b illustrates
how applying weights can scale the search space accordingly.</p>
      <p>[Figure 2: Architecture of the SimSearch system, comprising the SimSearch Workspace (a) and the SimSearch VA Backend (b). Labels in the figure: Cache, Transform, Project, Redis, POST, DB, SQL, Relational Data.]</p>
      <p>However, no objective function can be optimized to determine
the ideal set of weights for a query automatically, since the ideal
weights depend heavily on the data domain and the analysis task,
rendering human feedback essential for parameter optimization.</p>
      <p>We, therefore, identify three fundamental challenges: (1) the
high-dimensional and interconnected results must be presented
such that the analyst understands their meaning, mapping
similarity to the spatial distances in the visualization, (2) the analyst
must understand the influence of the parameters on the results,
and (3) the interactive exploration of the parameter space must
be possible to refine the parameters targeting the analysis goal.</p>
      <p>Our proposed visual analytics system makes the similarity
search model accessible in a comprehensive workspace,
combining different views and panels to address the identified challenges.</p>
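      <p>The weighting of heterogeneous distance functions described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the overall distance is a weighted sum of per-attribute distances; the aggregation actually used by the similarity search algorithm is not specified in this section. The function names, the haversine formula for the geospatial attribute, and the example values are likewise assumptions.</p>
      <preformat><![CDATA[
```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def log_distance(x, y):
    """Distance between two positive numbers on a logarithmic scale."""
    return abs(math.log(x) - math.log(y))

def weighted_distance(query, entity, weights, distance_fns):
    """Weighted sum of per-attribute distances (assumed aggregation)."""
    total = 0.0
    for attr, fn in distance_fns.items():
        w = weights[attr]
        if not 0.0 <= w <= 1.0:
            raise ValueError(f"weight for {attr!r} must lie in [0, 1]")
        total += w * fn(query[attr], entity[attr])
    return total

# Hypothetical attributes and values, chosen only for illustration.
distance_fns = {"geolocation": haversine_km, "num_employees": log_distance}
query = {"geolocation": (45.46, 9.19), "num_employees": 50}
entity = {"geolocation": (45.07, 7.69), "num_employees": 500}

# With equal weights, the kilometre-scale geodistance dominates the log distance.
d_equal = weighted_distance(
    query, entity, {"geolocation": 1.0, "num_employees": 1.0}, distance_fns)
# Down-weighting the geospatial attribute rebalances both contributions.
d_scaled = weighted_distance(
    query, entity, {"geolocation": 0.01, "num_employees": 1.0}, distance_fns)
```
]]></preformat>
      <p>In this sketch, the two distance functions operate on different scales, so a weight well below 1 on the geospatial attribute is needed before the numeric attribute has a visible influence on the combined distance.</p>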
    </sec>
    <sec id="sec-5">
      <title>The Similarity Search Backend</title>
      <p>
        To avoid computationally expensive, time-consuming, and storage-intensive
operations in the frontend, our implementation splits the SimSearch
system into a frontend (2a) and a backend (2b). The backend
interfaces with the similarity search model, being exposed via a REST
API. The result of a request to the SimSearch API consists of
(1) a ranked list of the top-k similar results together with (2) a
k × k cross-similarity matrix, denoting the pair-wise similarities
between every two entities. The raw results are cached by the
backend application for later search queries with similar
parameters. The results are then transformed from the n-dimensional
attribute space down to the two-dimensional screen space and
converted into a graph representation using a specified projection
algorithm. We include different projection methods to achieve
good results for varying search attributes and input parameters:
for low-dimensional searches (n ≤ 4), the system supports PCA
and MDS, based directly on the attribute values or the
cross-similarity matrix, respectively. For higher-dimensional searches,
UMAP [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] can provide fast and stable projections highlighting
connections in the data while preserving its global topology. The
decisive criterion for choosing the provided projection methods
was their ability to derive a stable transformation under a
changing set of input vectors. The cross-similarity matrix is filtered
for its top k values and converted into a list representation to
reduce network load and computational complexity in the
frontend. Both results, the projected graph and the cross-similarity
list, are then cached and returned to the frontend application.
Figure 2 shows the architectural details of the system, including
data paths, caching, and the applied data transformations.
      </p>
      <p>
Caching Strategy and Requirements — The cache has a
crucial impact on the system’s responsiveness, requiring the caching
strategy to strike the best possible balance between data
freshness and system performance. Since this choice strongly
depends on how frequently data evolution events occur
in the data foundation, we tackle this challenge by occasionally
querying the similarity search engine even when the results are already
present in the cache. Since this strategy triggers requests for
multiple similar parameter combinations, the results in the local
search space are updated, maximizing the probability of future
cache hits with the most recent data entities. The cache’s required
storage space is negligible since, in a typical scenario, k = 50
can be taken as a reasonable upper bound for the top-k results
of interest. The storage consumption for a query grows linearly
with k, except for the k × k cross-similarity matrix, which grows
quadratically. Taking the upper bound of k = 50, we can estimate
its storage consumption as 50 · 50 · 64 bit = 160 000 bit ≈ 20 kB.
      </p>
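      <p>The storage estimate and the occasional refresh of cached results can be sketched as follows. This is a hypothetical illustration, not the backend implementation: the in-memory dictionary stands in for the actual cache, and the random refresh with a fixed probability is only one assumed way of "occasionally" re-querying the search engine. All names and the probability value are assumptions.</p>
      <preformat><![CDATA[
```python
import random

def cross_similarity_storage_bits(k, bits_per_value=64):
    """Storage for a dense k x k cross-similarity matrix, in bits."""
    return k * k * bits_per_value

class ResultCache:
    """In-memory cache that occasionally recomputes entries it already holds."""

    def __init__(self, refresh_probability=0.05, rng=None):
        self.refresh_probability = refresh_probability  # assumed value
        self.rng = rng or random.Random()
        self.store = {}

    def get(self, key, compute):
        # Occasionally query the engine again although the key is cached,
        # so that cached results follow changes in the underlying data.
        refresh = self.rng.random() < self.refresh_probability
        if refresh or key not in self.store:
            self.store[key] = compute()
        return self.store[key]

bits = cross_similarity_storage_bits(50)
kilobytes = bits / 8 / 1000  # decimal kilobytes
print(bits, kilobytes)

cache = ResultCache()
# Hypothetical cache key built from the query and its weights.
key = ("num_employees=50", ("geolocation", 0.5), ("num_employees", 1.0))
result = cache.get(key, lambda: ["entity-1", "entity-2"])
```
]]></preformat>
      <p>A higher refresh probability keeps the cached results closer to the current data at the cost of more requests to the search engine.</p>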
    </sec>
    <sec id="sec-6">
      <title>The Similarity Search Workspace</title>
      <p>To allow the interactive analysis of the SimSearch algorithm’s
results and enable informed decision making during the parameter
tuning process, our proposed similarity search workspace
combines multiple components in a comprehensive user interface,
shown in Figure 1.</p>
      <sec id="sec-6-1">
        <title>Central Projection View (1a)</title>
        <p>After defining a search query
and receiving the similarity search engine results, the analyst
must understand (1) the connection between results and root
query and (2) the pair-wise relationship between the results. The
SimSearch workspace is built around a central projection view,
mapping the n-dimensional data points to the two-dimensional
screen space while preserving the distances between entities
as well as possible. The search attributes are projected as an
additional, virtual entity to set the result entities into relation
with the specified search parameters.</p>
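        <p>As an illustration of a distance-preserving projection to screen space, the following sketch implements classical (Torgerson) MDS with NumPy. It shows one possible MDS variant and is not necessarily the one used in the system. Converting similarities to distances as d = 1 - s is an assumption, as are the function names. The search query itself could be included as an additional row and column of the matrix so that it is projected together with the results.</p>
        <preformat><![CDATA[
```python
import numpy as np

def classical_mds(distances, dims=2):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    d = np.asarray(distances, dtype=float)
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (d ** 2) @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)         # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:dims]          # indices of the largest eigenvalues
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def similarity_to_distance(similarity):
    """Assumed conversion for similarities in [0, 1]: d = 1 - s."""
    return 1.0 - np.asarray(similarity, dtype=float)

# Three points forming a 3-4-5 right triangle: the 2D embedding reproduces
# the pair-wise distances up to rotation and reflection.
dist = np.array([[0.0, 3.0, 4.0],
                 [3.0, 0.0, 5.0],
                 [4.0, 5.0, 0.0]])
coords = classical_mds(dist)
```
]]></preformat>
        <p>For data that cannot be embedded exactly in two dimensions, the embedding only approximates the given distances, which is why additional cues for pair-wise relations remain useful.</p>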
        <p>Besides the spatial position of entities in the result space, the
pair-wise relation between entities is essential to interpret
connections and reveal proximities in the data that the projection
could not preserve. Therefore, we indicate these relations by
extracting the top k values from the cross-similarity matrix and
displaying them as links between the respective entities. The
edges’ line width is proportional to the similarity between two
entities, visually highlighting the most important connections.</p>
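        <p>The extraction of the strongest links can be sketched as below. The sketch assumes a symmetric cross-similarity matrix given as nested lists, considers each unordered pair once, and scales the line width linearly so that the strongest link gets a hypothetical maximum width; the function name and the width value are assumptions.</p>
        <preformat><![CDATA[
```python
def top_k_links(similarity, k, max_width=6.0):
    """Return the k strongest pair-wise links as (i, j, similarity, line_width)."""
    n = len(similarity)
    # Only the upper triangle is needed for a symmetric matrix.
    pairs = [(similarity[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(reverse=True)
    strongest = pairs[:k]
    if not strongest:
        return []
    top = strongest[0][0]
    # Line width proportional to similarity, relative to the strongest link.
    return [(i, j, s, max_width * s / top if top > 0 else 0.0)
            for s, i, j in strongest]

similarity = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.6, 0.3],
    [0.2, 0.6, 1.0, 0.8],
    [0.1, 0.3, 0.8, 1.0],
]
links = top_k_links(similarity, k=3)
for i, j, s, width in links:
    print(f"link {i}-{j}: similarity {s:.1f}, width {width:.2f}")
```
]]></preformat>
        <p>Returning a short list instead of the full matrix also matches the goal of reducing the amount of data sent to the frontend.</p>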
        <p>Important information for each entity is attached directly to
the projected node: the similarity rank is annotated persistently
on each node, while the exact attribute combinations and
similarity scores for each attribute can be displayed by hovering over an
entity either in the projection view or in the tabular results view.
By coloring the results according to their spatial position in the
projection using a two-dimensional colormap, the entities are
visually clustered and linked to the tabular results view.</p>
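        <p>One way to colour entities by their position in the projection is bilinear interpolation between four corner colours. The sketch below is a hypothetical example of such a two-dimensional colormap; the corner colours, the normalisation to the unit square, and all names are assumptions and do not describe the colormap used in the workspace.</p>
        <preformat><![CDATA[
```python
def lerp(a, b, t):
    """Linear interpolation between a and b for t in [0, 1]."""
    return a + (b - a) * t

def colormap_2d(x, y, corners):
    """Bilinear interpolation of four corner RGB colours for x, y in [0, 1]."""
    bottom = [lerp(c0, c1, x)
              for c0, c1 in zip(corners["bottom_left"], corners["bottom_right"])]
    top = [lerp(c0, c1, x)
           for c0, c1 in zip(corners["top_left"], corners["top_right"])]
    return tuple(round(lerp(b, t, y)) for b, t in zip(bottom, top))

def normalise(points):
    """Scale projected 2D points to the unit square."""
    xs, ys = zip(*points)
    span_x = (max(xs) - min(xs)) or 1.0
    span_y = (max(ys) - min(ys)) or 1.0
    return [((x - min(xs)) / span_x, (y - min(ys)) / span_y) for x, y in points]

# Hypothetical corner colours (RGB, 0-255).
CORNERS = {
    "bottom_left": (0, 0, 255),
    "bottom_right": (255, 0, 0),
    "top_left": (0, 255, 0),
    "top_right": (255, 255, 0),
}

projected = [(-2.0, 1.0), (0.0, 3.0), (2.0, 5.0)]
colours = [colormap_2d(x, y, CORNERS) for x, y in normalise(projected)]
```
]]></preformat>
        <p>Because the colour depends only on the screen position, the same colour can be reused for the corresponding table row to link both views.</p>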
        <p>Besides displaying the inter-linkage, we also apply k-means
clustering to the projected points, reducing visual clutter by
forming local groups and highlighting results with spatial proximity
in the projection space instead of the attribute space. While the
cross-similarity would ideally correspond with the k-means
clusters in the n-dimensional space, this does not hold for the projected
entities since not all information can be preserved in the
projection. Therefore, the clustered entities can share similar attributes,
which, at the same time, might diverge from the most similar
entities denoted by the cross-similarity matrix. That is, entities might
be close in only a subset of their attributes, causing them to be
assigned to the same cluster, while the total similarity across all
attributes might be negligible, preventing their cross-similarity
link from being strong enough to be displayed.</p>
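        <p>Clustering in the projection space can be sketched with a plain implementation of Lloyd's k-means algorithm on the projected 2D points. The deterministic seeding, the fixed number of iterations, and the choice of the number of clusters are assumptions made to keep the example short and reproducible.</p>
        <preformat><![CDATA[
```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on 2D points; returns one cluster label per point."""
    # Deterministic seeding with evenly spaced input points (assumption).
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return labels

# Two visually separated groups of projected points.
projected = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels = kmeans(projected, k=2)
print(labels)
```
]]></preformat>
        <p>Since the clustering sees only screen coordinates, two entities may share a cluster here even when their cross-similarity over all attributes is low, as discussed above.</p>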
        <p>Tabular Result View (1b) — Complementing the projection
view, we include the tabular results view in the SimSearch
workspace, showing the ranked entities together with their attribute
set and the corresponding similarity scores. The table’s rows
are linked to the nodes in the projection view, simultaneously
highlighting a specific node in both views on mouse hover. By
clicking the table header for one attribute column, the column
can be re-ordered according to its contained values, enabling the
direct comparison between the individual similarity scores for
each attribute.</p>
        <p>Parameter Designer (1c) — The parameter designer is the
primary interface for specifying and refining search queries,
projection settings, and weight parameters. Search attributes can
be added from a list of all available attributes in the dataset,
allowing the analyst to set a target value for each selected parameter. A slider
attached to each attribute enables the analyst to set the attribute’s
relative importance concerning all other defined attributes,
giving full control over the balance between attributes and their
corresponding distance function.</p>
        <p>To diagnose the weight parameters’ influence on the result
set, hovering over a weight slider switches the projection and tabular
results views to the speculative execution state. In the
speculative execution state, the views indicate the change in the
result set under a speculative decrease and increase of the respective
attribute weight. In the projection view, this is done by inserting
the possible new positions of the entities under the changing
projection, marking the results under a positive weight
adjustment with a red outline and the results under a negative weight
adjustment with a green outline. Complementing the projection,
the tabular results view is extended by two additional columns,
indicating the change in each result entity’s rank and marking
entities that drop out of the top k results, causing them
to lose their place in the table.</p>
        <p>[Figure 3: Labels in the figure: Search A, Search B, Search C.]</p>
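        <p>The speculative decrease and increase of a weight and the resulting rank-change columns can be sketched as follows. The step size of 0.1, the clamping to the interval [0, 1], and the function names are assumptions; the ranked lists stand in for results that would come from the similarity search engine.</p>
        <preformat><![CDATA[
```python
def speculative_weights(weights, attribute, step=0.1):
    """Weight settings for a speculative decrease and increase of one attribute."""
    variants = {}
    for name, delta in (("decrease", -step), ("increase", step)):
        changed = dict(weights)  # leave the current setting untouched
        changed[attribute] = min(1.0, max(0.0, weights[attribute] + delta))
        variants[name] = changed
    return variants

def rank_changes(current, speculative):
    """Rank change per current entity; None marks entities leaving the top k.

    A positive value means the entity moves down in the ranking.
    """
    new_rank = {entity: rank for rank, entity in enumerate(speculative, start=1)}
    changes = {}
    for rank, entity in enumerate(current, start=1):
        changes[entity] = new_rank[entity] - rank if entity in new_rank else None
    return changes

weights = {"geolocation": 0.5, "num_employees": 1.0}
variants = speculative_weights(weights, "num_employees")

# Hypothetical top-4 rankings before and after a speculative weight change.
current = ["A", "B", "C", "D"]
speculative = ["B", "A", "D", "E"]
changes = rank_changes(current, speculative)
print(changes)
```
]]></preformat>
        <p>In this example, entity C is marked with None because it no longer appears among the speculative results.</p>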
        <p>Time-consuming search operations are executed speculatively
before an actual user interaction is performed, enabling the
iterative refinement of search parameters. When the user performs an
action, and the resulting parameter combination causes a cache
hit, the results can be delivered and visualized in real-time.
Besides increasing responsiveness, the cache accumulates more and
more discrete samples of the local search space as the analysis
proceeds. By relating the frequency of an
entity in the latest result sets to the total number
of results, we derive a measure for an entity’s stability over the
changing search parameters, as shown in Figure 3. The stability
is then mapped to the node size in the projection view, with
larger nodes indicating entities that appear more frequently in
the recent result sets.</p>
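        <p>The stability measure can be sketched as the fraction of recent result sets in which an entity appears. Normalising by the number of result sets considered is an assumption about how the frequency is related to the total, and the linear mapping to a node radius with the given bounds is hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter

def stability(result_sets):
    """Fraction of the given result sets in which each entity appears."""
    counts = Counter(entity for results in result_sets for entity in set(results))
    return {entity: count / len(result_sets) for entity, count in counts.items()}

def node_radius(score, min_radius=4.0, max_radius=12.0):
    """Linear mapping from a stability score in [0, 1] to a node radius."""
    return min_radius + (max_radius - min_radius) * score

# Hypothetical result sets of three recent, similar searches.
recent = [["A", "B", "C"], ["A", "B", "D"], ["A", "E", "F"]]
scores = stability(recent)
radii = {entity: node_radius(score) for entity, score in scores.items()}
```
]]></preformat>
        <p>Entity A appears in all three result sets and therefore receives the largest node, while entities that appear only once receive the smallest.</p>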
      </sec>
    </sec>
    <sec id="sec-7">
      <title>USE-CASES</title>
      <p>We show the applicability and advantages of the proposed
SimSearch workspace based on two exemplary use-cases. The first
use-case (subsection 4.1) is hands-on and describes in detail how
our proposed system can be used to reach the analysis goal,
while the second use-case (subsection 4.2) demonstrates how our
system can be applied to varying tasks and domains.</p>
    </sec>
    <sec id="sec-8">
      <title>Assessing the Local Business Landscape</title>
      <p>This use-case is based on a real-world, large-scale (≈ 120 GB)
dataset containing information about companies in Italy.</p>
      <p>In the use-case, a small company with ≈ 50 employees plans to
expand, for which several potential new locations are considered.
Since the company is dependent on the local infrastructure and
other supplying companies, geographical proximity to those
companies is an essential requirement. Simultaneously, the company
wants to avoid direct local competition from other companies
working in the same sector and having a similar corporate
structure. Our proposed SimSearch workspace supports the search
and interactive exploration of the potential company locations
to fulfill the company’s requirements.</p>
      <p>By specifying the attribute combinations in the parameter
designer according to the desired or undesired company profiles
together with the considered company location, the local search
space can be explored. The projection view reveals the most
similar companies and indicates their pair-wise relationships,
revealing communities and enabling the analyst to assess the most
influential search attributes. In doing so, it becomes clear that the
geolocation only has marginal influence on the search results,
and the shown companies are too far away for a business relationship.</p>
      <p>Since the numerical search attributes, such as the number of
employees, cannot be objectively compared to the geospatial
company location, the weights in the parameter designer have
to be iteratively refined to match the analyst’s understanding of
each attribute’s desired influence on the results. Figure 4 shows
how the weight adjustment helps to balance the different
distance functions. By indicating the changes in the result set for a
possible weight adjustment, the analyst can exploit the system’s
speculative execution feature to observe changes in real time
and assess the most purposeful operation before the actual,
time-consuming execution. Using the tabular results view, the analyst
can verify the possible changes in detail by observing how each
attribute’s ranking would change under the operation or if the
company would be excluded from the result set. By iteratively
refining the search parameters, the analyst can explore the search
space ideally for each potential location, leading to well-informed
decision making for the new location.</p>
      <p>[Figure 4: Pair-wise similarities s(c0,c1), s(c1,c2), and s(c0,c2) plotted over the axes Δ num_employees and Δ geolocation. (a) Unweighted axes. (b) Axes scaled by the weights: w1 · Δ num_employees and w0 · Δ geolocation.]</p>
    </sec>
    <sec id="sec-9">
      <title>Mail Forwarding</title>
      <p>This use-case is based on internal mail forwarding within a large
company. Incoming postal mail is automatically opened and
digitized, using an OCR system, on arrival at the company
headquarters. The digitized mail item is then used as a search query
on a structured database of the company’s customers, contracts,
products, or projects to electronically forward the scanned
document to the staff responsible for the task that requires the
document. This use-case requires both a robust search engine for
retrieving database entries (e.g., contracts) containing keywords,
names, or numerical values similar to the query document and
semantic understanding of the content to weigh these attributes.</p>
      <p>The structured data in the database consists of categorical
attributes, person or item names, as well as spatial, temporal,
numerical, or general ontological values. These may occur within
the scanned document with different individual similarities as
well as in many different combinations. Thus, the need arises
to weigh these database attributes against each other to model
the overall semantic similarity. This configuration of the search
query is likely done by a human engineer with expert domain
knowledge. One approach is to use a set of example
documents for evaluation, repeatedly querying for them and
modifying the attribute weights until relevant database entries
are found with high overall similarity to the query document,
while less relevant entries are significantly dissimilar.</p>
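      <p>The tuning loop described above can be illustrated with a small sketch. It assumes, purely for illustration, two attributes, a normalized weighted sum as the overall similarity, and a margin between the weakest relevant and the strongest irrelevant entry as the quality criterion. The entries, their similarity values, and the names ENTRIES, margin, and best_weights are hypothetical, and the exhaustive grid search only stands in for the engineer’s manual adjustments.</p>
      <preformatted>
```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical similarities of database entries to one example document:
# (name_similarity, amount_similarity, is_relevant); values are made up.
ENTRIES: Dict[str, Tuple[float, float, bool]] = {
    "contract_a": (0.9, 0.4, True),
    "contract_b": (0.8, 0.7, True),
    "contract_c": (0.3, 0.9, False),
    "contract_d": (0.2, 0.1, False),
}


def margin(w_name: float, w_amount: float) -> float:
    """Gap between the weakest relevant and the strongest irrelevant entry."""
    total = w_name + w_amount  # weights are assumed to be positive
    sims = {
        entry: (w_name * s_name + w_amount * s_amount) / total
        for entry, (s_name, s_amount, _) in ENTRIES.items()
    }
    relevant = [sims[e] for e, v in ENTRIES.items() if v[2]]
    irrelevant = [sims[e] for e, v in ENTRIES.items() if not v[2]]
    return min(relevant) - max(irrelevant)


def best_weights(grid: List[float]) -> Tuple[float, float]:
    """Try every weight pair from the grid and keep the largest margin."""
    return max(product(grid, repeat=2), key=lambda w: margin(*w))
```
      </preformatted>
      <p>A positive margin means that every relevant entry is more similar to the example document than any irrelevant one; on the toy data, best_weights([0.5, 1.0, 2.0]) selects the pair (2.0, 0.5), which favors the name attribute.</p>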
      <p>We propose to use SimSearch in this process both to see the
overall similarity of the different database entries under the
current configuration and to identify clusters in the embedding space.
The embedding method will be chosen to reflect the expert’s
domain knowledge of semantically different and similar documents.
Cross-similarities will reveal potential misclassifications. This
allows adjusting the weights of the similarity search to increase
the similarity to semantically relevant documents and separate
them from semantically distinct ones.</p>
    </sec>
    <sec id="sec-10">
      <title>DISCUSSION AND FUTURE WORK</title>
      <p>
        While the presented similarity search workspace implements a
variety of features and techniques to make the data search space
and the model parameter space accessible to the analyst, several
extensions could further strengthen the system’s usefulness.
Such extensions could include improvements to the search
functionality and the explanation of results or the implementation
of advanced guiding techniques. Furthermore, the presented
approach could be generalized to other domains and tasks with
a similar problem setting, i.e., where high-dimensional result
entities of complex mining models have to be visualized, and the
model must be refined to match a particular analysis task.
      </p>
      <p>Extending the Search Functionality: Additional views could
augment the existing visualizations with an abstract overview of
possible actions and the resulting changes, enabling the analyst
to identify possible changes at first glance before descending into
detailed views. For example, an additional view visualizing all
possible weight combinations probed by the speculative
execution component and their likely outcomes could provide first
hints where the region of interest might be located. Additional
interestingness measures could augment the parameter designer’s
weight sliders with information on the intervals corresponding
to the most significant changes in the result set. Extending the
interestingness feature, decision boundaries could be estimated
by probing the search space in regions with a high gradient,
providing a sensitivity analysis for each parameter.
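A minimal sketch of such an interval-wise interestingness measure could look as follows, under the assumption that interestingness is the number of entries entering or leaving the result set between two adjacent slider positions. The function toy_search is a hypothetical stand-in for the time-consuming similarity search, and refine bisects the most interesting interval as a simple form of probing regions with a high gradient.
        <preformatted>
```python
from typing import Callable, FrozenSet, List, Tuple

Search = Callable[[float], FrozenSet[str]]


def toy_search(weight: float) -> FrozenSet[str]:
    """Hypothetical stand-in for the slow search: ids of the top-2 results."""
    # Made-up distances per candidate: (weighted attribute, other attribute).
    dist = {"c1": (0.1, 0.8), "c2": (0.7, 0.3), "c3": (0.4, 0.4)}
    ranked = sorted(dist, key=lambda c: weight * dist[c][0] + dist[c][1])
    return frozenset(ranked[:2])


def interval_changes(search: Search,
                     samples: List[float]) -> List[Tuple[float, float, int]]:
    """Count result-set changes between each pair of adjacent samples."""
    results = [search(w) for w in samples]  # a real system would cache these
    return [
        (lo, hi, len(a.symmetric_difference(b)))
        for lo, hi, a, b in zip(samples, samples[1:], results, results[1:])
    ]


def refine(search: Search, samples: List[float], rounds: int) -> List[float]:
    """Repeatedly bisect the interval with the largest result change."""
    samples = sorted(samples)
    for _ in range(rounds):
        lo, hi, change = max(interval_changes(search, samples),
                             key=lambda t: t[2])
        if change == 0:  # nothing changes anywhere, so stop probing
            break
        samples = sorted(samples + [(lo + hi) / 2])
    return samples
```
        </preformatted>
With the toy data, refine(toy_search, [0.0, 1.0, 2.0], 2) returns [0.0, 0.5, 0.75, 1.0, 2.0], narrowing the only result change down to the slider interval between 0.75 and 1.0.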
      </p>
      <p>Extending Guidance: The system currently provides
orienting guidance, alerting users to similar weight
configurations that produce significantly different search results. In
addition to highlighting different possible weight settings, the
system could actively propose user actions like moving weight
sliders or switching to different projection methods. By analyzing
and learning from user interactions, the system could identify
the users’ preferences and provide suggestions adapted to their
understanding of the domain and analysis task. Giving the
system more initiative in the exploration process
should make it both more effective and efficient to use.
      </p>
      <p>Generalization as Visual Analytics Technique: There are
several other problems in automated data mining pipelines with
the same or a similar structure as the similarity search
application addressed in the presented system, such as clustering,
classification, or graph merging. Specifically, our approach can
be generalized to understand, diagnose, and refine models where
(1) the result is a number of n-dimensional entities with arbitrary
associated distance functions, and (2) the outcome depends on a
set of parameters whose influence on individual results is opaque.
      </p>
      <p>Scalability: The system’s scalability depends directly on
the underlying similarity search algorithm. Despite implementing
various techniques (caching, speculative execution) to enable
interactive visual analytics on the offline search algorithm, the
similarity search model’s response time remains the limiting factor for
the approach. While response times of 1 to 30 s can be bridged
by applying the implemented techniques, longer response times
render an online analysis increasingly difficult since (1) non-ideal
sampling points might have been chosen for speculative
execution or (2) the analyst might change the search space context
more rapidly than results can be preemptively queried and cached.
The response times of the similarity search algorithm could be
reduced by parallelizing the main stages of the algorithm, namely
(1) generating a ranked list of results for each queried attribute
and (2) compiling the ranked lists into a list of top-k results [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
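The two stages can be sketched as follows, assuming for illustration that each per-attribute ranking is an independent, I/O-bound query against the search backend and that the lists are merged by a weighted score sum; the aggregation used by the cited algorithm may differ. The names SCORES, ranked_list, and top_k and all values are hypothetical.
        <preformatted>
```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical per-attribute similarity scores in [0, 1]; values are made up.
SCORES: Dict[str, Dict[str, float]] = {
    "geolocation": {"c1": 0.9, "c2": 0.3, "c3": 0.6},
    "num_employees": {"c1": 0.2, "c2": 0.7, "c3": 0.6},
}


def ranked_list(attribute: str) -> List[Tuple[str, float]]:
    """Stage 1: rank all entities for one attribute, most similar first."""
    scores = SCORES[attribute]  # stands in for a slow backend query
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Stage 2: merge the per-attribute lists by a weighted score sum."""
    attributes = list(weights)
    with ThreadPoolExecutor() as pool:
        # One task per attribute; map preserves the order of the inputs.
        lists = list(pool.map(ranked_list, attributes))
    total: Dict[str, float] = {}
    for attribute, ranking in zip(attributes, lists):
        for entity, score in ranking:
            total[entity] = total.get(entity, 0.0) + weights[attribute] * score
    return sorted(total, key=total.get, reverse=True)[:k]
```
        </preformatted>
Here, top_k({"geolocation": 1.0, "num_employees": 1.0}, 2) yields ["c3", "c1"]. Threads only help if the per-attribute queries spend their time waiting on the backend; CPU-bound ranking in Python would call for processes instead.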
      </p>
      <p>Limitations and Future Work: Currently, views of higher
abstraction that give the analyst reference points on promising
analysis directions are missing. We will tackle this issue by adding a
third view to the similarity search workspace, showing all
possible weight combinations in a matrix view and indicating the
regions of the highest expected result change. Currently, the
analyst has to evaluate the speculative changes in results manually
by observing the predicted outcomes and comparing them across
the different parameter combinations. In future versions, we will
automatically highlight regions of interest using the number
of changes in the result set for each combination as an
interestingness measure. This functionality will be strengthened by
implementing interactive, adaptive guidance. If one operation
has significantly higher interestingness than others, it will be
actively proposed as a possibly rewarding action. Furthermore,
by tracking recent interactions of the user with the system, we
will estimate the likelihood of future interactions based on the
history, adapting the guidance to user preferences. Although the
presented use-cases demonstrate our approach’s applicability in
different real-world scenarios and data domains, a future user study
will further validate the system’s usefulness and provide insights
on both benefits and open challenges. Besides measuring
quantitative criteria, such as task completion time, and comparing the
analysis results to ground-truth data, a qualitative
evaluation will expose additional user requirements and future
points for improvement of the system.
      </p>
    </sec>
    <sec id="sec-11">
      <title>CONCLUSION</title>
      <p>Applying complex data mining models to large data foundations
introduces particular challenges to the analysis process. Both the
parameter space and the search space might be opaque, requiring
manual probing to approach the regions of interest and, hence,
rendering an interactive exploration impossible. With visual
analytics, models, parameters, and results can be made accessible
through interconnected visualizations, revealing hidden
connections between components and providing advanced mechanisms,
such as speculative execution, to enable the real-time exploration
of otherwise time-consuming data processing pipelines.</p>
      <p>The presented system implements views and techniques to
make the parameters and results of a novel similarity search
algorithm accessible to the analyst. Specifically, we provide a
projected view of the search results, highlighting the similarity
to the root query, the pair-wise similarity between the result
entities, the stability of the results, as well as communities of close
entities. The projected view is complemented with and linked to
a tabular view of the results, indicating their rank and providing
sorting functions on distinct attributes or their corresponding
similarity. Supporting parameter refinement and search space
exploration, the system implements speculative execution on the
time-consuming similarity search operation, presenting the user
with possible outcomes of parameter changes on demand before
an action is actually performed. The projection and tabular views
are coupled with the parameter refinement functionality,
integrating the speculative results into their visual representation.</p>
      <p>We show our proposed similarity search workspace’s
applicability and usefulness based on two use-cases, both anchored in
real-world application examples and datasets.</p>
    </sec>
    <sec id="sec-12">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work has received funding from the European Union’s
Horizon 2020 research and innovation programme under grant
agree</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>James F.</given-names>
            <surname>Allen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Curry I.</given-names>
            <surname>Guinn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Horvitz</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Mixed-initiative interaction</article-title>
          .
          <source>IEEE Intelligent Systems and their Applications</source>
          <volume>14</volume>
          ,
          <issue>5</issue>
          ,
          <fpage>14</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Brehmer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tamara</given-names>
            <surname>Munzner</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>A multi-level typology of abstract visualization tasks</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>19</volume>
          ,
          <issue>12</issue>
          ,
          <fpage>2376</fpage>
          -
          <lpage>2385</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Davide</given-names>
            <surname>Ceneda</surname>
          </string-name>
          , Theresia Gschwandtner, and
          <string-name>
            <given-names>Silvia</given-names>
            <surname>Miksch</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>A Review of Guidance Approaches in Visual Data Analysis: A Multifocal Perspective</article-title>
          .
          <source>Comput. Graphics Forum</source>
          <volume>38</volume>
          ,
          <issue>3</issue>
          ,
          <fpage>861</fpage>
          -
          <lpage>879</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Andrienko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schreck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Choo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Engelke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jena</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Dwyer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Guidance in the human-machine analytics process</article-title>
          .
          <source>Visual Informatics</source>
          <volume>2</volume>
          ,
          <fpage>166</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Aristides</given-names>
            <surname>Gionis</surname>
          </string-name>
          , Piotr Indyk, and
          <string-name>
            <given-names>Rajeev</given-names>
            <surname>Motwani</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Similarity Search in High Dimensions via Hashing</article-title>
          .
          <source>In Proc. of the 25th Intl. Conference on Very Large Data Bases (VLDB '99)</source>
          . San Francisco, CA, USA,
          <fpage>518</fpage>
          -
          <lpage>529</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Keim</surname>
          </string-name>
          , Jörn Kohlhammer, Geoffrey Ellis, and
          <string-name>
            <given-names>Florian</given-names>
            <surname>Mansmann</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Mastering The Information Age - Solving Problems with Visual Analytics</article-title>
          . Eurographics Association.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
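          <string-name>
            <given-names>Sean D</given-names>
            <surname>MacArthur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Carla E</given-names>
            <surname>Brodley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Avinash C</given-names>
            <surname>Kak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Lynn S</given-names>
            <surname>Broderick</surname>
          </string-name>
          .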
          <year>2002</year>
          .
          <article-title>Interactive content-based image retrieval using relevance feedback</article-title>
          .
          <source>Comput. Vision and Image Understanding</source>
          <volume>88</volume>
          ,
          <issue>2</issue>
          ,
          <fpage>55</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Leland</given-names>
            <surname>McInnes</surname>
          </string-name>
          , John Healy, and
          <string-name>
            <given-names>James</given-names>
            <surname>Melville</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction</article-title>
          . arXiv:stat.ML/1802.03426
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mühlbacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Linhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Möller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Piringer</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>TreePOD: Sensitivity-Aware Selection of Pareto-Optimal Decision Trees</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>24</volume>
          ,
          <issue>1</issue>
          ,
          <fpage>174</fpage>
          -
          <lpage>183</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pajer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Streit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Torsney-Weir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Spechtenhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Möller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Piringer</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>WeightLifter: Visual Weight Space Exploration for Multi-Criteria Decision Making</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>23</volume>
          ,
          <issue>1</issue>
          ,
          <fpage>611</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Kostas</given-names>
            <surname>Patroumpas</surname>
          </string-name>
          and
          <string-name>
            <given-names>Dimitrios</given-names>
            <surname>Skoutas</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Similarity search over enriched geospatial data</article-title>
          .
          <source>In Proc. of the Sixth Intl. ACM SIGMOD Workshop on Managing and Mining Enriched Geo-Spatial Data. ACM.</source>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Pirolli</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Card</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis</article-title>
          .
          <source>In Proc. of Intl. Conference on Intelligence Analysis</source>
          , Vol.
          <volume>5</volume>
          .
          McLean, VA, USA,
          <fpage>2</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Dominik</given-names>
            <surname>Sacha</surname>
          </string-name>
          , Michael Sedlmair, Leishi Zhang, John A Lee, Jaakko Peltonen, Daniel Weiskopf, Stephen C North, and Daniel A Keim.
          <year>2017</year>
          .
          <article-title>What you see is what you can change: Human-centered machine learning by interactive visualization</article-title>
          .
          <source>Neurocomputing</source>
          <volume>268</volume>
          ,
          <fpage>164</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Dominik</given-names>
            <surname>Sacha</surname>
          </string-name>
          , Andreas Stoffel, Florian Stoffel, Bum Chul Kwon, Geoffrey Ellis, and
          <string-name>
            <given-names>Daniel A.</given-names>
            <surname>Keim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Knowledge Generation Model for Visual Analytics</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>20</volume>
          ,
          <issue>12</issue>
          ,
          <fpage>1604</fpage>
          -
          <lpage>1613</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Schall</surname>
          </string-name>
          , Dominik Sacha, Manuel Stein, Matthias O Franz, and Daniel A Keim.
          <year>2018</year>
          .
          <article-title>Visualization-assisted development of deep learning models in offline handwriting recognition</article-title>
          .
          <source>In Symp. on Vis. in Data Science at IEEE VIS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sedlmair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Heinzl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bruckner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Piringer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Möller</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Visual Parameter Space Analysis: A Conceptual Framework</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>20</volume>
          ,
          <issue>12</issue>
          ,
          <fpage>2161</fpage>
          -
          <lpage>2170</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Seidl</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hans-Peter</given-names>
            <surname>Kriegel</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Efficient user-adaptable similarity search in large multimedia databases</article-title>
          .
          <source>In VLDB</source>
          , Vol.
          <volume>97</volume>
          .
          <fpage>506</fpage>
          -
          <lpage>515</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Sperrle</surname>
          </string-name>
          , Jürgen Bernard, Michael Sedlmair, Daniel Keim, and
          <string-name>
            <given-names>Mennatallah</given-names>
            <surname>El-Assady</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Speculative Execution for Guided Visual Analytics</article-title>
          . arXiv:cs.HC/1908.02627v1
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Fabian</given-names>
            <surname>Sperrle</surname>
          </string-name>
          , Astrik Jeitler, Jürgen Bernard, Daniel A. Keim, and
          <string-name>
            <given-names>Mennatallah</given-names>
            <surname>El-Assady</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Learning and Teaching in Co-Adaptive Guidance for Mixed-Initiative Visual Analytics</article-title>
          . In EuroVis Workshop on Visual Analytics (EuroVA),
          <string-name>
            <given-names>K.</given-names>
            <surname>Vrotsou</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Turkay</surname>
          </string-name>
          (Eds.).
          <source>The Eurographics Association.</source>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Thilo</given-names>
            <surname>Spinner</surname>
          </string-name>
          , Udo Schlegel, Hanna Schäfer, and
          <string-name>
            <given-names>Mennatallah</given-names>
            <surname>El-Assady</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>explAIner: A visual analytics framework for interactive and explainable machine learning</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>26</volume>
          ,
          <issue>1</issue>
          ,
          <fpage>1064</fpage>
          -
          <lpage>1074</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Yizhou</given-names>
            <surname>Sun</surname>
          </string-name>
          , Jiawei Han, Xifeng Yan,
          <string-name>
            <given-names>Philip S.</given-names>
            <surname>Yu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Tianyi</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>PathSim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information Networks</article-title>
          .
          <source>Proc. of the VLDB Endowment</source>
          <volume>4</volume>
          ,
          <issue>11</issue>
          ,
          <fpage>992</fpage>
          -
          <lpage>1003</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Torsney-Weir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ahmed</given-names>
            <surname>Saad</surname>
          </string-name>
          , Torsten Möller,
          <string-name>
            <given-names>Hans-Christian</given-names>
            <surname>Hege</surname>
          </string-name>
          , Britta Weber,
          <string-name>
            <given-names>Jean-Marc</given-names>
            <surname>Verbavatz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Steven</given-names>
            <surname>Bergner</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Tuner: Principled parameter finding for image segmentation algorithms using visual response surface exploration</article-title>
          .
          <source>IEEE Trans. on Vis. and Comput. Graphics</source>
          <volume>17</volume>
          ,
          <issue>12</issue>
          ,
          <fpage>1892</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>