Speculative Execution of Similarity Queries: Real-Time Parameter Optimization through Visual Exploration

T. Spinner¹, U. Schlegel¹, M. Schall¹,², F. Sperrle¹, R. Sevastjanova¹, B. Gobbo³, J. Rauscher¹, M. El-Assady¹, D. Keim¹
¹ University of Konstanz   ² University of Applied Sciences Konstanz   ³ Politecnico di Milano


[Figure 1 panels: Parameter Design, Central Projection View, Tabular Results View. Annotations: Algorithmic,
Specific K, Projection, Weight and Attributes Design, Root Query, Cross Similarity View, Speculative Result
Change, Speculative Results, Rank Tendency.]


Figure 1: The SimSearch workspace, built around the central projection view (a), showing the projected results of the
similarity search algorithm for the query defined in the parameter designer (b). The tabular results view (c) shows the same
results, linked to the projected nodes by color and hover events. Both the projection and table view switch to a speculative
execution state on hovering a weight slider, indicating the changes occurring if the weight is adjusted accordingly.

ABSTRACT
The parameters of complex analytical models often have an unpredictable influence on the models' results,
rendering parameter tuning a non-intuitive task. By concurrently visualizing both the model and its results,
visual analytics tackles this issue, supporting the user in understanding the connection between abstract
model parameters and model results. We present a visual analytics system enabling result understanding and
model refinement on a ranking-based similarity search algorithm. Our system (1) visualizes the results in a
projection view, mapping their pair-wise similarity to screen distance, (2) indicates the influence of model
parameters on the results, and (3) implements speculative execution to enable real-time iterative refinement
on the time-intensive offline similarity search algorithm.

© 2021 Copyright for this paper by its author(s). Published in the Workshop Proceedings of the EDBT/ICDT 2021
Joint Conference (March 23–26, 2021, Nicosia, Cyprus) on CEUR-WS.org. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0)

1    INTRODUCTION
Similarity search in large database systems is a crucial feature in many applications and often requires a
manual adjustment of parameters to suit various search scenarios [17]. Such parameters are hard to optimize
by randomly probing the search space, but they significantly influence the retrieved results' quality [7]. In
many cases, even experts with prior domain knowledge struggle to understand the inner workings of the used
mining models and the influence of abstract model parameters, which prevents them from reaching the desired
analysis goal. Systematic steering and exploration of different parameter settings can help to obtain the
proper combination more effectively. Thus, domain experts need concurrent access to models, parameters, and
results, enabling them to understand how parameters influence the results and how they can be refined to
match the analysis goal.

Visual analytics enables users to explore and analyze data and models by providing integrated visual
representations for data, models, and parameters. Such visual techniques enable interactive parameter
adjustment during exploration and analysis [6]. Visual analytics bridges the gap between heuristics to find
suitable parameters and domain experts with the knowledge to steer results in a human-centered direction. For
instance, a visual interactive what-if analysis helps experts understand black-box model decisions by
enabling direct data and parameter manipulation [13]. The comprehensive understanding of the relationship
between model parameter choices and outcome is a fundamental requirement for well-informed decision making
[20]. By applying standard visual analytics techniques, such as aggregation, filtering, or speculative
execution [18], the vast result and parameter spaces can be interactively explored, despite the algorithms
being time- and resource-consuming. Thus, visual analytics supports the comprehension of parameter choices in
similarity search applications for users and domain experts. Visual analytics enables informed reasoning
about a query's results, allows the understanding and diagnosis of parameters, and supports the user in
refining those parameters to get the best possible results.

We propose a visual analytics workspace to support users in result understanding and model refinement on a
ranking-based similarity search algorithm in the context of large data foundations. Our system consists of a
user-centered visualization of
parameters and results to facilitate the users' exploration and understanding of the parameter choices. We
enable users to interactively update model parameters based on their domain knowledge and findings during the
analysis process. Our visual analytics system further facilitates real-time analysis using speculative
execution on a time-intensive similarity search algorithm, enabling online exploration and execution of the
offline algorithm.

Summarizing, we present a visual analytics system for similarity search, providing the following main
contributions: (1) Our system supports the understanding of results and parameters, emphasizing the most
critical data characteristics by mapping similarity to spatial distance and highlighting communities of
similar attribute combinations. (2) Our system enables the diagnosis of results and parameters by allowing
the real-time interactive exploration of the parameter space to investigate the influence of parameter
choices, enabled by the speculative execution. (3) Our system supports the refinement of the involved
parameters, supporting the iterative guided optimization of the model to solve a given analysis task.

2    RELATED WORK
To cover the various involved research domains and applications, we structure our related work into
sub-topics, each summarizing the most relevant works regarding one aspect of our approach.

Visual Analytics Foundations: Similarity searches in large database systems are often automatically executed
using predefined similarity functions and distance measures. However, user-adaptable similarity search
applications increase in importance, and user integration rises [17]. Visual analytics combines automated
analysis techniques with interactive visualizations to enable users to understand and reason about large
datasets [6]. Sacha et al. [14] have presented a knowledge generation model that describes how knowledge is
generated during the analysis process, building upon prior methodologies in visual analytics [2, 12]. Besides
the computer system that visualizes and models data, they describe the human as a core element whose
creativity, interaction abilities, and perception help find and comprehend patterns hidden in the data.

Weight Space Exploration: As visual analytics is concerned with integrating human knowledge with automated
machine learning, it is frequently used for model exploration and optimization. Sedlmair et al. [16] provide
a conceptual framework of visual parameter space analysis, structuring the design space. Pajer et al. [10]
present a tool for the visual analysis and exploration of weight spaces, tackling the problem of setting
abstract weight parameters. Their tool supports the understanding of sensitivity and helps identify weight
regions of interest for a desired output. Mühlbacher et al. [9] present TreePOD, a sensitivity-aware approach
to selecting Pareto-optimal decision trees. In contrast to most existing work, we tackle the exploratory
analysis of similarity queries and rely on the analyst's intuition rather than on quality metrics.

Parameter Optimization for Mining Models: Parameter optimization for data mining systems or hyper-parameter
optimization in machine learning is an open problem that frequently occurs in scientific or industrial
use-cases. Analytic optimization or exhaustive search for parameter optimization is often impossible in these
models due to black-box methods or high-dimensional parameter spaces. Torsney et al. [22] apply a guided
semi-automatic method to this problem by first sampling from the parameter space and then guiding the user by
estimating the effects of parameter changes on the result. Schall et al. [15] propose a heat-map method to
superimpose the prediction of a deep neural network over its input image. This allows the model engineer to
identify problems in the prediction and tune the hyper-parameters accordingly. The resulting workflow is
iterative and guided by the provided visualization. This method is applied to offline handwriting
recognition, where spatial information is essential but not available in ground-truth data.

Speculative Execution and Guidance: Sperrle et al. [18] present an adaptation of speculative execution for
visual analytics to support exploratory model analysis and optimization. Inspired by speculative execution in
CPUs, they define it as "the proactive, near-real-time computation of competing model alternatives" to
support model state-space exploration. Our system uses speculative execution to execute queries automatically
using adapted weights, serving two purposes: First, speculatively preparing those results while the system
would otherwise idle enables a near-real-time analysis of related parameter configurations. Second, our
system compares all obtained results and guides the user in their exploration by visually highlighting
alternative feature weights that produce significantly different results. In recent years, such guidance has
been identified as one of the main challenges in visual analytics [3, 4], characterized by user and machine
teaching each other while mutually learning from each other [19]. Such guidance enables a more efficient
human-machine collaboration and paves the way towards true mixed-initiative [1] systems.

Application Background: Related to our application, we focus on work for similarity search on heterogeneous
data collections. Gionis et al. [5] tackle the curse of dimensionality for search in high-dimensional
attribute spaces by hashing data entities and performing an approximate nearest-neighbor search on the
hashes. Sun et al. [21] present a metapath-based search algorithm, deriving similarity from linkage paths in
the network, addressing the advent of heterogeneous information networks. Patroumpas and Skoutas [11] frame
the problem as search on enriched, geographical data, i.e., geospatial attributes with additional textual,
numerical, or temporal information. Our approach builds upon their work, tackling the open challenge of
user-centered model optimization.

3    THE SIMILARITY SEARCH SYSTEM
While search is an essential tool to locate entities of interest in large data foundations, it has
significant limitations when the data distribution is unknown and, hence, explorative access to the data is
required. Specifically, the exact attribute combination of the results might not be known beforehand, or
multiple entities in a particular region might be of interest. The used similarity search (SimSearch)
algorithm [11] fulfills these requirements by considering entities that feature attribute combinations close
to the desired search parameters. By specifying the number 𝑘 of ranked closest matches, the analyst can
explore the region of interest and refine the search parameters according to the analysis goal. The
high-dimensional search space poses particular challenges for the visual representation of the results:
pair-wise distances between entities and the root search have to be considered, as well as the influence of
each single search attribute.

The variety of data types and domains that might occur in the data attributes requires the concurrent use of
different distance
[Figure 2 diagram: SimSearch Model, SimSearch VA Backend (b), SimSearch Workspace (a). Labels: DB, SQL,
Relational Data, Similarity Search Engine, POST, Cache, Redis, Project, Transform, (1) Tabular Ranked
Results, (2) Cross Similarity Matrix, Parallel, speculative execution.]


Figure 2: The SimSearch visual analytics system’s architecture, split into frontend (a) and backend (b) applications. Search
queries are issued to the SimSearch engine, which returns (1) a table of the top-𝑘 ranked results and (2) a 𝑘 × 𝑘 cross-
similarity matrix, encoding the pair-wise similarities between entities. The results are cached, filtered, projected, and
transformed by the SimSearch visual analytics backend before delivering them to the SimSearch workspace frontend.
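
The caching and speculative-execution paths of Figure 2 can be illustrated with a small Python sketch. It is
not the system's implementation: the names `SpeculativeCache`, `neighbours`, and `rank_changes`, the fixed
weight step of 0.1, and the synchronous execution are assumptions made for illustration, and `search` stands
in for a call to the similarity search engine. The sketch only shows the idea of querying weight
configurations adjacent to the current one ahead of time, so that a later user action can be served from the
cache, and of deriving the rank change of each entity between two result lists.

```python
from typing import Callable, Dict, List, Tuple

Weights = Tuple[float, ...]
Ranking = List[str]  # entity ids, best match first


def neighbours(weights: Weights, step: float = 0.1) -> List[Weights]:
    """Weight vectors that differ from `weights` in one attribute by -step or +step."""
    variants = []
    for i, w in enumerate(weights):
        for delta in (-step, step):
            # Weights live in [0, 1]; clamp, and round to obtain stable cache keys.
            adjusted = round(min(1.0, max(0.0, w + delta)), 6)
            if adjusted != w:
                variants.append(weights[:i] + (adjusted,) + weights[i + 1:])
    return variants


class SpeculativeCache:
    """Hypothetical cache in front of a slow search function (assumption)."""

    def __init__(self, search: Callable[[Weights], Ranking]):
        self._search = search
        self._cache: Dict[Weights, Ranking] = {}
        self.engine_calls = 0  # number of cache misses forwarded to the engine

    def query(self, weights: Weights) -> Ranking:
        if weights not in self._cache:
            self.engine_calls += 1
            self._cache[weights] = self._search(weights)
        return self._cache[weights]

    def speculate(self, weights: Weights, step: float = 0.1) -> None:
        """Pre-compute the results for all adjacent weight configurations."""
        for variant in neighbours(weights, step):
            self.query(variant)


def rank_changes(current: Ranking, speculative: Ranking) -> Dict[str, int]:
    """Rank change per entity; positive means the entity moves up.

    Entities that drop out of the speculative result are omitted.
    """
    position = {entity: rank for rank, entity in enumerate(speculative)}
    return {e: r - position[e] for r, e in enumerate(current) if e in position}
```

In a deployed system, `speculate` would run in the background or in parallel while the interface is idle; the
synchronous loop here only keeps the sketch short.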

functions, rendering an objective comparison between the obtained distances impossible. For example, a
geospatial attribute might have a real-world geographical distance function associated, while a numeric
attribute could, for example, have a logarithmic distance function defined. Figure 4a illustrates the
non-comparability of those two measures in a two-axis plot. The similarity search algorithm allows specifying
weight parameters in the interval [0, 1] to balance the distance functions between different attributes,
tackling this problem. Figure 4b illustrates how applying weights can scale the search space accordingly.

However, no objective can be optimized to automatically determine the ideal set of weights for a query since
it heavily depends on the data domain and the analysis task, rendering human feedback essential for parameter
optimization.

We, therefore, identify three fundamental challenges: (1) the high-dimensional and interconnected results
must be presented such that the analyst understands their meaning, mapping similarity to the spatial
distances in the visualization, (2) the analyst must understand the influence of the parameters on the
results, and (3) the interactive exploration of the parameter space must be possible to refine the parameters
targeting the analysis goal.

Our proposed visual analytics system makes the similarity search model accessible in a comprehensive
workspace, combining different views and panels to address the identified challenges.

3.1     The Similarity Search Backend
To avoid computationally-, time-, and storage-expensive operations in the frontend, our implementation splits
the SimSearch system into frontend (2a) and backend (2b). The backend interfaces with the similarity search
model, which is exposed via a REST API. The result of a request to the SimSearch API consists of (1) a ranked
list of the top-𝑘 similar results together with (2) a 𝑘 × 𝑘 cross-similarity matrix, denoting the pair-wise
similarities between every two entities. The raw results are cached by the backend application for later
search queries with similar parameters. The results are then transformed from the 𝑛-dimensional attribute
space down to the two-dimensional screen space and converted into a graph representation using a specified
projection algorithm. We include different projection methods to achieve good results for varying search
attributes and input parameters: for low-dimensional searches (𝑛 ≤ 4), the system supports PCA and MDS, based
directly on the attribute values or the cross-similarity matrix, respectively. For higher-dimensional
searches, UMAP [8] can provide fast and stable projections highlighting connections in the data while
preserving its global topology. The decisive criterion for choosing the provided projection methods was their
ability to derive a stable transformation under a changing set of input vectors. The cross-similarity matrix
is filtered for its top 𝑘 values and converted into a list representation to reduce network load and
computational complexity in the frontend. Both results, the projected graph and the cross-similarity list,
are then cached and returned to the frontend application. Figure 2 shows the architectural details of the
system, including data paths, caching, and the applied data transformations.

Caching Strategy and Requirements: The cache has a crucial impact on the system's responsiveness, requiring
the caching strategy to obtain the best possible balance between data freshness and system performance. Since
this choice is strongly dependent on the frequency with which data evolution events occur in the data
foundation, we tackle this challenge by occasionally querying the similarity search engine despite the
results already being present in the cache. Since this strategy triggers a request of multiple similar
parameter combinations, the results in the local search space are updated, maximizing the probability of
future cache hits with the most recent data entities. The cache's required storage space is negligible since,
in a typical scenario, 𝑘 = 50 can be taken as a reasonable upper bound for the top-𝑘 results of interest. The
storage consumption for a query grows linearly with 𝑘, except for the 𝑘 × 𝑘 cross-similarity matrix, which
grows quadratically. Taking the upper bound of 𝑘 = 50, we can estimate the matrix's storage consumption per
cached query as 50 · 50 · 64 bit = 160 000 bit ≈ 20 kB.

3.2    The Similarity Search Workspace
To allow the interactive analysis of the SimSearch algorithm's results and enable informed decision making
during the parameter tuning process, our proposed similarity search workspace combines multiple components in
a comprehensive user interface, shown in Figure 1.

Central Projection View (1a): After defining a search query and receiving the similarity search engine
results, the analyst must understand (1) the connection between results and root query and (2) the pair-wise
relationship between the results. The SimSearch workspace is built around a central projection view, mapping
the 𝑛-dimensional data points to the two-dimensional screen space while preserving the distances between
entities as well as possible. The search attributes are projected as an additional, virtual entity to set the
result entities into relation with the specified search parameters.
Besides the spatial position of entities in the result space, the pair-wise relation between entities is
essential to interpret connections and reveal proximities in the data that the projection could not preserve.
Therefore, we indicate these relations by extracting the top 𝑘 values from the cross-similarity matrix and
displaying them as links between the respective entities. The edges' line width is proportional to the
similarity between two entities, visually highlighting the most important connections.

Important information for each entity is attached directly to the projected node: the similarity rank is
annotated persistently on each node, while the exact attribute combinations and similarity scores for each
attribute can be displayed by hovering over an entity either in the projection view or in the tabular results
view. By coloring the results according to their spatial position in the projection using a two-dimensional
colormap, the entities are visually clustered and linked to the tabular results view.

Besides displaying the inter-linkage, we also apply k-means clustering to the projected points, reducing
visual clutter by forming local groups and highlighting results with spatial proximity in the projection
space instead of the attribute space. While the cross-similarity would ideally correspond with the k-means
clusters in the 𝑛-dimensional space, this is not valid for the projected entities since not all information
can be preserved in the projection. Therefore, the clustered entities can share similar attributes, which, at
the same time, might diverge from the most similar entities denoted by the cross-similarity matrix. That is,
entities might be close in only a subset of their attributes, causing them to be assigned to the same
cluster, while the total similarity across all attributes might be vanishing, preventing their
cross-similarity link from being strong enough to be displayed.

Tabular Result View (1b): Complementing the projection view, we include the tabular results view in the
SimSearch workspace, showing the ranked entities together with their attribute set and the corresponding
similarity scores. The table's rows are linked to the nodes in the projection view, simultaneously
highlighting a specific node in both views on mouse hover. By clicking the table header for one attribute
column, the column can be re-ordered according to its contained values, enabling the direct comparison
between the individual similarity scores for each attribute.

Parameter Designer (1c): The parameter designer is the primary interface for specifying and refining search
queries, projection settings, and weight parameters. Search attributes can be added from a list of all
available attributes in the dataset, allowing the analyst to set a target value for each selected parameter.
A slider attached to each attribute enables the analyst to set the attribute's relative importance concerning
all other defined attributes, giving full control over the balance between attributes and their corresponding
distance function.

To diagnose the weight parameters' influence on the result set, hovering over a weight slider triggers the
projection and tabular result view to switch to the speculative execution state. In the speculative execution
state, the views indicate the change in the result set under a speculative decrease and increase of the
respective attribute weight. In the projection view, this is done by inserting the possible new positions of
the entities under the changing projection, marking the results under a positive weight adjustment with a red
outline and the results under a negative weight adjustment with a green outline. Complementing the
projection, the tabular results view is extended by two additional columns, indicating the change in each
result entity's rank and marking entities that are descending from the top 𝑘 results, causing them to lose
their place in the table.

Time-consuming search operations are executed speculatively before an actual user interaction is performed,
enabling the iterative refinement of search parameters. When the user performs an action, and the resulting
parameter combination causes a cache hit, the results can be delivered and visualized in real-time. Besides
the increase in responsiveness, more and more discrete samples of the local search space are present in the
cache with the ongoing analysis process. By setting the frequency of an entity in the latest result sets into
relation with the total number of results, we derive a measure for an entity's stability over the changing
search parameters, as shown in Figure 3. The stability is then mapped to the node size in the projection
view, with larger nodes indicating entities that appear more frequently in the recent result sets.

[Figure 3 labels: Search A, Search B, Search C]
Figure 3: Cached subsets of the search space covered by three consecutive searches with slightly changed
parameters. The stability of the central entities is maximal, while the stability for the border-cases
vanishes.

4     USE-CASES
We show the applicability and advantages of the proposed SimSearch workspace based on two exemplary
use-cases. The first use-case (subsection 4.1) is hands-on and describes in detail how our proposed system
can be used to reach the analysis goal, while the second use-case (subsection 4.2) demonstrates how our
system can be applied to varying tasks and domains.

4.1    Assessing the Local Business Landscape
This use-case is based on a real-world, large-scale (≈ 120 GB) dataset containing information about companies
in Italy.

In the use-case, a small company with ≈ 50 employees plans to expand, for which several potential new
locations are considered. Since the company is dependent on the local infrastructure and other supplying
companies, geographical proximity to those companies is an essential requirement. Simultaneously, the company
wants to avoid direct local competition through other companies working in the same sector and having a
similar corporate structure. Our proposed SimSearch workspace supports the search and interactive exploration
of the potential company locations to fulfill the company's requirements.

By specifying the attribute combinations in the parameter designer according to the desired or undesired
company profiles together with the considered company location, the local search space can be explored. The
projection view reveals the most similar companies and indicates their pair-wise relationships, revealing
communities and enabling the analyst to assess the most influential search attributes. In doing so, it
becomes clear that the geolocation only has marginal influence on the search results, and the shown companies
are too far away for a business relationship.

Since the numerical search attributes, such as the number of employees, cannot be objectively compared to the
geospatial
                                                                                                       The embedding method will be chosen to reflect the expert’s do-
                                                                                                       main knowledge of semantically different and similar documents.
                                                                                                       Cross-similarities will show potential miss-classifications. This


[Figure 4 plots: panel (a) with axes ∆ geolocation and                                                 allows adjusting the weights of the similarity search to increase
∆ num_employees; panel (b) with axes w0 · ∆ geolocation and                                            the similarity to semantically relevant documents and separate
w1 · ∆ num_employees. Both panels show s(c0,c1), s(c0,c2),                                             them from semantically distinct ones.
and s(c1,c2).]
                                                                                                       5   DISCUSSION AND FUTURE WORK
                                                                                                       While the presented similarity search workspace implements a
                                                                                                       variety of features and techniques to make the data search space
Figure 4: Similarity search results $\{c_0, c_1, c_2\}$ and cross-                                     and the model parameter space accessible by the analyst, possi-
similarities $s(c_a, c_b)$ with $a, b \in \{0, 1, 2\}$. Search attributes                              ble extensions could further strengthen the system’s usefulness.
originating from different data domains render an objec-                                               Such extensions could include improvements to the search func-
tive comparison of the similarity scores impossible (a). By                                            tionality and the explanation of results or the implementation
applying weightings $\{w_0, w_1\}$, the analyst can adjust the dis-                                    of advanced guiding techniques. Furthermore, the presented ap-
tance functions according to their domain knowledge (b).                                               proach could be generalized to other domains and tasks with
                                                                                                       a similar problem setting, i.e., where high-dimensional result
company location, the weights in the parameter designer have                                           entities of complex mining models have to be visualized, and the
to be iteratively refined to match the analyst’s understanding of                                      model must be refined to match a particular analysis task.
each attribute’s desired influence on the results. Figure 4 shows                                      Extending the Search Functionality — Additional views could
how the weight adjustment helps to balance the different dis-                                          augment the existing visualizations with an abstract overview of
tance functions. By indicating the changes in the result set for a                                     possible actions and the resulting changes, enabling the analyst
possible weight adjustment, the analyst can exploit the system’s                                       to identify possible changes at first glance before descending into
speculative execution feature to observe changes in real-time                                          detailed views. For example, an additional view visualizing all
and assess the most purposeful operation before the actual, time-                                      possible weight combinations probed by the speculative execu-
consuming execution. Using the tabular results view, the analyst                                       tion component and their likely outcomes could provide first
can verify the possible changes in detail by observing how each                                        hints where the region of interest might be located. Additional
attribute’s ranking would change under the operation or if the                                         interestingness measures could augment the parameter designer’s
company would be excluded from the result set. By iteratively                                          weight sliders with information on the intervals corresponding
refining the search parameters, the analyst can explore the search                                     to the most significant changes in the result set. Extending the
space ideally for each potential location, leading to well-informed                                    interestingness feature, decision boundaries could be estimated
decision making for the new location.                                                                  by probing the search space in regions with a high gradient,
                                                                                                       providing a sensitivity analysis for each parameter.
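The weight adjustment of Figure 4 and the speculative rank changes shown in the tabular results view can be sketched as follows. This is a minimal illustration, not the system's implementation: it assumes that the overall score is a weighted sum of per-attribute distances (the underlying search algorithm [11] aggregates ranked lists instead), and the names `DISTANCES`, `rank`, `speculate`, and `interestingness`, as well as all numeric values, are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical per-attribute distances of three result candidates to the
# root query, assumed to be normalised to [0, 1]. Values are illustrative.
DISTANCES: Dict[str, Dict[str, float]] = {
    "c0": {"geolocation": 0.10, "num_employees": 0.80},
    "c1": {"geolocation": 0.60, "num_employees": 0.20},
    "c2": {"geolocation": 0.90, "num_employees": 0.10},
}


def rank(weights: Dict[str, float], k: int) -> List[str]:
    """Top-k candidates, ordered by ascending weighted sum of distances."""
    def score(candidate: str) -> float:
        return sum(weights[attr] * dist
                   for attr, dist in DISTANCES[candidate].items())
    return sorted(DISTANCES, key=score)[:k]


def speculate(weights: Dict[str, float], attribute: str,
              delta: float, k: int) -> Dict[str, Optional[int]]:
    """Rank change of each current result if one weight moved by `delta`.

    Positive: the entity moves up; negative: it moves down;
    None: it would drop out of the top-k result set.
    """
    current = rank(weights, k)
    # Assumption: weights are clamped at zero when probing a decrease.
    probed = {**weights, attribute: max(0.0, weights[attribute] + delta)}
    speculative = rank(probed, k)
    return {
        entity: (old_pos - speculative.index(entity)
                 if entity in speculative else None)
        for old_pos, entity in enumerate(current)
    }


def interestingness(changes: Dict[str, Optional[int]]) -> int:
    """Number of current results whose rank changes or that are dropped."""
    return sum(1 for change in changes.values() if change != 0)


if __name__ == "__main__":
    weights = {"geolocation": 0.5, "num_employees": 0.5}
    for delta in (-0.4, 0.4):
        changes = speculate(weights, "geolocation", delta, k=2)
        print(delta, changes, interestingness(changes))
```

Probing several values of `delta` per slider and comparing the resulting `interestingness` counts would give a rough, per-parameter sensitivity estimate of the kind discussed above.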
4.2             Mail Forwarding                                                                        Extending Guidance — The system currently provides orient-
This use-case is based on internal mail forwarding within a large                                      ing guidance to users alerting them to similar weight config-
company. Incoming postal mail is automatically opened and dig-                                         urations that produce significantly different search results. In
itized, using an OCR system, on arrival at the company head-                                           addition to highlighting different possible weight settings, the
quarters. The digitized mail item is then used as a search query                                       system could actively propose user actions like moving weight
on a structured database of the company’s customers, contracts,                                        sliders or switching to different projection methods. By analyzing
products, or projects to electronically forward the scanned docu-                                      sliders or switching to different projection methods. By analyzing
ment to the staff responsible for the task that requires the                                           and learning from user interactions, the system could identify
document. This use-case requires both a robust search engine for                                       understanding of the domain and analysis task. By giving the
retrieving database entries (e.g., contracts) containing keywords,                                     system more initiative in the exploration process, the system
names, or numerical values similar to the query document and                                           should become both more effective and efficient to use.
semantic understanding of the content to weigh these attributes.
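The orienting guidance mentioned under Extending Guidance, which alerts users to similar weight configurations with significantly different results, could be approximated as sketched below. The text does not state how the system measures either notion, so this sketch makes two assumptions: closeness of configurations is the largest per-weight difference, and result change is the Jaccard distance between result sets. The function names, the cache layout, and both thresholds are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Weights = Tuple[float, ...]


def result_change(a: Sequence[str], b: Sequence[str]) -> float:
    """Jaccard distance between two result sets (0 = same, 1 = disjoint)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 1.0 - len(set_a & set_b) / len(union) if union else 0.0


def unstable_pairs(cache: Dict[Weights, List[str]],
                   max_weight_gap: float,
                   min_change: float) -> List[Tuple[Weights, Weights]]:
    """Pairs of cached weight configurations that lie close together
    (largest per-weight difference <= max_weight_gap) but whose result
    sets differ by at least min_change."""
    flagged = []
    for (w_a, res_a), (w_b, res_b) in combinations(cache.items(), 2):
        gap = max(abs(x - y) for x, y in zip(w_a, w_b))
        if gap <= max_weight_gap and result_change(res_a, res_b) >= min_change:
            flagged.append((w_a, w_b))
    return flagged


if __name__ == "__main__":
    # Hypothetical cache of speculatively executed queries:
    # weight configuration -> identifiers of the returned results.
    cache = {
        (0.5, 0.5): ["c1", "c0"],
        (0.6, 0.5): ["c1", "c0"],
        (0.5, 0.4): ["c2", "c3"],
    }
    print(unstable_pairs(cache, max_weight_gap=0.15, min_change=0.5))
```

Comparing every pair is quadratic in the number of cached configurations, which seems acceptable for the handful of sampling points a speculative execution component would probe per slider.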
                                                                                                       Generalization as Visual Analytics Technique — There are
    The structured data in the database consists of categorical
                                                                                                       several other problems in automated data mining pipelines with
attributes, person or item names, as well as spatial, temporal, nu-
                                                                                                       the same or a similar structure as the similarity search appli-
merical, or general ontological values. These may occur within
                                                                                                       cation addressed in the presented system, such as clustering,
the scanned document with different individual similarities as
                                                                                                       classification, or graph merging. Specifically, our approach can
well as in many different combinations. Thus, the need arises
                                                                                                       be generalized to understand, diagnose, and refine models where
to weigh these database attributes against each other to model
                                                                                                       (1) the result is a number of 𝑛-dimensional entities with arbitrary
the overall semantic similarity. This configuration of the search
                                                                                                       distance functions associated, and (2) the outcome depends on a
query is likely done by a human engineer with expert domain
                                                                                                       set of parameters whose influence on individual results is opaque.
knowledge. One approach here is to use a set of example doc-
uments for evaluation, repeatedly querying for them and                                                Scalability — The system’s scalability is directly dependent on
modifying the attribute weights until relevant database entries                                        the underlying similarity search algorithm. Despite implementing
are found with high overall similarity to the query document,                                          various techniques (caching, speculative execution) to enable
with less relevant entries being significantly dissimilar.                                             interactive visual analytics on the offline search algorithm, the
    We propose to use SimSearch in this process to both see the                                        similarity search model’s response time is the limiting factor for
overall similarity of the different database entries using the cur-                                    the approach. While response times of 1–30 s can be bridged
rent configuration and identify clusters in the embedding space.                                       by applying the implemented techniques, longer response times
render an online analysis increasingly difficult since (1) non-         are coupled with the parameter refinement functionality, inte-
ideal sampling points might have been chosen for speculative            grating the speculative results into their visual representation.
execution or (2) the analyst might change the search space context         We show our proposed similarity search workspace’s applica-
more rapidly than results can be preemptively queried and cached.       bility and usefulness based on two use-cases, both anchored in
The response times of the similarity search algorithm could be          real-world application examples and datasets.
reduced by parallelizing the main stages of the algorithm, namely
(1) generating a ranked list of results for each queried attribute      ACKNOWLEDGEMENTS
and (2) compiling the ranked lists into a list of top-𝑘 results [11].   This work has received funding from the European Union’s Hori-
Limitations and Future Work — Currently, views of higher                zon 2020 research and innovation programme under grant agree-
abstraction giving the analyst reference points on promising anal-      ment No 825041.
ysis directions are missing. We will tackle this issue by adding a
third view to the similarity search workspace, showing all pos-         REFERENCES
                                                                         [1] James F. Allen, Curry I. Guinn, and Eric Horvitz. 1999. Mixed-initiative
sible weight combinations in a matrix view and indicating the                interaction. IEEE Intelligent Systems and their Applications 14, 5, 14–23.
regions of the highest expected result change. Currently, the an-        [2] Matthew Brehmer and Tamara Munzner. 2013. A multi-level typology of
alyst has to evaluate the speculative changes in results manually            abstract visualization tasks. IEEE Trans. on Vis. and Comput. Graphics 19, 12,
                                                                             2376–2385.
by observing the predicted outcomes and comparing them across            [3] Davide Ceneda, Theresia Gschwandtner, and Silvia Miksch. 2019. A Review
the different parameter combinations. In future versions, we will            of Guidance Approaches in Visual Data Analysis: A Multifocal Perspective.
automatically highlight regions of interest using the number                 Comput. Graphics Forum 38, 3, 861–879.
                                                                         [4] C. Collins, N. Andrienko, T. Schreck, J. Yang, J. Choo, U. Engelke, A. Jena, and
of changes in the result set for each combination as an inter-               T. Dwyer. 2018. Guidance in the human–machine analytics process. Visual
estingness measure. This functionality will be strengthened by               Informatics 2, 166–180.
                                                                         [5] Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search
implementing interactive, adaptive guidance. If one operation
                                                                             in High Dimensions via Hashing. In Proc. of the 25th Intl. Conference on Very
has significantly higher interestingness than others, it will be             Large Data Bases (VLDB ’99). San Francisco, CA, USA, 518–529.
actively proposed as a possibly rewarding action. Furthermore,           [6] Daniel Keim, Jörn Kohlhammer, Geoffrey Ellis, and Florian Mansmann. 2010.
                                                                             Mastering The Information Age – Solving Problems with Visual Analytics. Euro-
by tracking recent interactions of the user with the system, we              graphics Association.
will estimate the likelihood of future interactions based on the         [7] Sean D MacArthur, Carla E Brodley, Avinash C Kak, and Lynn S Broderick.
history, adapting the guidance to user preferences. Despite the              2002. Interactive content-based image retrieval using relevance feedback.
                                                                             Comput. Vision and Image Understanding 88, 2, 55–75.
presented use-cases proving our approach’s applicability in dif-         [8] Leland McInnes, John Healy, and James Melville. 2018. UMAP: Uni-
ferent real-world scenarios and data domains, a future user study            form Manifold Approximation and Projection for Dimension Reduction.
                                                                             arXiv:stat.ML/1802.03426
will further validate the system’s usefulness and provide insights       [9] T. Mühlbacher, L. Linhardt, T. Möller, and H. Piringer. 2018. TreePOD:
on both benefits and open challenges. Besides measuring quanti-              Sensitivity-Aware Selection of Pareto-Optimal Decision Trees. IEEE Trans. on
tative criteria, such as task completion time and comparing the              Vis. and Comput. Graphics 24, 1, 174–183.
                                                                        [10] S. Pajer, M. Streit, T. Torsney-Weir, F. Spechtenhauser, T. Möller, and H. Piringer.
analysis results to ground-truth data, an additional qualitative             2017. WeightLifter: Visual Weight Space Exploration for Multi-Criteria Deci-
evaluation will expose additional user requirements and future               sion Making. IEEE Trans. on Vis. and Comput. Graphics 23, 1, 611–620.
points for improvements of the system.                                  [11] Kostas Patroumpas and Dimitrios Skoutas. 2020. Similarity search over en-
                                                                             riched geospatial data. In Proc. of the Sixth Intl. ACM SIGMOD Workshop on
                                                                             Managing and Mining Enriched Geo-Spatial Data. ACM.
                                                                        [12] Peter Pirolli and Stuart Card. 2005. The sensemaking process and leverage
6   CONCLUSION                                                               points for analyst technology as identified through cognitive task analysis. In
                                                                             Proc. of Intl. Conference on Intelligence Analysis, Vol. 5. McLean, VA, USA, 2–4.
Applying complex data mining models to large data foundations           [13] Dominik Sacha, Michael Sedlmair, Leishi Zhang, John A Lee, Jaakko Peltonen,
introduces particular challenges to the analysis process. Both the           Daniel Weiskopf, Stephen C North, and Daniel A Keim. 2017. What you see
                                                                             is what you can change: Human-centered machine learning by interactive
parameter space and the search space might be opaque, requiring              visualization. Neurocomputing 268, 164–175.
manual probing to approach the regions of interest and, hence,          [14] Dominik Sacha, Andreas Stoffel, Florian Stoffel, Bum Chul Kwon, Geoffrey
                                                                             Ellis, and Daniel A. Keim. 2014. Knowledge Generation Model for Visual
rendering an interactive exploration impossible. By applying visual          Analytics. IEEE Trans. on Vis. and Comput. Graphics 20, 12, 1604–1613.
analytics, the models, parameters, and results can be made accessible   [15] Martin Schall, Dominik Sacha, Manuel Stein, Matthias O Franz, and Daniel A
through interconnected visualizations, revealing hidden connec-              Keim. 2018. Visualization-assisted development of deep learning models in
                                                                             offline handwriting recognition. In Symp. on Vis. in Data Science at IEEE VIS.
tions between components and providing advanced mechanisms,             [16] M. Sedlmair, C. Heinzl, S. Bruckner, H. Piringer, and T. Möller. 2014. Visual
such as speculative execution, to enable the real-time exploration           Parameter Space Analysis: A Conceptual Framework. IEEE Trans. on Vis. and
of otherwise time-consuming data processing pipelines.                       Comput. Graphics 20, 12, 2161–2170.
                                                                        [17] Thomas Seidl and Hans-Peter Kriegel. 1997. Efficient user-adaptable similarity
    The presented system implements views and techniques to                  search in large multimedia databases. In VLDB, Vol. 97. 506–515.
make the parameters and results of a novel similarity search            [18] Fabian Sperrle, Jürgen Bernard, Michael Sedlmair, Daniel Keim, and Menna-
                                                                             tallah El-Assady. 2019. Speculative Execution for Guided Visual Analytics.
algorithm accessible to the analyst. Specifically, we provide a              arXiv:cs.HC/1908.02627v1
projected view of the search results, highlighting the similarity       [19] Fabian Sperrle, Astrik Jeitler, Jürgen Bernard, Daniel A. Keim, and Mennatallah
to the root query, the pair-wise similarity between the result en-           El-Assady. 2020. Learning and Teaching in Co-Adaptive Guidance for Mixed-
                                                                             Initiative Visual Analytics. In EuroVis Workshop on Visual Analytics (EuroVA),
tities, the stability of the results, as well as communities of close        K. Vrotsou and C. Turkay (Eds.). The Eurographics Association.
entities. The projected view is complemented with and linked to         [20] Thilo Spinner, Udo Schlegel, Hanna Schäfer, and Mennatallah El-Assady. 2019.
a tabular view of the results, indicating their rank and providing           explAIner: A visual analytics framework for interactive and explainable ma-
                                                                             chine learning. IEEE Trans. on Vis. and Comput. Graphics 26, 1, 1064–1074.
sorting functions on distinct attributes or their corresponding         [21] Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S. Yu, and Tianyi Wu. 2011. Path-
similarity. Supporting parameter refinement and search space                 Sim: Meta Path-Based Top-K Similarity Search in Heterogeneous Information
                                                                             Networks. Proc. of the VLDB Endowment 4, 11, 992–1003.
exploration, the system implements speculative execution on the         [22] Thomas Torsney-Weir, Ahmed Saad, Torsten Möller, Hans-Christian Hege,
time-consuming similarity search operation, presenting the user              Britta Weber, Jean-Marc Verbavatz, and Steven Bergner. 2011. Tuner: Prin-
with possible outcomes of parameter changes on-demand before                 cipled parameter finding for image segmentation algorithms using visual
                                                                             response surface exploration. IEEE Trans. on Vis. and Comput. Graphics 17, 12,
actually performing an action. The projection and tabular views              1892–1901.