=Paper=
{{Paper
|id=Vol-3630/paper08
|storemode=property
|title=Database and Workflow Optimizations for Spatial-Geometric Queries in GeoMine
|pdfUrl=https://ceur-ws.org/Vol-3630/LWDA2023-paper8.pdf
|volume=Vol-3630
|authors=Martin Poppinga,Joel Graef,Konrad Diedrich,Matthias Rarey,Norbert Ritter
|dblpUrl=https://dblp.org/rec/conf/lwa/PoppingaGDRR23
}}
==Database and Workflow Optimizations for Spatial-Geometric Queries in GeoMine==
Database and Workflow Optimizations for
Spatial-Geometric Queries in GeoMine
Martin Poppinga1,2 , Joel Graef2 , Konrad Diedrich2 , Matthias Rarey2 and
Norbert Ritter1
1
Universität Hamburg, Fachbereich Informatik, 22527 Hamburg, Germany
2
Universität Hamburg, ZBH – Center for Bioinformatics, 20146 Hamburg, Germany
Abstract
Addressing computational problems in science often involves customized algorithmic approaches, which
can lead to overlooking well-established solutions in data management and storage. When scientific
datasets grow, these customized approaches may struggle to query data efficiently. Effective data
management is essential for ensuring accurate and fast analysis of scientific data. Describing changes in
the GeoMine software, this paper highlights the potential for improvements in data-driven science.
GeoMine enables spatial-geometric searches in three-dimensional molecular space, facilitating tasks
such as pharmaceutical drug discovery by finding similar geometric patterns in protein-ligand complexes.
The original GeoMine application utilized a relational database solely for fundamental data storage and
combined it with a tailored algorithmic pattern-matching strategy, leaving room for improvements.
This work presents a technical overview of database and workflow optimizations in GeoMine to handle
the increasing data size. Our improvements focus on moving the main computational tasks from the
application level to the database system and optimizing the database utilization. A new query design,
better utilization of indexes, and optimizations in textual queries led to a 15x speedup in our experiments,
reducing the mean runtime of queries to under 8 seconds.
The presented improvements are essential for GeoMine to be offered as a service-oriented web
application. The success of these improvements highlights the significance of database optimization in
science, demonstrating the potential and necessity of proper data management.
Keywords
Database optimization, query optimization, data management, databases for bioinformatics
1. Introduction
Mining huge datasets is a central task in research. Analyzing molecular interactions between
proteins and small organic molecules is essential for understanding disease treatments and
advancing medical research. This includes searching for spatial similarities and geometric
arrangements, which can provide vital insights into the functional aspects of proteins. Results
can be used for further research, for example, in pharmaceutical drug discovery or biotechnol-
ogy [1]. With the growth of accessible datasets, searching for patterns in this data becomes
LWDA’23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany
$ martin.poppinga@uni-hamburg.de (M. Poppinga); graef@zbh.uni-hamburg.de (J. Graef);
diedrich@zbh.uni-hamburg.de (K. Diedrich); rarey@zbh.uni-hamburg.de (M. Rarey);
norbert.ritter@uni-hamburg.de (N. Ritter)
0000-0001-8529-8376 (M. Poppinga); 0000-0001-8327-4936 (J. Graef); 0000-0001-8171-0888 (K. Diedrich);
0000-0002-9553-6531 (M. Rarey); 0000-0002-1502-1395 (N. Ritter)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR
Workshop
Proceedings
http://ceur-ws.org
ISSN 1613-0073
CEUR Workshop Proceedings (CEUR-WS.org)
CEUR
ceur-ws.org
Workshop ISSN 1613-0073
Proceedings
increasingly challenging [2, 3]. Besides the continuous growth of available experimental data,
machine-learning-based structure predictions add millions of new structural models [4].
GeoMine [3] is an application enabling a visual-guided geometric pattern search of molecular
data in three-dimensional space. It is embedded in the proteins.plus1 server [5], a collection
of different web-based tools for various tasks in protein-based research. The server is a free
service based on publicly available datasets handling over half a million page requests per year.
The back end of GeoMine was derived in prior work from the PELIKAN application developed
in the same group [6], which was utilizing a custom algorithmic approach for query processing.
With the Protein Data Bank (PDB) [7] as a fast-growing dataset underlying GeoMine and the
shift from a desktop application to a server-based approach, GeoMine required an overhaul of
the original query workflow to maintain the ability to provide results in a fast manner.
With this work, we investigate the potential of adopting a database-driven architecture,
focusing on the database as the main part of query execution and reducing application-side
processing. We were able to reduce the mean runtime in our experiments from about 2 minutes
per query to less than 8 seconds, utilizing changes in the workflow and database optimizations.
As we present in this work, a substantial performance enhancement has been achieved by
shifting to a more database-centric method.
The paper is organized as follows: Section 2 provides an overview of the field of work, the
data structure, and the query design; Section 3 details the improvements made to the query
workflow and database optimizations; Section 4 presents and discusses the experimental results;
Section 5 concludes the paper and outlines future work.
2. Background and Related Work
2.1. Data Management and Storage
Data management in scientific research involves the systematic collection, organization, storage,
and sharing of data to facilitate its reusability and ensure the reproducibility of research findings.
In the context of our work, which focuses on querying structured data sets, the storage aspect
is particularly important. In the scientific domain, many existing applications are designed
for single-user usage, often locally storing data in various formats or utilizing object stores
with limited retrieval possibilities [8, 9]. For structured data, Relational Database Management
Systems (RDBMS) are the most commonly used systems, providing robust and efficient solutions.
Commonly, embedded systems are used, such as SQLite [10] for applications with smaller or
medium-sized data sizes or DuckDB [9] for analytical workloads. For Online Transaction Process-
ing (OLTP) workloads which require fast query performance and regular updates, server-based
RDBMS are a popular choice. Large analytical queries are often served by designated Online
Analytical Processing (OLAP) systems such as data warehouses, which are often proprietary
solutions. For handling large-scale semi-structured datasets, NoSQL systems are frequently
used, with columnar and graph databases being popular for analytical queries. The choice
of data management and storage solutions is crucial to ensure efficient processing, reduced
resource consumption, and accurate and fast analysis of scientific data.
1
https://proteins.plus
PostgreSQL GeoMine utilizes PostgreSQL [11], a robust and widely accessible open-source
database management system. As multiple users can access a web-based application such as
GeoMine at the same time, the ability of a client-server-based database system to handle multiple
queries efficiently in parallel is required. PostgreSQL’s widespread adoption [12] enables cloud-
agnostic hosting on every major platform since most cloud platforms offer PostgreSQL solutions
or other PostgreSQL-compatible scalable databases. Additionally, setting up on-premise or local
instances is straightforward. PostgreSQL is suited for OLTP and also OLAP workloads [13]. The
required workloads here can be depicted in the area of OLAP, given the potential complexity
of the designed queries. However, given the use case of an interactive search mask for a web
service, fast responses are a requirement. PostgreSQL’s efficient query planning and extensibility
for additional approaches (e.g., PostGIS [14] for spatial data or Citus [15] for distributed and
columnar storage) make it a suitable foundation for GeoMine’s use case.
2.2. Protein-Ligand Interactions and Binding Pockets
Protein-ligand interactions are of particular interest in biomolecular and pharmaceutical re-
search. Ligands are small molecules that can interact and bind to the generally much larger
proteins. Protein complexes can contain multiple pockets of varying sizes, partly containing
ligands. Drug molecules used as pharmaceuticals are generally designed to target specific
proteins. Researchers can gain valuable insights by investigating specific three-dimensional
structures and searching for potential candidates to bind with these proteins.
Protein Data Bank The PDB [7, 2], established in 1971, is a comprehensive repository
of 3D structural data of proteins and nucleic acids. The structural information is primarily
obtained through experimental methods, predominantly X-ray crystallography, from research
facilities worldwide [2]. As a freely available resource, the PDB has become vital for research in
various fields by providing atomic-scale structural insights for drug design and understanding
biological processes, containing more than 200,000 structures as of April 2023. Further, with the
advantage of Computed Structure Models, which are protein structure predictions, for example,
by AlphaFold2 [4], additional datasets with about 1,000,000 structures are available now [2].
2.3. GeoMine
Discovering similar structures across distinct complexes or finding molecules that bind to a
specific pocket of interest is a major task in medical research. GeoMine is able to construct
comprehensive databases derived from the PDB and supports exploring these databases with a
web-based search interface. [3]
The preprocessing and database creation procedures employ components of the NAOMI
library [16]. For example, pockets are classified in a complex preprocessing pipeline when con-
structing the database [3]. Central components are the DoGSite algorithm [17], which identifies
empty binding pockets within protein structures, and the calculation of interactions [18].
The central part of the search and unique key feature is the ability to specify geometric
properties, for instance, distances and angles between any points, such as atoms. Further,
point properties can be specified, such as an atom’s chemical element and interactions between
points. This way, precise structural motifs (structural patterns) in protein-ligand complexes
can be searched. While GeoMine’s predecessor PELIKAN was a single-user application based
on an integrated SQLite [10] database, the GeoMine back end is aimed at a server-focused
architecture. In the initial development of GeoMine [3], the query execution capabilities of
PELIKAN were extended for new functionality but were not changed in structure to adapt to
the new architecture.
Database Design For our experiments in Section 4, we used a PostgreSQL15 database created
with the PDB dataset from October 2022. For querying the dataset, the database can be considered
read-only. The database requires approximately 165GB of disk space.
For the geometric search, we focus on two tables. The first table, the point table, comprises all
atoms and other definable points, such as the center of aromatic rings. It contains 340,716,693
searchable entries. These points are distributed across 1,382,853 distinct pockets, which serve
as containers for groups of points. The largest pocket identified in our dataset contains 20,306
points, while the smallest pocket only holds 9 points. Each entry in the point table has a unique
identifier, references the containing pocket, and contains various other fields with properties
per point. Some properties, such as the accessible surface area of an atom, are floating point
numbers. Other attributes, such as the chemical element, contain only a few distinct values,
represented as integers or short strings.
The second table, the interaction table, stores pre-calculated interactions [18]. These inter-
actions represent noteworthy connections between two points, for example, hydrogen bonds.
13,018,225 point pairs are stored here.
Query Creation When creating a query, users can specify multiple constraints. The most
fundamental categories encompass Textual and Numerical Searches, wherein metadata filters
at the protein structure or pocket level can be defined. Users can directly pre-select several
structures or create various filters, such as the minimum number of particular chemical elements
or a certain molecular weight range for the ligand. It also enables filtering using patterns that
describe a local environment using the chemical substructure language SMARTS strings [19].
The central search element and origin of GeoMine’s name are geometry-based searches. To
build the query, users may interactively select points in the web front end [21] (see Figure 1),
utilizing an arbitrary PDB file as a template structure or define them without a template.
Users may select an arbitrary number of points, which can be filtered based on different
properties. Moreover, the specification of distance ranges between two points and angles
between specified distances is possible. Further, interactions between points, as stored in the
interaction table, can be added to the query. Together they resemble an atomic substructure,
which will be searched for. Each pocket can be examined individually as the interactions
between one ligand and an individual pocket in a protein are of interest.
Query Execution The initial approach for query execution was first described for the prede-
cessor tool PELIKAN by Inhester et al. [6]. The most significant enhancement for the runtime
in developing the original GeoMine approach — utilizing a PostgreSQL database instead of
SQLite — did not change the workflow of the searching process. The approach remained mostly
(a) View of a pocket (violet mesh) in (b) Specifing points, distances, and
structure 1H1S angles for the query
Figure 1: GeoMine’s three-dimensional view of a binding pocket based on the NGL viewer [20]. Users
can interactively select atoms and other points and specify distances and interactions between them to
generate the query. Here, a pocket around a ligand (bold bonds) is shown, together with the surrounding
atoms of the protein.
algorithmic focused, with all major computational steps performed within the application
(see Figure 2a), as the original PELIKAN software was designed to be a standalone desktop
application. In the original approach of GeoMine [3], four major steps were performed strictly
sequentially for each query to filter the potential results:
1. Textual and Numerical Constraints - A filter eliminates all proteins and pockets that do
not meet specified properties or do not correspond to a given restrictive SMARTS filter.
This step yields a list of all matching proteins and their pockets.
2. Obtaining all point pairs - For each point pair in the query, all possible results are returned,
and distances, as well as interaction constraints, are checked.
3. Clique detection - An algorithm reconstructs the coherent component graph for all
obtained point pairs and checks all defined angle constraints.
4. Less restrictive SMARTS filters for points were applied to the now-generated results.
Steps two and three of the query processing presented particular challenges. All point and
point pair constraints were queried individually in the database. Since a single constraint for a
point pair is often not very specific, it leads to big intermediate results. Only by chaining several
constraints the number of points is sufficiently reduced. The need to cross-verify each point
with all matching points in its pocket demanded significant computational resources, especially
if the filter for the points were unspecific. The list of potential pockets needed to be recreated
for each pair, as only pockets which contained results in prior pair subqueries remained in the
search space. This caused the search to be strictly sequential and required the serialization
and deserialization of long pocket-ID lists for the SQL WHERE clauses. As the application and
database system are separate processes or running on separate servers, the required repeated
transfer of these lists also affected the performance. Because some point-to-point constraints
were specific (less frequent in the dataset) and others were unspecific (frequent in the dataset),
a hand-crafted scoring function was utilized to estimate the best ordering of queries, starting
Textual and Pocket IDs Geometric Filter Textual and Pocket IDs Geometric Filter
Numerical Filter Numerical Filter Angle Filter
Clique
Point To Point Constraint
SMARTS Filter Detection SMARTS Filter
point1 JOIN point2 &
Protein Filter Pocket IDs
Angle Filter Protein Filter Comprehensive Query
SMARTS
Point To Point Constraint
point1 JOIN point2 JOIN Filter
point3 JOIN point4
point2 JOIN point3
Pocket IDs
SMARTS
Point To Point Constraint DB
Filter Results
DB point3 JOIN point4
Pocket IDs
... Results
(b) Improved workflow
(a) Original workflow
Figure 2: The original and the improved processing workflow of GeoMine (Simplified) for a given search
with the most specific queries to reduce the search space early [6]. Although this improved
the join order in many cases, it had the disadvantage of preventing the database system from
executing classical optimizations, such as parallelism and join order optimization.
Further, an additional algorithm was required since the results from the preceding steps
consisted only of point pairs. The Bron-Kerbosch algorithm [22], a graph-based backtracking
algorithm for clique detection, was used. This algorithm recursively verified whether all
discovered point pairs constituted a complete graph and checked for angle constraints. This
demanded substantial computational effort, taking several hours on large potential result sets.
3. Optimizations
This research aims to achieve optimal performance and ease of setup across various envi-
ronments. Alongside the contributions of this work, the application has transitioned to a
containerized setup for cloud environments. The optimizations presented in this work are essen-
tial for facilitating the deployment of a scalable application. In this section, we will distinguish
between the original approach in GeoMine [3] and the improved approach we present in this
work. The yielded results for each query remained identical.
3.1. Optimizing SQL Queries
The most significant change from the original approach was the redesign of the SQL query
generation. Sequential processing of each constraint within a query led to severely limited
query-level parallelism and long processing times as described in Section 2.3. Therefore, all SQL
queries are now designed to make use of PostgreSQL’s internal planning and optimization. In
contrast to the original approach, where each point-to-point constraint was queried separately,
a single comprehensive query containing all attributes and constraints for geometrical patterns
is now constructed, see Figure 2b. This reduces overhead by eliminating the need to repeatedly
serialize extensive lists of pocket IDs or create temporary tables. To achieve this, the point table
joins itself as often as points were specified in the query, usually 5-15 times. As a match occurs
inside a single pocket, we only need to join points within the same pocket. With information
about the distribution of properties like the chemical element, the RDBMS can estimate which
part of the query restricts the search space the most and improve the join order. The original
approach required running the checks on all points within all remaining pockets, not being able
to skip points that were not matched in earlier subqueries. Intermediate results now remain
within the database system and do not require serialization for application transfer. Additionally,
merging all constraints (points, distances, and interactions) into one query eliminates the need
for clique detection, as the output of the RDBMS is a connected and valid result.
Among all the geometric properties, only the angle checking between point pairs remains a
separate step in the application, as this increases the complexity of the query without showing
the benefits of an early reduced search space in our tests. Textual and numerical filters remain in
a separate query to allow prior filtering, as SMARTS patterns require in-application processing.
Allowing the RDBMS to determine the join order and the parallel execution resulted in a
significant speedup of benchmark queries. The results are detailed in Section 4.
3.2. Enhanced Utilization of PostgreSQL Indexes
In the original approach, a single extensive index structure was created, covering 15 out of
17 table columns. Although PostgreSQL allows for the construction of multi-column indexes
with a large number of attributes, these structures are only effective in certain situations due to
their size and depending on the used attributes. However, using multiple single-column indexes
and allowing PostgreSQL to combine them as recommended in the documentation [23] did not
achieve the desired performance improvement.
Only the combination of several attributes could substantially reduce the number of yielded
points. The best-found solution for our workload was a balanced compromise between index
size and utilization, including only the most frequently used columns in a multi-column index.
We identified two separate cases for index usage. Firstly, the earliest scheduled subquery focused
solely on the attributes, disregarding their pocket, in cases without textual and numerical filters.
Secondly, an index for subsequent subqueries was needed to filter for pocket IDs required for the
join. In almost all instances, the optimizer determined to filter for the pocket ID in the second
subquery. In some instances, a parallel index scan was performed. Filtering by the pocket ID
reduced the search space best in these cases since the most restrictive subquery had already
been executed as the first scheduled subquery. Therefore, we introduced a second index with
the pocket identifier positioned first in the index. For both structures, we utilized PostgreSQL’s
default B-Tree index as other index structures seemed not beneficial in our tests. As pockets
usually contain only a few hundred points, spatial indexes, like r-trees provided by PostGIS [14],
did not provide the desired benefits. Filtering points and calculating all distances performed
better in our tests than spatial operations due to the overhead of utilizing a spatial column.
Index creation only needed a few minutes, but additional indexes for specific queries would no
longer fit into the filesystem read cache and reduce performance.
3.3. Improving Text Search
The initial step of the workflow involves filtering structures based on textual and numerical
attributes. These filters target various properties, the most important being the PDB identifiers
used to select a pre-defined or user-defined subset of protein structures. A short alphanumeric
Table 1
Experiment overview. Showing enabled improvments between baseline ex01 and all improvments ex06
Improvement ex01 ex02 ex03 ex04 ex05 ex06 ex07 ex08 ex09 ex10
Index Improvement x x x x x
No Wildcards x x x x x
New Query Design x x x x x
No ILIKE x x x x x
code identifies each structure.
Previously, an SQL ILIKE (case insensitive match) statement with a wildcard match at the
beginning and end of the string was executed to check for the desired properties. For the PDB
codes, we could make two changes. We could discard the wildcards in the query unless explicitly
desired, which enables the utilization of a search index. And as the codes are not case-sensitive,
we can replace the ILIKE with a LIKE, allowing for a case-sensitive search and resulting in a
substantial speedup, as demonstrated in Section 4.
4. Evaluation and Discussion
4.1. Methods
To evaluate the impact of each modification suggested for GeoMine, several experiments were
derived from the original GeoMine approach ex01 (see Table 1). Experiments ex02 to ex05 each
contain only one of the improvements, ex06 contains all improvements, while experiments ex07
to ex10 contain all except one. This way, we show which change impacts the performance most,
as different improvements benefit from each other.
For evaluating the performance across different workloads, we used a set of nine queries
already used in previous work [3], designed to highlight available features, show examples
for common applications and estimate the runtime of different patterns common in GeoMine
practical applications. They emitted between 2 and 7117 results.
We used a PostgreSQL15 database system. All data was stored on an SSD. Unless otherwise
specified, a dedicated server with 400GiB RAM and 80 Cores was used (PostgreSQL 128GB
sharedbuffers, 16 parallel workers). Podman [24] was used to deploy the system. Each experiment
was repeated five times. The GeoMine application was executed on the same node as the
PostgreSQL database. We configured PostgreSQL to utilize less memory than available, as
GeoMine required a high amount of working memory for some workloads. Additionally, we
conducted tests on commodity systems by employing two setups (small/medium) using virtual
servers. Both setups stored data on SSDs and were equipped with 12 cores and 24GB RAM, resp.
18 cores and 48GB RAM.
4.2. Results
Figure 3 shows the mean runtime of the nine test queries for each experiment as depicted in
Table 1. Each change led to better performance, with the highest performance gain occurring
H [ V
H [ V
H [ V
H [ V
( [ S H U L P H Q W
H [ V
4 X H U \
H [ V ( [ S O R L W L Q J B 8 Q X V X D O B , Q W H U D F W L R Q V B R J B V H O H F W L Y L W \ B S H U P L V V L Y H B T X H U \
( [ S O R L W L Q J B 8 Q X V X D O B , Q W H U D F W L R Q V B R J B V H O H F W L Y L W \ B U H V W U L F W L Y H B T X H U \
H [ V ( [ S O R L W L Q J B 8 Q X V X D O B , Q W H U D F W L R Q V B $ G M X V W H G B T X H U \ B &