<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Lernen, Wissen, Daten, Analysen. October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Database and Workflow Optimizations for Spatial-Geometric Queries in GeoMine</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Martin Poppinga</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joel Graef</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Konrad Diedrich</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Rarey</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Norbert Ritter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Universität Hamburg</institution>
          ,
          <addr-line>Fachbereich Informatik, 22527 Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universität Hamburg, ZBH - Center for Bioinformatics</institution>
          ,
          <addr-line>20146 Hamburg</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>0</volume>
      <fpage>9</fpage>
      <lpage>11</lpage>
      <abstract>
        <p>Addressing computational problems in science often involves customized algorithmic approaches, which can lead to overlooking well-established solutions in data management and storage. When scientific datasets grow, these customized approaches may struggle to query data eficiently. Efective data management is essential for ensuring accurate and fast analysis of scientific data. Describing changes in the GeoMine software, this paper highlights the potential for improvements in data-driven science. GeoMine enables spatial-geometric searches in three-dimensional molecular space, facilitating tasks such as pharmaceutical drug discovery by finding similar geometric patterns in protein-ligand complexes. The original GeoMine application utilized a relational database solely for fundamental data storage and combined it with a tailored algorithmic pattern-matching strategy, leaving room for improvements. This work presents a technical overview of database and workflow optimizations in GeoMine to handle the increasing data size. Our improvements focus on moving the main computational tasks from the application level to the database system and optimizing the database utilization. A new query design, better utilization of indexes, and optimizations in textual queries led to a 15x speedup in our experiments, reducing the mean runtime of queries to under 8 seconds. The presented improvements are essential for GeoMine to be ofered as a service-oriented web application. The success of these improvements highlights the significance of database optimization in science, demonstrating the potential and necessity of proper data management.</p>
      </abstract>
      <kwd-group>
        <kwd>eol&gt;Database optimization</kwd>
        <kwd>query optimization</kwd>
        <kwd>data management</kwd>
        <kwd>databases for bioinformatics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Mining huge datasets is a central task in research. Analyzing molecular interactions between
proteins and small organic molecules is essential for understanding disease treatments and
advancing medical research. This includes searching for spatial similarities and geometric
arrangements, which can provide vital insights into the functional aspects of proteins. Results
can be used for further research, for example, in pharmaceutical drug discovery or
biotechnology [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With the growth of accessible datasets, searching for patterns in this data becomes
increasingly challenging [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. Besides the continuous growth of available experimental data,
machine-learning-based structure predictions add millions of new structural models [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        GeoMine [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is an application enabling a visual-guided geometric pattern search of molecular
data in three-dimensional space. It is embedded in the proteins.plus1 server [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a collection
of diferent web-based tools for various tasks in protein-based research. The server is a free
service based on publicly available datasets handling over half a million page requests per year.
The back end of GeoMine was derived in prior work from the PELIKAN application developed
in the same group [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which was utilizing a custom algorithmic approach for query processing.
With the Protein Data Bank (PDB) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] as a fast-growing dataset underlying GeoMine and the
shift from a desktop application to a server-based approach, GeoMine required an overhaul of
the original query workflow to maintain the ability to provide results in a fast manner.
      </p>
      <p>With this work, we investigate the potential of adopting a database-driven architecture,
focusing on the database as the main part of query execution and reducing application-side
processing. We were able to reduce the mean runtime in our experiments from about 2 minutes
per query to less than 8 seconds, utilizing changes in the workflow and database optimizations.
As we present in this work, a substantial performance enhancement has been achieved by
shifting to a more database-centric method.</p>
      <p>The paper is organized as follows: Section 2 provides an overview of the field of work, the
data structure, and the query design; Section 3 details the improvements made to the query
workflow and database optimizations; Section 4 presents and discusses the experimental results;
Section 5 concludes the paper and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Data Management and Storage</title>
        <p>
          Data management in scientific research involves the systematic collection, organization, storage,
and sharing of data to facilitate its reusability and ensure the reproducibility of research findings.
In the context of our work, which focuses on querying structured data sets, the storage aspect
is particularly important. In the scientific domain, many existing applications are designed
for single-user usage, often locally storing data in various formats or utilizing object stores
with limited retrieval possibilities [
          <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
          ]. For structured data, Relational Database Management
Systems (RDBMS) are the most commonly used systems, providing robust and eficient solutions.
Commonly, embedded systems are used, such as SQLite [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] for applications with smaller or
medium-sized data sizes or DuckDB [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for analytical workloads. For Online Transaction
Processing (OLTP) workloads which require fast query performance and regular updates, server-based
RDBMS are a popular choice. Large analytical queries are often served by designated Online
Analytical Processing (OLAP) systems such as data warehouses, which are often proprietary
solutions. For handling large-scale semi-structured datasets, NoSQL systems are frequently
used, with columnar and graph databases being popular for analytical queries. The choice
of data management and storage solutions is crucial to ensure eficient processing, reduced
resource consumption, and accurate and fast analysis of scientific data.
PostgreSQL GeoMine utilizes PostgreSQL [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], a robust and widely accessible open-source
database management system. As multiple users can access a web-based application such as
GeoMine at the same time, the ability of a client-server-based database system to handle multiple
queries eficiently in parallel is required. PostgreSQL’s widespread adoption [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] enables
cloudagnostic hosting on every major platform since most cloud platforms ofer PostgreSQL solutions
or other PostgreSQL-compatible scalable databases. Additionally, setting up on-premise or local
instances is straightforward. PostgreSQL is suited for OLTP and also OLAP workloads [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The
required workloads here can be depicted in the area of OLAP, given the potential complexity
of the designed queries. However, given the use case of an interactive search mask for a web
service, fast responses are a requirement. PostgreSQL’s eficient query planning and extensibility
for additional approaches (e.g., PostGIS [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] for spatial data or Citus [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] for distributed and
columnar storage) make it a suitable foundation for GeoMine’s use case.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Protein-Ligand Interactions and Binding Pockets</title>
        <p>
          Protein-ligand interactions are of particular interest in biomolecular and pharmaceutical
research. Ligands are small molecules that can interact and bind to the generally much larger
proteins. Protein complexes can contain multiple pockets of varying sizes, partly containing
ligands. Drug molecules used as pharmaceuticals are generally designed to target specific
proteins. Researchers can gain valuable insights by investigating specific three-dimensional
structures and searching for potential candidates to bind with these proteins.
Protein Data Bank The PDB [
          <xref ref-type="bibr" rid="ref2 ref7">7, 2</xref>
          ], established in 1971, is a comprehensive repository
of 3D structural data of proteins and nucleic acids. The structural information is primarily
obtained through experimental methods, predominantly X-ray crystallography, from research
facilities worldwide [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. As a freely available resource, the PDB has become vital for research in
various fields by providing atomic-scale structural insights for drug design and understanding
biological processes, containing more than 200,000 structures as of April 2023. Further, with the
advantage of Computed Structure Models, which are protein structure predictions, for example,
by AlphaFold2 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], additional datasets with about 1,000,000 structures are available now [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. GeoMine</title>
        <p>
          Discovering similar structures across distinct complexes or finding molecules that bind to a
specific pocket of interest is a major task in medical research. GeoMine is able to construct
comprehensive databases derived from the PDB and supports exploring these databases with a
web-based search interface. [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
        </p>
        <p>
          The preprocessing and database creation procedures employ components of the NAOMI
library [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. For example, pockets are classified in a complex preprocessing pipeline when
constructing the database [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Central components are the DoGSite algorithm [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], which identifies
empty binding pockets within protein structures, and the calculation of interactions [18].
        </p>
        <p>
          The central part of the search and unique key feature is the ability to specify geometric
properties, for instance, distances and angles between any points, such as atoms. Further,
point properties can be specified, such as an atom’s chemical element and interactions between
points. This way, precise structural motifs (structural patterns) in protein-ligand complexes
can be searched. While GeoMine’s predecessor PELIKAN was a single-user application based
on an integrated SQLite [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] database, the GeoMine back end is aimed at a server-focused
architecture. In the initial development of GeoMine [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], the query execution capabilities of
PELIKAN were extended for new functionality but were not changed in structure to adapt to
the new architecture.
        </p>
        <p>Database Design For our experiments in Section 4, we used a PostgreSQL15 database created
with the PDB dataset from October 2022. For querying the dataset, the database can be considered
read-only. The database requires approximately 165GB of disk space.</p>
        <p>For the geometric search, we focus on two tables. The first table, the point table, comprises all
atoms and other definable points, such as the center of aromatic rings. It contains 340,716,693
searchable entries. These points are distributed across 1,382,853 distinct pockets, which serve
as containers for groups of points. The largest pocket identified in our dataset contains 20,306
points, while the smallest pocket only holds 9 points. Each entry in the point table has a unique
identifier, references the containing pocket, and contains various other fields with properties
per point. Some properties, such as the accessible surface area of an atom, are floating point
numbers. Other attributes, such as the chemical element, contain only a few distinct values,
represented as integers or short strings.</p>
        <p>The second table, the interaction table, stores pre-calculated interactions [18]. These
interactions represent noteworthy connections between two points, for example, hydrogen bonds.
13,018,225 point pairs are stored here.</p>
        <p>Query Creation When creating a query, users can specify multiple constraints. The most
fundamental categories encompass Textual and Numerical Searches, wherein metadata filters
at the protein structure or pocket level can be defined. Users can directly pre-select several
structures or create various filters, such as the minimum number of particular chemical elements
or a certain molecular weight range for the ligand. It also enables filtering using patterns that
describe a local environment using the chemical substructure language SMARTS strings [19].</p>
        <p>The central search element and origin of GeoMine’s name are geometry-based searches. To
build the query, users may interactively select points in the web front end [21] (see Figure 1),
utilizing an arbitrary PDB file as a template structure or define them without a template.</p>
        <p>Users may select an arbitrary number of points, which can be filtered based on diferent
properties. Moreover, the specification of distance ranges between two points and angles
between specified distances is possible. Further, interactions between points, as stored in the
interaction table, can be added to the query. Together they resemble an atomic substructure,
which will be searched for. Each pocket can be examined individually as the interactions
between one ligand and an individual pocket in a protein are of interest.</p>
        <p>
          Query Execution The initial approach for query execution was first described for the
predecessor tool PELIKAN by Inhester et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The most significant enhancement for the runtime
in developing the original GeoMine approach — utilizing a PostgreSQL database instead of
SQLite — did not change the workflow of the searching process. The approach remained mostly
(a) View of a pocket (violet mesh) in (b) Specifing points, distances, and
        </p>
        <p>
          structure 1H1S angles for the query
algorithmic focused, with all major computational steps performed within the application
(see Figure 2a), as the original PELIKAN software was designed to be a standalone desktop
application. In the original approach of GeoMine [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], four major steps were performed strictly
sequentially for each query to filter the potential results:
1. Textual and Numerical Constraints - A filter eliminates all proteins and pockets that do
not meet specified properties or do not correspond to a given restrictive SMARTS filter.
        </p>
        <p>This step yields a list of all matching proteins and their pockets.
2. Obtaining all point pairs - For each point pair in the query, all possible results are returned,
and distances, as well as interaction constraints, are checked.
3. Clique detection - An algorithm reconstructs the coherent component graph for all
obtained point pairs and checks all defined angle constraints.
4. Less restrictive SMARTS filters for points were applied to the now-generated results.</p>
        <p>Steps two and three of the query processing presented particular challenges. All point and
point pair constraints were queried individually in the database. Since a single constraint for a
point pair is often not very specific, it leads to big intermediate results. Only by chaining several
constraints the number of points is suficiently reduced. The need to cross-verify each point
with all matching points in its pocket demanded significant computational resources, especially
if the filter for the points were unspecific. The list of potential pockets needed to be recreated
for each pair, as only pockets which contained results in prior pair subqueries remained in the
search space. This caused the search to be strictly sequential and required the serialization
and deserialization of long pocket-ID lists for the SQL WHERE clauses. As the application and
database system are separate processes or running on separate servers, the required repeated
transfer of these lists also afected the performance. Because some point-to-point constraints
were specific (less frequent in the dataset) and others were unspecific (frequent in the dataset),
a hand-crafted scoring function was utilized to estimate the best ordering of queries, starting
...</p>
        <p>Clique
Detection</p>
        <p>&amp;
Angle Filter
SMARTS</p>
        <p>Filter
Results</p>
        <p>Comprehensive Query
point1 JOIN point2 JOIN</p>
        <p>point3 JOIN point4
DB
SMARTS</p>
        <p>
          Filter
Results
with the most specific queries to reduce the search space early [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Although this improved
the join order in many cases, it had the disadvantage of preventing the database system from
executing classical optimizations, such as parallelism and join order optimization.
        </p>
        <p>Further, an additional algorithm was required since the results from the preceding steps
consisted only of point pairs. The Bron-Kerbosch algorithm [22], a graph-based backtracking
algorithm for clique detection, was used. This algorithm recursively verified whether all
discovered point pairs constituted a complete graph and checked for angle constraints. This
demanded substantial computational efort, taking several hours on large potential result sets.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Optimizations</title>
      <p>
        This research aims to achieve optimal performance and ease of setup across various
environments. Alongside the contributions of this work, the application has transitioned to a
containerized setup for cloud environments. The optimizations presented in this work are
essential for facilitating the deployment of a scalable application. In this section, we will distinguish
between the original approach in GeoMine [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and the improved approach we present in this
work. The yielded results for each query remained identical.
      </p>
      <sec id="sec-3-1">
        <title>3.1. Optimizing SQL Queries</title>
        <p>The most significant change from the original approach was the redesign of the SQL query
generation. Sequential processing of each constraint within a query led to severely limited
query-level parallelism and long processing times as described in Section 2.3. Therefore, all SQL
queries are now designed to make use of PostgreSQL’s internal planning and optimization. In
contrast to the original approach, where each point-to-point constraint was queried separately,
a single comprehensive query containing all attributes and constraints for geometrical patterns
is now constructed, see Figure 2b. This reduces overhead by eliminating the need to repeatedly
serialize extensive lists of pocket IDs or create temporary tables. To achieve this, the point table
joins itself as often as points were specified in the query, usually 5-15 times. As a match occurs
inside a single pocket, we only need to join points within the same pocket. With information
about the distribution of properties like the chemical element, the RDBMS can estimate which
part of the query restricts the search space the most and improve the join order. The original
approach required running the checks on all points within all remaining pockets, not being able
to skip points that were not matched in earlier subqueries. Intermediate results now remain
within the database system and do not require serialization for application transfer. Additionally,
merging all constraints (points, distances, and interactions) into one query eliminates the need
for clique detection, as the output of the RDBMS is a connected and valid result.</p>
        <p>Among all the geometric properties, only the angle checking between point pairs remains a
separate step in the application, as this increases the complexity of the query without showing
the benefits of an early reduced search space in our tests. Textual and numerical filters remain in
a separate query to allow prior filtering, as SMARTS patterns require in-application processing.
Allowing the RDBMS to determine the join order and the parallel execution resulted in a
significant speedup of benchmark queries. The results are detailed in Section 4.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Enhanced Utilization of PostgreSQL Indexes</title>
        <p>In the original approach, a single extensive index structure was created, covering 15 out of
17 table columns. Although PostgreSQL allows for the construction of multi-column indexes
with a large number of attributes, these structures are only efective in certain situations due to
their size and depending on the used attributes. However, using multiple single-column indexes
and allowing PostgreSQL to combine them as recommended in the documentation [23] did not
achieve the desired performance improvement.</p>
        <p>
          Only the combination of several attributes could substantially reduce the number of yielded
points. The best-found solution for our workload was a balanced compromise between index
size and utilization, including only the most frequently used columns in a multi-column index.
We identified two separate cases for index usage. Firstly, the earliest scheduled subquery focused
solely on the attributes, disregarding their pocket, in cases without textual and numerical filters.
Secondly, an index for subsequent subqueries was needed to filter for pocket IDs required for the
join. In almost all instances, the optimizer determined to filter for the pocket ID in the second
subquery. In some instances, a parallel index scan was performed. Filtering by the pocket ID
reduced the search space best in these cases since the most restrictive subquery had already
been executed as the first scheduled subquery. Therefore, we introduced a second index with
the pocket identifier positioned first in the index. For both structures, we utilized PostgreSQL’s
default B-Tree index as other index structures seemed not beneficial in our tests. As pockets
usually contain only a few hundred points, spatial indexes, like r-trees provided by PostGIS [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ],
did not provide the desired benefits. Filtering points and calculating all distances performed
better in our tests than spatial operations due to the overhead of utilizing a spatial column.
Index creation only needed a few minutes, but additional indexes for specific queries would no
longer fit into the filesystem read cache and reduce performance.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Improving Text Search</title>
        <p>The initial step of the workflow involves filtering structures based on textual and numerical
attributes. These filters target various properties, the most important being the PDB identifiers
used to select a pre-defined or user-defined subset of protein structures. A short alphanumeric
code identifies each structure.</p>
        <p>Previously, an SQL ILIKE (case insensitive match) statement with a wildcard match at the
beginning and end of the string was executed to check for the desired properties. For the PDB
codes, we could make two changes. We could discard the wildcards in the query unless explicitly
desired, which enables the utilization of a search index. And as the codes are not case-sensitive,
we can replace the ILIKE with a LIKE, allowing for a case-sensitive search and resulting in a
substantial speedup, as demonstrated in Section 4.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation and Discussion</title>
      <sec id="sec-4-1">
        <title>4.1. Methods</title>
        <p>To evaluate the impact of each modification suggested for GeoMine, several experiments were
derived from the original GeoMine approach ex01 (see Table 1). Experiments ex02 to ex05 each
contain only one of the improvements, ex06 contains all improvements, while experiments ex07
to ex10 contain all except one. This way, we show which change impacts the performance most,
as diferent improvements benefit from each other.</p>
        <p>
          For evaluating the performance across diferent workloads, we used a set of nine queries
already used in previous work [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], designed to highlight available features, show examples
for common applications and estimate the runtime of diferent patterns common in GeoMine
practical applications. They emitted between 2 and 7117 results.
        </p>
        <p>We used a PostgreSQL15 database system. All data was stored on an SSD. Unless otherwise
specified, a dedicated server with 400GiB RAM and 80 Cores was used (PostgreSQL 128GB
sharedbufers , 16 parallel workers). Podman [24] was used to deploy the system. Each experiment
was repeated five times. The GeoMine application was executed on the same node as the
PostgreSQL database. We configured PostgreSQL to utilize less memory than available, as
GeoMine required a high amount of working memory for some workloads. Additionally, we
conducted tests on commodity systems by employing two setups (small/medium) using virtual
servers. Both setups stored data on SSDs and were equipped with 12 cores and 24GB RAM, resp.
18 cores and 48GB RAM.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results</title>
        <p>Figure 3 shows the mean runtime of the nine test queries for each experiment as depicted in
Table 1. Each change led to better performance, with the highest performance gain occurring
609.7s
630.4s
570.3s
when all changes were applied together. The required time for performing all nine queries
decreased from 1033sec of the original approach (ex01) to 68sec with all improvements (ex06).</p>
        <p>The new query design (ex04) had the most substantial impact on performance, particularly
visible in the long-running queries. Also, the transition from the ILIKE to the LIKE statement
notably reduced runtime. The performance gain is most noticeable on the medium-running
queries containing a long list of PDB IDs for a preselection. The experiments 02 and 03, the new
index and no wildcards in the PDB ID selection showed only a small improvement. However,
experiment 09, which contains all changes except the wildcard improvement, shows that it has an
impact on the overall runtime, presumably benefiting from the switch to the LIKE statement. The
changes in index structures showed less impact than expected, demonstrating that PostgreSQL
can handle indexes with an inflated number of columns. However, the performance was
drastically worse if no index was used or index structures did not combine multiple attributes.
For instance, combining one index per attribute led to an increase of the sum of the mean
runtimes from 68sec (ex06) to 134sec.</p>
        <p>Unspecific Queries Some of the used queries include a protein filter to reduce the number
of searched pockets. When removing these filters and searching the whole dataset, the original
approach reached its set limits (needing more than 100GB RAM or 1h time) on some of these
and other queries with less restrictive geometric filters. With the improved approach, some
queries with extensive intermediate results could now be computed for the first time, often
within minutes.</p>
        <p>Alternative Setups As large database instances are not always accessible, for example, due
to cost constraints in cloud environments, we also conducted our experiment on two smaller
virtual servers. As shown in Figure 4a, the performance gains were also visible on these smaller
server instances. These tests were performed on shared hardware, so they can only show a
general trend rather than precise comparative data. However, they demonstrate the feasibility
Ex01 Dedicated</p>
        <p>Ex01 Medium
itranaVEx06EDx0e1diScamteadll 7.5s</p>
        <p>Ex06 Medium 9.3s</p>
        <p>Ex06 Small 9.6s
0
of processing on shared virtual servers. Additionally, we observed a substantial speedup while
transitioning from PostgreSQL10 to PostgreSQL15 as displayed in Figure 4b. Combined with
our improvements, we achieved a speedup factor of 32.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>GeoMine is a unique application for geometric searches in large collections of protein-ligand
complexes with high relevance for life-science research. We showed that it was possible to
achieve a large speedup on our query processing by moving major parts of the processing
from a custom-written logic inside the software to a PostgreSQL database system. Additionally,
diferent approaches in database optimization contributed to further performance gain. Overall,
these achievements are critical for the practical use of the system handling the growing dataset.
Some queries could be executed for the first time on our setup due to these changes. In this work,
we focused on optimizations of the database and query design. We demonstrated the substantial
benefits of database optimizations in scientific applications, achieving a fifteen-fold speedup
in GeoMine. Coupled with a halving of the runtime through the use of a newer PostgreSQL
version, we managed to reduce the average runtime from minutes to seconds.</p>
      <p>Looking ahead, we plan to explore additional database paradigms, such as distributed or
column-based systems, and establish schema changes for further optimizations. The caching of
intermediate results, as well as determining the join order by extended statistics or by utilizing
machine learning, may potentially provide additional benefits. This way, we aim to achieve
even better performance for searching scientific data with a service-oriented web service.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the German Federal Ministry of Education and Research as part of
CompLS and de.NBI (031L0172 and 031L0105).
and reliable binding site detection and descriptor calculation with dogsite3, Journal
of Chemical Information and Modeling 63 (2023) 3128–3137. doi:10.1021/acs.jcim.
3c00336, pMID: 37130052.
[18] T. Inhester, S. Bietz, M. Hilbig, R. Schmidt, M. Rarey, Index-based searching of interaction
patterns in large collections of protein–ligand interfaces, Journal of Chemical Information
and Modeling 57 (2017) 148–158.
[19] I. Daylight Chemical Information Systems, Smarts-a language for describing molecular
patterns, 2007.
[20] A. S. Rose, A. R. Bradley, Y. Valasatava, J. M. Duarte, A. Prlić, P. W. Rose, NGL viewer:
web-based molecular graphics for large complexes, Bioinformatics 34 (2018) 3755–3758.
doi:10.1093/bioinformatics/bty419.
[21] K. Diedrich, J. Graef, K. Schöning-Stierand, M. Rarey, GeoMine: interactive pattern mining
of protein–ligand interfaces in the Protein Data Bank, Bioinformatics 37 (2020) 424–425.
doi:10.1093/bioinformatics/btaa693.
[22] C. Bron, J. Kerbosch, Algorithm 457: finding all cliques of an undirected graph,
Communications of the ACM 16 (1973) 575–577.
[23] PostgreSQL 15, Documentation, 2023. URL: https://www.postgresql.org/docs/15/.
[24] Containers, podman, 2023. URL: https://podman.io/.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Inhester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rarey</surname>
          </string-name>
          ,
          <article-title>Protein-ligand interaction databases: advanced tools to mine activity data and interactions on a structural level</article-title>
          ,
          <source>WIREs Computational Molecular Science</source>
          <volume>4</volume>
          (
          <year>2014</year>
          )
          <fpage>562</fpage>
          -
          <lpage>575</lpage>
          . doi:
          <volume>10</volume>
          .1002/wcms.1192.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Burley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhikadiya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bittrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chao</surname>
          </string-name>
          , L. Chen, e. a. Craig,
          <article-title>RCSB Protein Data Bank (RCSB.org): delivery of experimentally-determined PDB structures alongside one million computed structure models of proteins from artificial intelligence/machine learning</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>51</volume>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Graef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ehrt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Diedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Poppinga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rarey</surname>
          </string-name>
          ,
          <article-title>Searching Geometric Patterns in Protein Binding Sites and Their Application to Data Mining in Protein Kinase Structures</article-title>
          ,
          <source>Journal of Medicinal Chemistry</source>
          <volume>65</volume>
          (
          <year>2022</year>
          )
          <fpage>1384</fpage>
          -
          <lpage>1395</lpage>
          . doi:
          <volume>10</volume>
          .1021/acs. jmedchem.1c01046.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jumper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Evans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Figurnov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ronneberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tunyasuvunakool</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Žídek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Potapenko</surname>
          </string-name>
          , et al.,
          <article-title>Highly accurate protein structure prediction with alphafold</article-title>
          ,
          <source>Nature</source>
          <volume>596</volume>
          (
          <year>2021</year>
          )
          <fpage>583</fpage>
          -
          <lpage>589</lpage>
          . doi:
          <volume>10</volume>
          .1038/s41586-021-03819-2.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Schöning-Stierand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Diedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fährrolfes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Flachsenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Meyder</surname>
          </string-name>
          , E. Nittinger,
          <string-name>
            <given-names>R.</given-names>
            <surname>Steinegger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rarey</surname>
          </string-name>
          ,
          <article-title>Proteins plus: interactive analysis of protein-ligand binding interfaces</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>48</volume>
          (
          <year>2020</year>
          )
          <fpage>W48</fpage>
          -
          <lpage>W53</lpage>
          . doi:
          <volume>10</volume>
          .1093/nar/gkaa235.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Inhester</surname>
          </string-name>
          , Mining of Interaction Geometries in Collections of Protein Structures,
          <source>Ph.D. thesis</source>
          , Universität Hamburg,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H. M.</given-names>
            <surname>Berman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Westbrook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Feng</surname>
          </string-name>
          , G. Gilliland,
          <string-name>
            <given-names>T. N.</given-names>
            <surname>Bhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Weissig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. N.</given-names>
            <surname>Shindyalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. E.</given-names>
            <surname>Bourne</surname>
          </string-name>
          ,
          <article-title>The Protein Data Bank</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>28</volume>
          (
          <year>2000</year>
          )
          <fpage>235</fpage>
          -
          <lpage>242</lpage>
          . doi:
          <volume>10</volume>
          . 1093/nar/28.1.235.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tenopir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Rice</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Allard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baird</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Borycz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Christian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Grant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Olendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Sandusky</surname>
          </string-name>
          ,
          <article-title>Data sharing, management, use, and reuse: Practices and perceptions of scientists worldwide</article-title>
          ,
          <source>PloS one 15</source>
          (
          <year>2020</year>
          )
          <article-title>e0229003</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Raasveldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          ,
          <article-title>Data management for data science-towards embedded analytics</article-title>
          .,
          <source>in: CIDR</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R. D.</given-names>
            <surname>Hipp</surname>
          </string-name>
          , SQLite,
          <year>2020</year>
          . URL: https://www.sqlite.org/.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>The PostgreSQL Global Development Group</surname>
          </string-name>
          ,
          <article-title>Postgresql: The world's most advanced open source relational database</article-title>
          ,
          <year>2023</year>
          . URL: https://www.postgresql.org.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>solid</surname>
            <given-names>IT</given-names>
          </string-name>
          gmbh, Db-engines ranking,
          <year>2023</year>
          . URL: https://db-engines.com/en/ranking.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conrad</surname>
          </string-name>
          ,
          <article-title>Database of the year: Postgres, IEEE Software 38 (</article-title>
          <year>2021</year>
          )
          <fpage>130</fpage>
          -
          <lpage>132</lpage>
          . doi:
          <volume>10</volume>
          . 1109/MS.
          <year>2021</year>
          .
          <volume>3089730</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>The PostGIS Development Group</surname>
          </string-name>
          , Postgis,
          <year>2023</year>
          . URL: https://postgis.net.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>U.</given-names>
            <surname>Cubukcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Erdogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sannakkayala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Slot</surname>
          </string-name>
          ,
          <article-title>Citus: Distributed postgresql for data-intensive applications</article-title>
          ,
          <source>in: Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21</source>
          ,
          <year>2021</year>
          , p.
          <fpage>2490</fpage>
          -
          <lpage>2502</lpage>
          . doi:
          <volume>10</volume>
          .1145/3448016.3457551.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Urbaczek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolodzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. R.</given-names>
            <surname>Fischer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lippert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heuser</surname>
          </string-name>
          , I. Groth,
          <string-name>
            <given-names>T.</given-names>
            <surname>Schulz-Gasch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rarey</surname>
          </string-name>
          , Naomi:
          <article-title>On the almost trivial task of reading molecules from diferent file formats</article-title>
          ,
          <source>Journal of Chemical Information and Modeling</source>
          <volume>51</volume>
          (
          <year>2011</year>
          )
          <fpage>3199</fpage>
          -
          <lpage>3207</lpage>
          . doi:
          <volume>10</volume>
          . 1021/ci200324e.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>J.</given-names>
            <surname>Graef</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ehrt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rarey</surname>
          </string-name>
          ,
          <article-title>Binding site detection remastered: Enabling fast, robust,</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>