<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IIR</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Siren Federate: Bridging Document, Relational, and Graph Models for Exploratory Graph Analysis</article-title>
        <subtitle>Extended Abstract</subtitle>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Georgeta Bordea</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stéphane Campinas</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Catena</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Renaud Delbru</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>L3i, La Rochelle University - La Rochelle</institution>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Siren - Galway</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>15</volume>
      <fpage>0000</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>Investigative intelligence workflows, spanning domains such as law enforcement, cybersecurity, financial compliance, and investigative journalism, require analysts to iteratively explore and correlate large, heterogeneous datasets, often represented as knowledge graphs (KGs) that integrate structured records, semi-structured logs, and unstructured content like text or multimedia. Analysts typically begin investigations with only partial clues, such as a name, phone number, or suspicious transaction, and must follow complex, multi-hop connections to uncover relevant entities and relationships. This exploratory process involves issuing tens or hundreds of queries, making low-latency responsiveness essential for preserving cognitive flow and enabling rapid hypothesis testing. However, existing graph and relational database systems struggle to support such interactive analysis at scale, especially over massive graphs containing billions of entities and relations. As a result, even modest delays can compound and render real-world investigations slow, shallow, or infeasible. To address these challenges, we introduce Siren Federate, a system designed to enable interactive, low-latency exploration of multi-modal knowledge graphs by integrating relational and graph processing capabilities directly into document-oriented databases such as Elasticsearch. By bridging document, relational, and graph data models within a unified system, Siren Federate supports search, filtering, and multi-hop path traversal, allowing analysts to execute complex, iterative queries with sub-second to second response times, even at the scale of billions of entities and relations. To achieve the scalability and low latency required by investigative intelligence workloads, Siren Federate incorporates several key architectural innovations. First, it implements distributed join algorithms optimized for Elasticsearch's log-structured, shard-based architecture. 
These enable efficient execution of relational operations across distributed datasets by minimizing data movement across the network. Second, it supports columnar, off-heap, in-memory data processing with late materialization and morsel-driven parallelism, which improves CPU cache locality and memory management. Siren Federate further includes a cost-based, adaptive query planner (AQP) that interleaves query planning and execution in stages. This planner selects the most efficient join strategy at runtime, based on statistics collected during previous stages. Query plan folding merges semantically equivalent operators within a query plan to eliminate redundancy, while semantic caching stores compact bitset representations of semi-join outputs for reuse across iterative queries.</p>
      </abstract>
      <kwd-group>
        <kwd>Exploratory Graph Analysis</kwd>
        <kwd>Knowledge Graph</kwd>
        <kwd>Database and Information System Architecture</kwd>
        <kwd>Distributed Join Algorithms</kwd>
        <kwd>Document-oriented Database</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>
        A central contribution in graph query processing is the Semi-Join Decomposition (SJD) technique [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
SJD mitigates the combinatorial explosion of intermediate results by decomposing multi-hop
pathfinding queries into multiple semi-joins. This reduces memory usage and computational overhead,
thereby enhancing scalability and efficiency when working with large graphs. While applicable to
general multi-hop path queries, SJD is especially effective for all-shortest-paths problems, offering a
practical solution to the challenges faced by alternative methods. Its integration with Siren Federate’s
adaptive query planner and semantic caching further boosts its efficiency for exploratory graph analysis.
      </p>
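      <p>To make the idea of decomposing a path query into semi-joins more concrete, the following is a minimal, illustrative sketch over an in-memory edge list. It is not the technique of [3] nor Siren Federate's implementation; the function names (semi_join_targets, shortest_path_edges), the tuple-based edge representation, and the max_hops parameter are assumptions made for this example only. The forward pass performs one semi-join per hop and keeps only sets of node identifiers, and the backward pass prunes edges that cannot lie on a shortest path, so full paths are never enumerated as intermediate results.</p>
      <preformat preformat-type="code">
```python
from typing import Iterable, List, Set, Tuple

# Hypothetical edge representation for this sketch: (source_id, target_id).
Edge = Tuple[str, str]


def semi_join_targets(edges: Iterable[Edge], sources: Set[str]) -> Set[str]:
    """Semi-join the edges against `sources`, keeping only distinct target ids."""
    return {dst for src, dst in edges if src in sources}


def shortest_path_edges(
    edges: List[Edge], start: str, goal: str, max_hops: int = 6
) -> List[Set[Edge]]:
    """Return, hop by hop, the edges lying on some shortest start-to-goal path.

    An empty list means no path within `max_hops` (or start equals goal).
    """
    # Forward pass: one semi-join per hop. Each level is a set of node ids,
    # so memory grows with the number of nodes, not the number of paths.
    levels = [{start}]
    seen = {start}
    while goal not in levels[-1]:
        if len(levels) > max_hops:
            return []
        frontier = semi_join_targets(edges, levels[-1]) - seen
        if not frontier:
            return []
        seen |= frontier
        levels.append(frontier)

    # Backward pass: starting from the goal, keep only edges whose target is
    # still relevant and whose source belongs to the previous forward level.
    keep = {goal}
    hops: List[Set[Edge]] = []
    for depth in range(len(levels) - 1, 0, -1):
        hop = {(s, d) for s, d in edges if d in keep and s in levels[depth - 1]}
        hops.append(hop)
        keep = {s for s, _ in hop}
    hops.reverse()
    return hops
```
      </preformat>
      <p>Under these assumptions, the per-hop edge sets returned by the sketch jointly encode all shortest paths between the two nodes, and individual paths can be enumerated from them only when the analyst actually requests them.</p>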
      <p>We experimentally evaluated Siren Federate to assess its efficiency across different scenarios. In a
first series of experiments, we used a synthetic dataset comprising ~15 billion cell phone location
records, a common data source in investigative contexts, to demonstrate the system’s scalability
with large data volumes in a distributed environment. With these experiments, we also validated our
system’s capability to process semi-joins with sub-second to second response times. This result is
important since semi-joins are fundamental for exploratory graph analysis, as they enable operations
like set-to-set navigation, graph expansion, and pathfinding.</p>
      <p>
        In a second series of experiments, we employed the LDBC Financial Benchmark [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which models
financial industry data and workloads. The dataset used contains ~5 million entities (e.g., persons,
accounts) and ~26 million relations (e.g., money transfers), while the queries required matching
graph patterns of varying complexity and were expressed using the standard Graph Query Language
(GQL) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which is supported by Siren Federate. The experimental results demonstrate our system’s capability
to handle complex graph query patterns and to process these queries within seconds.
      </p>
      <p>Finally, a real-world deployment at Apollo.io confirms Siren Federate’s capacity to operate at scale,
supporting a 350-node cluster managing nearly half a petabyte of data and multiple concurrent users.
The system reduced average query response times from 7 seconds to sub-second, while significantly
improving cluster stability and reducing query failures. These findings demonstrate the applicability
and robustness of our system in a large, highly concurrent production environment.</p>
      <sec>
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this submission, the authors used ChatGPT to improve the writing in parts of
the text and to check grammar and spelling. After using this service, the authors reviewed and edited
the text as needed and take full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Bordea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Campinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Delbru</surname>
          </string-name>
          ,
          <article-title>Siren Federate: Bridging document, relational, and graph models for exploratory graph analysis</article-title>
          ,
          <source>arXiv preprint arXiv:2504.07815</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Campinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Catena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Delbru</surname>
          </string-name>
          ,
          <article-title>Siren Federate: Bridging the Gap Between Document and Relational Data Systems for Efficient Exploratory Graph Analysis</article-title>
          , in:
          <source>Proc. IDEAS</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Pini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tummarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Delbru</surname>
          </string-name>
          ,
          <article-title>Optimization of Database Sequence of Joins for Reachability and Shortest Path Determination</article-title>
          .
          <source>U.S. Patent 11720564</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Szárnyas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Tong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          , et al.,
          <article-title>The LDBC Financial Benchmark</article-title>
          ,
          <source>arXiv preprint arXiv:2306.15975</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Deutsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Francis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hare</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Libkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lindaaker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Marsault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Martens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michels</surname>
          </string-name>
          , et al.,
          <article-title>Graph Pattern Matching in GQL and SQL/PGQ</article-title>
          , in:
          <source>Proc. SIGMOD/PODS</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>