
Benchmarking SPARQL Engines on Wikidata Queries

Peter F. Patel-Schneider

2025

Four open-source SPARQL engines are evaluated on three existing benchmarks and one new benchmark for queries against Wikidata, a large, widely used, community-built knowledge graph. Of the engines benchmarked (Blazegraph, MillenniumDB, QLever, and Virtuoso), QLever is the fastest. Blazegraph, which is the SPARQL engine used in the official Wikidata Query Service, is significantly slower than some other engines. All of the engines have deviations from the SPARQL standard.

1. Introduction

shows that other modern SPARQL query engines are also much faster than Blazegraph on some Wikidata queries. There are some first-party benchmarks showing that modern SPARQL engines, including MillenniumDB [11] and QLever [8, 12], are faster than Blazegraph on the Wikidata RDF dump, but there has been no large third-party comparison of the engines.

To better compare the performance of modern SPARQL engines with that of Blazegraph, an effort was undertaken to benchmark the query performance of several open-source SPARQL engines on the entire Wikidata RDF dump. This is only a part of what is needed in a replacement for Blazegraph in the Wikidata Query Service but is an important part. The analysis of the benchmark results here was designed to be more useful in determining the overall performance of a service than in determining the performance that users of the service should expect.

Three open-source systems that were known to be able to load Wikidata RDF dumps and run SPARQL queries on them with reasonable effort were selected. These systems are MillenniumDB [11], QLever, and Virtuoso Open Source [13]. Three existing Wikidata benchmarks were selected and a new benchmark based on Scholia [14] was created. An October 2024 RDF dump of Wikidata was loaded into each of the modern engines and Blazegraph. The benchmarks were run on all four engines and their performance is reported and analyzed here. More information about the benchmarking, including the benchmarks and all code used, is available at https://github.com/wikius/benchmark-wikidata.

The closest third-party study of SPARQL engines on Wikidata was performed by Lam et al. [15]. They tested the query performance of several SPARQL engines, including an earlier version of QLever, on Wikidata, using 328 sample queries. This early version of QLever performed poorly in their testing. They did not test any of the other performant open-source SPARQL engines, and QLever has undergone major improvements since their study.

2. Wikidata in RDF

There is an encoding of Wikidata into RDF, and RDF dumps of Wikidata are made weekly. There are two different kinds of dumps. One kind includes only truthy statements (triples), that is, statements without their qualifiers and other information, no deprecated statements, and normal rank statements only if there is no preferred rank statement for the same subject and predicate. The other kind of dump is a full dump that has both truthy statements and a complex encoding of all statements that includes the rank, qualifiers, and other information about each statement. As of October 2024, the full dumps of Wikidata had about 20 billion triples. The full dumps in Turtle [16] were over 100GB compressed and over 850GB uncompressed.

There are public services that evaluate SPARQL queries against full dumps of Wikidata for all four of the SPARQL engines selected. The official service, which uses Blazegraph, has the most up-to-date information, generally lagging by only a few seconds as updates to Wikidata are processed and then incorporated into its RDF graph. The QLever service uses similar information and also lags only slightly. The QLever service lags by around a week, as it can process the weekly dumps in well under a day. The MillenniumDB service uses the weekly dumps and thus lags by somewhat over a week. The data used by the Virtuoso service is only updated irregularly and can lag by months.

3. The Benchmarks

Three existing benchmarks were selected. These were chosen to provide a varied set of queries with different selection criteria and difficulty.

WGPB [17] consists of 50 instantiations of each of 17 simple query patterns (see footnote 1). A pattern is, in essence, a small graph whose nodes are shared variables in a set of SPARQL BGPs. Each pattern is instantiated by picking Wikidata properties for each edge and constructing the BGPs, which are then expanded into a full SPARQL query. Finally a LIMIT 1000 is added, resulting in 850 SPARQL queries.

Footnote 1: A simple query here is one with only one SPARQL construct or a small number of similar SPARQL constructs. A complex query has several different SPARQL constructs.
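As a rough illustration of the instantiation step, the following Python sketch builds one query from a three-edge path pattern. The pattern representation, the function name `instantiate`, the use of `SELECT *`, and the chosen property identifiers are all hypothetical and are not taken from WGPB; only the appended LIMIT 1000 and the count of 17 patterns with 50 instantiations each come from the description above.

```python
# Hypothetical pattern representation: each edge is a (subject, object) pair
# of variable names; a property is chosen for every edge at instantiation time.
PATH_PATTERN = [("x1", "x2"), ("x2", "x3"), ("x3", "x4")]


def instantiate(pattern, properties, limit=1000):
    """Build a SPARQL query by labelling each pattern edge with a property."""
    if len(pattern) != len(properties):
        raise ValueError("need exactly one property per edge")
    triples = [f"  ?{s} wdt:{p} ?{o} ." for (s, o), p in zip(pattern, properties)]
    body = "\n".join(triples)
    return (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"SELECT * WHERE {{\n{body}\n}} LIMIT {limit}"
    )


# Arbitrary example properties (assumption, not the benchmark's choices).
query = instantiate(PATH_PATTERN, ["P31", "P279", "P361"])

# 17 patterns with 50 instantiations each give the 850 queries in the text.
total_queries = 17 * 50
```

Printing `query` shows a three-triple BGP wrapped in a SELECT with the LIMIT appended.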

Figure 1: Part of Scholia Page for Richard Feynman (Q39246)

WDBench [18] consists of query fragments from the anonymized Wikidata SPARQL Logs (see footnote 2) evaluated by the Wikidata Query Service in 2017 and 2018 [19]. The queries used were chosen from those that had timed out. The BGPs, property paths, and some other portions of the queries were extracted and categorized into those with a single BGP (280 queries), those with multiple BGPs (681 queries), those with OPTIONAL clauses (498 queries), those with property paths (660 queries), and those that did not fit into any of the above categories (539 queries). Each of these five sets of query fragments is treated as a single benchmark here.

These query fragments have to be expanded into full queries by adding the SELECT portion. The query fragments do not retain FILTER or other limits on the size of the answer set and can return very large answer sets. The original benchmarking thus added a LIMIT 100000 to limit the number of answers. To stress modern SPARQL engines, this benchmarking instead uses LIMIT 10000000, an arbitrary choice.
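The expansion can be pictured with a minimal Python sketch. The function name `expand_fragment` and the use of `SELECT *` are assumptions made for illustration; the text does not say which SELECT portion was added. The two limits are the ones stated above.

```python
def expand_fragment(fragment, limit=10_000_000):
    """Wrap a WDBench-style graph pattern in a SELECT query with a LIMIT."""
    return f"SELECT * WHERE {{ {fragment.strip()} }} LIMIT {limit}"


# A made-up single-triple fragment, used only to show the shape of the output.
fragment = "?x <http://www.wikidata.org/prop/direct/P31> ?y ."

original = expand_fragment(fragment, limit=100_000)  # limit of the original benchmarking
stressed = expand_fragment(fragment)                  # larger limit used here
```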

WDQS [12] consists of a set of 298 queries extracted from Wikidata Query Service logs. This benchmark was used to evaluate the comparative performance of several SPARQL engines. Several of the queries return hundreds of millions of answers. For these a LIMIT 10000000 is added to the query here.

A new benchmark was created from the queries used by the Scholia [14] interface to Wikidata. This interface is designed to show information related to scholarly articles. A request for Scholia information takes the form of usually one, but sometimes more, Wikidata identifiers. The class(es) of these identifier(s) in Wikidata are determined and a template HTML document is selected based on the class(es). There is a default template if there is no template specifically for the class(es). The template document has sections that are replaced by information constructed from the results of SPARQL queries, which are constructed by inserting the identifier(s) into a query template.

For example, a Scholia request for Wikidata identifier Q39246, the item for Richard Feynman, would query Wikidata to find that the item with this identifier is a human and use the author document template to determine what queries to construct and how to create the HTML document partly shown in Figure 1.

Some of the Scholia queries are difficult for the Wikidata Query Service to evaluate and queries time out, resulting in documents with errors in them. Further, running these difficult queries puts a significant load on the Wikidata Query Service. The group maintaining Scholia is thus interested in determining whether a different SPARQL engine would do better.

The advantage of using Scholia to construct a benchmark is that many queries can be constructed from the templates. However, there are only about 375 query templates, and some of the templates are similar to each other, so there is not a wide variety of different queries. Another problem with using Scholia query templates for benchmarking is that they use extensions to SPARQL that are specific to Blazegraph.

The Scholia benchmark was constructed by determining the query templates for 33 different classes. The query templates were then turned into standard SPARQL by expanding named queries, replacing the Wikidata Label Service with query fragments to determine English-language labels, and making a few other, minor modifications. For each of these classes, five items belonging to the class were determined. In a few cases these items were selected by hand but in most cases the items are the first five answers to a query that returned instances of the class that had values for properties used in one or more of the query templates.

Some of the templates are complex. For example, here is a query template for the author document after conversion to standard SPARQL, edited to present better. The target: prefix is instantiated with the URL for the Wikidata identifier being used.

SELECT ?year (count(?work) AS ?numb_of_publs) ?role WHERE {
  { SELECT (str(?year_) AS ?year) (0 AS ?pp) ("_" AS ?role) WHERE {
      ?year_item wdt:P31 wd:Q577 .

Footnote 2: The logs of the Wikidata Query Service are considered to be private as they might contain personally-identifying information, so constructing public benchmarks from them is not easy.

A few queries returned large answer sets, which is not useful when constructing the final document, so LIMIT clauses were added. A few queries had errors, which caused them to return incorrect answer sets, and were fixed. All these changes were sent to the Scholia repository and have been incorporated into it.

A query run then consists of instantiating each query template with each item and evaluating the resultant query.
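A query run of this kind can be sketched in a few lines of Python. The template text, the template name, and the second item are made up for illustration; the only details taken from the text are that the `target:` prefix is bound to the URL of the Wikidata identifier and that every template is paired with every item. Binding `target:` by prepending a PREFIX declaration is an assumption about how the instantiation could be done.

```python
from itertools import product

# Hypothetical template in standard SPARQL; "target:" stands for the item.
templates = {
    "author-works": (
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        "SELECT ?work WHERE { ?work wdt:P50 target: . }"
    ),
}

# Q39246 is the Richard Feynman item from the text; Q937 is an arbitrary second item.
items = ["Q39246", "Q937"]


def instantiate(template, qid):
    """Bind the target: prefix to the entity URL of the given item."""
    prefix = f"PREFIX target: <http://www.wikidata.org/entity/{qid}>\n"
    return prefix + template


# One query per (template, item) pair.
query_run = [
    (name, qid, instantiate(template, qid))
    for (name, template), qid in product(templates.items(), items)
]
```

Each entry of `query_run` holds the template name, the item, and the query text that would be sent to an engine.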

4. Running the Benchmarks

The benchmarks were all run on a machine with a Ryzen 9950X CPU, 192GB of main memory, and fast NVMe SSD drives running the Fedora Linux distribution. MillenniumDB, QLever, and Virtuoso were downloaded from their open-source repositories, using the version current as of 05 March 2025 for MillenniumDB, 22 March 2025 for QLever, and 19 March 2025 for Virtuoso.

They were compiled using scripts from the repositories. Blazegraph was run from a Docker image for the current version of Blazegraph because of issues with Java. This may slow down Blazegraph by up to 10%, but probably only slows Blazegraph down by a few percent. This possible penalty does not affect the main conclusions of the evaluation.

Wikidata RDF dumps from late October 2024 were loaded into all four engines using settings determined in consultation with developers where possible. Loading was relatively easy for MillenniumDB, QLever, and Virtuoso and took less than a day for each, with QLever being fastest at about 4.5 hours. Loading the dumps into Blazegraph took over 10 days and the first try failed, probably due to a bug related to concurrent access to some data. As loading into Blazegraph was difficult, no attempt was made to use newer dumps of Wikidata.

Settings for the engines during benchmarking were determined in consultation with developers where possible and set up so that about 3/4 of main memory was used by the engine. This is more memory than is commonly allowed in the public Wikidata services but was chosen to better reflect expected memory growth in the near future. The engines are allowed to use multiple threads, but all except Blazegraph are only lightly threaded when querying.

Each query is run with a 10-minute timeout. This is larger than most public Wikidata services, which generally use a 1-minute timeout, and was chosen to see behavior of the engines on a longer timeframe and to provide some indication about behavior in future with faster computers.

Each benchmark run is performed from a cold start, with system caches emptied, and timed after any startup done by the engine. This means that any engine that defers startup until the first query is evaluated will be slightly penalized. No engine spends more than a few seconds on startup and almost all runs took multiple minutes or even hours so the penalty is insignificant. This also means that any adaptation by the engine to the data in Wikidata or normal queries is considered to be part of the benchmark timing.

Then the multiple queries in each benchmark run are evaluated in succession, with no attempt to clear any cached information between queries. The input and output formats were the same for each engine. The benchmark runs, with the exception of the Scholia benchmark, had hundreds of queries. This much better simulates the situation with a query service than attempting to remove caches.

The controlling program is run on the same computer as the engine. It generally took minimal resources, except when the queries return very large answer sets and receiving the answer set takes some resources on the computer. The processing power required for this does not impact the benchmarking as there are always many threads unused. The memory taken to store the result does have some impact, competing for main memory with the system disk cache. Running the controlling program on the same computer as the engines, however, eliminates the overhead in both time and memory to send the results to a different computer. This overhead can be considerable, even when both computers are in the same local network, so running the controlling program on the same computer was deemed better.

The controlling program records the elapsed time between sending the query to the engine and receiving the answers from the engine. This includes any time to transmit the information between the controlling program and the engine, but not all engines provide internal timing information. If this time is longer than the maximum time the query evaluation is determined to have timed out. The output from the engine is checked for any reported errors. For each successful query one piece of information about the answer set is recorded. For queries with multiple or no answers the number of answers is recorded. For queries with one answer the value of the first variable in the query is recorded.
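The recording logic described above might look like the following Python sketch. It is not the author's controlling program: the function names, the `send_query` callable, and the assumption that answers arrive as parsed SPARQL 1.1 JSON results are all illustrative choices, and the first variable is taken from the result header rather than from the query text. The sketch also only classifies a timeout after the answer arrives; it does not interrupt a running query.

```python
import time


def summarize(results):
    """Reduce a parsed SPARQL JSON result to one piece of information."""
    variables = results["head"]["vars"]
    bindings = results["results"]["bindings"]
    if len(bindings) == 1 and variables:
        # Exactly one answer: keep the value bound to the first variable.
        first = bindings[0].get(variables[0])
        return ("value", first["value"] if first else None)
    # No answers or several answers: keep only the number of answers.
    return ("count", len(bindings))


def timed_run(send_query, query, max_seconds=600.0):
    """Time one query from sending it until the answers have been received."""
    start = time.monotonic()
    try:
        results = send_query(query)  # hypothetical callable that talks to the engine
    except Exception as exc:         # any reported failure counts as an error
        return {"status": "error", "elapsed": time.monotonic() - start, "info": str(exc)}
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        return {"status": "timeout", "elapsed": elapsed, "info": None}
    return {"status": "ok", "elapsed": elapsed, "info": summarize(results)}
```

Passing the engine access in as a callable keeps the timing code independent of any particular engine's HTTP interface.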

The benchmarking process lasted from late October 2024 to late March 2025. Benchmarks were run multiple times to remove problems in the early runs and as new versions of some of the engines were made available. Initial results of the benchmarks were made public at https://www.wikidata.org/wiki/Wikidata:Scaling_Wikidata/Benchmarking and newly-discovered bugs and anomalies were communicated to the teams responsible for the engine involved, resulting in new versions of both QLever and MillenniumDB being available. The results here are for the latest runs for each engine.

Each set of queries for the existing benchmarks was run three times: once as described above, once with the query modified to only return the count of the number of answers, and once with the query modified to return only distinct answers. The second run was performed to eliminate the overhead of transmitting large answer sets. The third run was performed to help see how many times Virtuoso returned incorrect answer sets for transitive path queries. The Scholia benchmark queries were only run unmodified, after the changes described above, as most of them only returned a few answers with no duplicates. In a few cases the engine terminated when evaluating a query. These cases are marked and the engine restarted with the next query.
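One way the count and distinct variations could be derived from a base query is sketched below in Python. The text does not describe how the modifications were made, so both helper functions are hypothetical; the sketch also assumes the base query has no PREFIX prologue, since a prologue cannot appear inside a subquery.

```python
import re


def count_variant(query):
    """Wrap a prologue-free SELECT query as a subquery and count its rows."""
    return f"SELECT (COUNT(*) AS ?count) WHERE {{ {{ {query} }} }}"


def distinct_variant(query):
    """Turn the leading SELECT into SELECT DISTINCT unless it already is one."""
    if re.match(r"\s*SELECT\s+DISTINCT\b", query, flags=re.IGNORECASE):
        return query
    return re.sub(r"^\s*SELECT\b", "SELECT DISTINCT", query, count=1, flags=re.IGNORECASE)


base = "SELECT ?x WHERE { ?x ?p ?o } LIMIT 10000000"
counted = count_variant(base)
distinct = distinct_variant(base)
```

Because the original LIMIT stays inside the subquery, the count variation in this sketch can never exceed the limit of the unmodified query.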

5. Results

For each engine the results of each set of queries were analyzed to compute the minimum and each quartile of the elapsed times, the mean elapsed time, the number of timeouts, the number of errors encountered, and the number of times the retained answer information diverges from a single mode for the four engines. The arithmetic mean is used to show how the queries would consume time on servers, as opposed to showing the expectations of users, for which geometric means are normally used.
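The statistics listed above can be sketched in Python as follows. The quartile method, the treatment of ties when there is no single mode, and the handling of missing answers (`None`) are assumptions, since the text does not specify them; the function names are illustrative.

```python
from collections import Counter
from statistics import mean, quantiles


def timing_stats(times_ms):
    """Minimum, quartiles, maximum, and arithmetic mean of elapsed times."""
    q1, q2, q3 = quantiles(times_ms, n=4, method="inclusive")
    return {"min": min(times_ms), "q1": q1, "q2": q2, "q3": q3,
            "max": max(times_ms), "mean": mean(times_ms)}


def divergences(answers_by_engine):
    """Count, per engine, the queries whose answer info differs from the mode.

    answers_by_engine maps an engine name to a list with one entry per query;
    None marks a query for which the engine produced no answer info.
    """
    engines = list(answers_by_engine)
    counts = dict.fromkeys(engines, 0)
    for infos in zip(*(answers_by_engine[e] for e in engines)):
        known = [info for info in infos if info is not None]
        if not known:
            continue
        ranked = Counter(known).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # assumption: a tie means there is no single mode
        mode = ranked[0][0]
        for engine, info in zip(engines, infos):
            if info is not None and info != mode:
                counts[engine] += 1
    return counts
```

For example, if three engines report 10 answers for a query and the fourth reports 11, only the fourth engine is charged with a divergence for that query.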

As well, adjusted statistics were computed, where elapsed time is capped at 60 seconds, with times at least this long counting as a timeout, and any error counted as 60 seconds. This adjusted time is computed mostly to penalize engines that had many errors, but also to more closely mirror times in current public services.

Table 1: WDQS Benchmark Statistics (elapsed times in milliseconds)

WDQS Benchmark Statistics, Unadjusted Timings
Engine        Count  min   q1   q2     q3     max    Mean  Error  Timeout  Diverge
Blazegraph      298   11   88  511   6155  600018   12560     31        1       14
MillenniumDB    298    1   31  588  24176  602338  103271      0       43        5
QLever          298    2   26  103    559  301655    4583      3        0       12
Virtuoso        298    1   68  461   3264  600754   14645     13        2       30

WDQS Benchmark Statistics, Adjusted Timings
Engine        Count  min   q1   q2     q3     max    Mean  Error  Timeout  Diverge
Blazegraph      298   17   89  520   6403   60000   10236     21       16       13
MillenniumDB    298    1   31  588  24176   60000   15482      0       64        3
QLever          298    2   26  103    559   60000    2290      1        6       12
Virtuoso        298    2   80  577   4503   60000    8506     12       13       28
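The capping rule for adjusted timings is simple enough to state as code. The following Python sketch assumes elapsed times in milliseconds, so that the 60-second cap is 60000; the function names are illustrative.

```python
CAP_MS = 60_000  # 60-second cap used for the adjusted statistics


def adjust(elapsed_ms, error=False):
    """Return (adjusted time, status) for one query result."""
    if error:
        return CAP_MS, "error"    # any error is charged the full 60 seconds
    if elapsed_ms >= CAP_MS:
        return CAP_MS, "timeout"  # times at least this long count as timeouts
    return elapsed_ms, "ok"


def adjusted_mean(results):
    """Arithmetic mean of adjusted times for (elapsed_ms, error) pairs."""
    adjusted = [adjust(elapsed, error)[0] for elapsed, error in results]
    return sum(adjusted) / len(adjusted)
```

Under this rule a query that fails after 12 milliseconds costs as much as one that runs for the full 60 seconds, which is what penalizes engines with many errors.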

The statistics for all three variations of the WDQS benchmark with both unadjusted and adjusted timings are shown in Table 1. On this benchmark QLever is significantly the fastest for all three variations, no matter whether the timings are adjusted or not. The relative difference in speed between QLever and MillenniumDB, the slowest engine, is about 25 times for unadjusted timings and about 7 times for adjusted timings. QLever never takes the full 600 seconds for any query and only takes more than 60 seconds for a few, whereas MillenniumDB times out on about 1 in 7 queries.

Blazegraph has quite a few errors on this benchmark, mostly due to running out of memory. Virtuoso has a few errors, mostly due to refusal to evaluate the query because of high estimated times, incorrect syntax processing, or issues with transitive paths. The QLever errors are due to running out of memory. Each engine diverges from a common mode in a few cases. Virtuoso diverges the most, mostly due to invalid duplicates from transitive paths. Most of the divergences for MillenniumDB appear to be from a bug in embedded query processing. The divergences for Blazegraph appear to be mostly from the Blazegraph loading process removing some triples related to Wikidata labels. Some divergences, and most of the QLever divergences, are due to extra processing of numeric and GeoSPARQL values.

6. Summarization and Analysis

These statistics were further processed to produce summaries, removing some of the statistical information to show the combined performance of each engine on the benchmarks, with the five components of WDBench shown separately. This allows the timings and issues for each engine to be shown in a smaller format. For the existing benchmarks six blocks of information are generated: adjusted and unadjusted timings for each way the benchmarks have been run. This information is shown in Tables 2 and 3. For the Scholia benchmark again both unadjusted and adjusted information is shown in Table 4, but only for some of the query classes.

Existing Benchmarks

The summaries for the existing benchmarks show that all the engines have divergences from a single mode, likely indicating deviations from the SPARQL standard. Penalizing for divergences was not done, because it was not always certain that divergences are incorrect answers. The detailed results were examined and some queries run outside the benchmarking process to determine some reasons for these divergences.

The large number of divergences for Virtuoso is mostly due to two known issues. Virtuoso returns duplicates from transitive path matching where the standard requires no duplicates. Virtuoso also silently produces at most 1048576 answers for any query. In the WDBench benchmarks this produces over one thousand divergences, with the second cause producing over 60% of the divergences, as shown by the statistics for when only counts are returned. This large number of divergences should be taken into account when considering Virtuoso.

The divergences for MillenniumDB appear to be mostly due to not returning duplicates for alternatives in property paths. Other divergences for MillenniumDB appear to be from a bug in embedded query processing.

Many divergences for Blazegraph are from the Blazegraph loading process removing some triples related to Wikidata labels, and are thus not a problem with Blazegraph itself. Other divergences for Blazegraph come from an incorrect ordering of DISTINCT and LIMIT processing.

QLever and possibly other engines transform numeric RDF literals into internal data, which does not conform to the RDF and SPARQL standards. For example, "1"^^xsd:integer and "01"^^xsd:integer incorrectly become the same RDF node. This causes the majority of the divergences for QLever. GeoSPARQL datatypes were also a source of divergences.
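The effect of storing numeric literals by value can be illustrated with a small Python sketch. It does not reflect QLever's actual internals: the `Literal` type and the `canonicalize` function are hypothetical stand-ins for an engine that keeps integers as numbers, whereas RDF term equality compares the lexical form and datatype as written.

```python
from typing import NamedTuple

XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


class Literal(NamedTuple):
    lexical: str   # lexical form as written in the data, e.g. "01"
    datatype: str  # datatype IRI


def canonicalize(lit):
    """Store integers by value, discarding the original lexical form."""
    if lit.datatype == XSD_INTEGER:
        return Literal(str(int(lit.lexical)), lit.datatype)
    return lit


one = Literal("1", XSD_INTEGER)
padded = Literal("01", XSD_INTEGER)

# As RDF terms the two literals differ, so DISTINCT should keep both.
distinct_terms = len({one, padded})
# After value-based storage they collapse into a single node.
distinct_after = len({canonicalize(one), canonicalize(padded)})
```

A query whose answers include both literals would thus return one fewer distinct answer on an engine that stores integers by value, which shows up as a divergence in the counts.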

The summaries for the existing benchmarks also show a considerable number of errors, so penalizing the engines for errors is appropriate.

Most of the errors for QLever result from running out of memory. QLever query processing appears to trade off space for time, and QLever can request large amounts of memory for queries, thus running out of space. Optional clauses and requiring results to be distinct appear to affect this tradeoff, so much that QLever runs out of memory very often when either of these constructs is present, and the resultant penalty in the adjusted timings is significant.

Blazegraph also often runs out of memory, with a significant resultant penalty. Blazegraph also regularly reports errors in access to its internal data structures.

Virtuoso first estimates the time it would take to evaluate a query and refuses to run the query if this estimate is out of bounds. Unfortunately, the query estimator regularly produces unbelievable estimates, resulting in Virtuoso frequently refusing to run a query. The penalty for these errors is significant.

MillenniumDB and QLever are not complete implementations of SPARQL and a few queries contain constructs that they do not handle. Virtuoso also reports a few queries that it cannot handle, mostly relating to transitive property paths. MillenniumDB has very few errors overall. This does need to be balanced against the large number of timeouts for MillenniumDB.

The timings show QLever as the fastest engine for the existing benchmarks, except for the versions with distinct results where Virtuoso is fastest. Otherwise Virtuoso is the second-fastest, but this needs to be balanced with the large number of divergences for Virtuoso.

In unadjusted timings, MillenniumDB is the slowest overall. MillenniumDB is fast when there are simple queries or limited answers but is very slow when there are complex queries (WDBench others and WDQS). It thus appears that MillenniumDB is speedy on atomic operations but does not do a good job of producing good query plans for complex queries. When timings are adjusted to account for errors, Blazegraph is the slowest by a large margin over QLever and Virtuoso.

[Tables 2 and 3: Summaries for the existing benchmarks for Blazegraph, MillenniumDB, QLever, and Virtuoso, with unadjusted and adjusted timings for each way the benchmarks were run]

[Table 4: Summaries for the Scholia benchmark for Blazegraph, MillenniumDB, QLever, and Virtuoso, with unadjusted and adjusted timings, for some of the query classes]

Scholia Benchmark

The Scholia benchmark also shows the need to adjust timings to account for errors. In the unadjusted timings Virtuoso is the fastest, but it has the most errors. When timings are adjusted, Virtuoso sinks to third and QLever is fastest by a ratio of about two-thirds over Blazegraph. MillenniumDB is the slowest on this benchmark, timing out on many of the queries, and is about 2.7 times slower than QLever. The slowness of MillenniumDB is likely due to the complex queries in the benchmark.

Almost all of the 145 errors for Blazegraph in the Scholia benchmark are due to running out of memory. MillenniumDB produces no output for its 45 errors so the cause cannot be determined, but it is likely that the cause for most of them is unrecognized answers from service calls. Close to half of the 93 errors for QLever are due to running out of memory, with most of the rest due to unrecognized answers from service calls and a few due to unimplemented syntax. Of the 221 errors for Virtuoso, over half are due to unimplemented syntax and most of the rest due to high estimated execution times with most of these estimated times being excessive or abnormal.

There are some divergences in the answers from the engines. As before, Virtuoso has the most divergences, with Blazegraph having the fewest. The reason for most of these divergences is unknown due to the complex nature of the queries. Some divergences appear to be due to the reasons identified above.

7. Summary and Recommendation

QLever is the fastest engine overall, but is slower for distinct answers. Virtuoso is fast but diverges the most by far, mostly due to several known causes. MillenniumDB and Blazegraph are the slowest. MillenniumDB is fast on simple queries, but slow on complex queries.

None of the engines are free of errors or divergences, even Blazegraph. That Blazegraph has divergences is a bit surprising because Blazegraph was in use for the official Wikidata Query Service while it was still being maintained. Both QLever and MillenniumDB are under active development, which should improve their performance and reduce their errors and divergences.

From the results in these benchmarks, a Wikidata Query Service based on QLever would be significantly faster and produce more answers and fewer errors than one based on Blazegraph. QLever now appears to be a viable replacement for Blazegraph in the official Wikidata Query Service as it has recently been extended to allow its RDF graph to be updated while it is running.

Declaration on Generative AI

The author has not employed any Generative AI tools in the work reported on in this paper nor in the preparation of this paper.

Acknowledgments:

This work was partly supported by a grant from Wikimedia Switzerland.

References

[1] D. Vrandečić, M. Krötzsch, Wikidata: A free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.
[2] Wikidata, Wikidata main page, https://www.wikidata.org/wiki/Wikidata:Main_Page, 2025. Accessed 30 April 2025.
[3] Wikimedia Deutschland, Wikibase, https://wikiba.se/, 2025. Accessed 30 April 2025.
[4] SPARQL, SPARQL 1.1 query language, W3C Recommendation, https://www.w3.org/TR/sparql11-query/, 2013.
[5] R. Cyganiak, D. Wood, M. Lanthaler, RDF 1.1 concepts and abstract syntax, W3C Recommendation, https://www.w3.org/TR/rdf11-concepts/, 2014.
[6] Wikidata:RDF, Wikidata:RDF, https://www.wikidata.org/wiki/Wikidata:RDF, 2025. Accessed 30 April 2025.
[7] Blazegraph, Welcome to Blazegraph, blazegraph.com, 2020. Accessed 23 July 2024.
[8] H. Bast, B. Buchhold, QLever: A query engine for efficient SPARQL+text search, in: CIKM ’17: ACM Conference on Information and Knowledge Management, 2017.
[9] G. Lederrey, L. Pintscher, D. Causse, Wikidata query service: Where are we? Where is it going?, Data Reuse Days 2025, https://docs.google.com/presentation/d/1DHxnjkZKwly9AKONOJtvfTk6ls6DBw1Ab6gHdODM5XA, 2025.
[10] Wikidata SPARQL Query Service Backend Update, Wikidata SPARQL query service backend update, https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/WDQS_backend_update, 2025. Accessed 30 April 2025.
[11] D. Vrgoč, C. Rojas, R. Angles, M. Arenas, D. Arroyuelo, C. Buil-Aranda, A. Hogan, G. Navarro, C. Riveros, J. Romero, MillenniumDB: An open-source graph database system, Data Intelligence 5 (2023).
[12] H. Bast, QLever performance evaluation and comparison to other SPARQL engines, https://github.com/ad-freiburg/qlever/wiki/QLever-performance-evaluation-and-comparison-to-other-SPARQL-engines, 2025. Accessed 8 May 2025.
[13] Virtuoso, Virtuoso open-source edition, https://vos.openlinksw.com/owiki/wiki/VOS, 2024. Accessed 30 April 2025.
[14] F. Å. Nielsen, D. Mietchen, E. Willighagen, Scholia and scientometrics with Wikidata, in: Scientometrics 2017, 2017, pp. 237–259. URL: https://arxiv.org/pdf/1703.04222.
[15] A. N. Lam, B. Elvesaeter, F. Martin-Recuerda, in: The Semantic Web: 20th International Conference, ESWC 2023, 2023, pp. 679–696. doi:http://dx.doi.org/10.1007/978-3-031-33455-9_40.
[16] RDF 1.1 Turtle, RDF 1.1 Turtle, W3C Recommendation, https://www.w3.org/TR/turtle/, 2014.
[17] A. Hogan, C. Riveros, C. Rojas, A. Soto, A worst-case optimal join algorithm for SPARQL, in: Proceedings of the 18th International Semantic Web Conference (ISWC), 2019.
[18] R. Angles, C. B. Aranda, A. Hogan, C. Rojas, D. Vrgoč, WDBench: A Wikidata graph query benchmark, in: U. Sattler, A. Hogan, M. Keet, V. Presutti, J. P. A. Almeida, H. Takeda, P. Monnin, G. Pirrò, C. d’Amato (Eds.), The Semantic Web – ISWC 2022, Springer, 2022, pp. 714–731.
[19] S. Malyshev, M. Krötzsch, L. González, J. Gonsior, A. Bielefeldt, Getting the most out of Wikidata: Semantic technology usage in Wikipedia’s knowledge graph, in: D. Vrandečić, K. Bontcheva, M. C. Suárez-Figueroa, V. Presutti, I. Celino, M. Sabou, L.-A. Kafee, E. Simperl (Eds.), Proceedings of the 17th International Semantic Web Conference (ISWC’18), Springer, 2018, pp. 376–394.